Ingesting Data from External Sources
Explore the tools that Hazelcast offers for data ingestion from both on-premise systems and cloud deployments.
What is Data Ingestion
Data ingestion is the process of importing data from external systems such as databases, files, or messaging systems.
With Hazelcast, you can ingest data from on-premise systems as well as cloud deployments and in a variety of data formats.
Available Tools
Hazelcast offers the following tools for ingesting data, depending on your use case:
- Pipeline: Import data from external sources using their native protocols and use Hazelcast data structures as a sink. You can also use the `SINK INTO` or `INSERT INTO` SQL statements to ingest data straight into Hazelcast maps.
- MapLoader/MapStore: Import a subset of a larger dataset in memory with the option to later write it back to the data source and keep it synchronized.
| | Pipeline | MapLoader/MapStore |
|---|---|---|
| Can import data from multiple data sources | Yes | No |
| Can export data into multiple data sinks | Yes | No |
| Ingestion can be canceled or restarted | Yes | No |
| Supported Hazelcast data structures | | |
| Can keep data synchronized with the data source | No | Yes |
| Supports streaming real-time data | Yes | No |
| Connectivity | Out-of-the-box connectors or build your own connector | Java interface for building your own custom connector |
| Supported format of data sources | Any data format that is supported by the available connectors | Key-value pairs, or data formats that you map to key-value pairs using custom Java code |
| Method for pre-processing or enriching data | Out-of-the-box APIs such as | Java interface for writing your own custom implementations |
When to Use a Pipeline
This section discusses the scenarios in which you might want to use a pipeline to ingest data into Hazelcast.
Out-Of-The-Box Connectivity
You can compose pipelines from the provided building blocks using either SQL or the Java SDK. Hazelcast comes with many out-of-the-box connectors, including:
- Apache Kafka
- Amazon Kinesis
- Amazon S3
- Azure Blob Storage
- Filesystem
- Google Cloud Storage
- HDFS
- JMS
- JDBC data sources
See the full list of available connectors. If a connector is not available for your data source or data sink, you can write your own.
Composable ETL
With pipelines, you can extract, transform, load, and combine data from multiple data sources without the need for third-party tools or middleware. Hazelcast executes pipelines in a robust and highly performant manner.
For examples of how to use ETL pipelines, see Extract Transform Load (ETL).
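To make the extract, transform, combine, and load steps concrete, here is a minimal sketch in plain Java. It is not Hazelcast's Pipeline API: the class `EtlSketch`, the `orderLines` and `customers` inputs, and the `TreeMap` used as a sink are all hypothetical stand-ins for external data sources and a Hazelcast map.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java stand-in for an extract-transform-load flow; not Hazelcast's Pipeline API.
class EtlSketch {

    // Extract: orderLines ("orderId,customerId") and customers are hypothetical inputs
    // standing in for two external data sources.
    static Map<Integer, String> run(List<String> orderLines, Map<Integer, String> customers) {
        Map<Integer, String> sink = new TreeMap<>(); // stands in for a map used as a sink
        for (String line : orderLines) {
            // Transform: parse the raw line into typed fields.
            String[] parts = line.split(",");
            int orderId = Integer.parseInt(parts[0].trim());
            int customerId = Integer.parseInt(parts[1].trim());
            // Combine: enrich the order with data from the second source.
            String customerName = customers.getOrDefault(customerId, "unknown");
            // Load: write the resulting key-value pair into the sink.
            sink.put(orderId, customerName);
        }
        return sink;
    }

    public static void main(String[] args) {
        Map<Integer, String> result = run(
                List.of("100, 1", "101, 2", "102, 9"),
                Map.of(1, "Ada", 2, "Grace"));
        System.out.println(result);
    }
}
```

In a real pipeline, the loop body would be expressed as pipeline stages and the inputs would come from connectors, so the same four steps run distributed across the cluster rather than in a single method.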
Multiple Data Sources
If your data is stored in one or multiple data sources that are often updated, it’s best to stream that data into Hazelcast using a pipeline. This way, you can always be sure that you’re processing recent data.
When to Use a MapLoader/MapStore
This section discusses the scenarios in which you might want to use a MapStore to ingest data into Hazelcast.
Read and Write-Through Caching
If your data is stored in a data source that is too slow to query directly, it’s best to cache that data in Hazelcast. Such a data source is usually a database or another kind of key-value storage. MapLoader and MapStore are tools for keeping a Hazelcast cache in sync with the data source.
With MapLoader/MapStore, you can do the following:
- Fetch missing records from the data source in reaction to cache misses
- Push cache changes back to the original data source
- Hydrate the cache upon startup to prevent poor performance caused by many cache misses
- Pre-process or enrich data in real time before ingesting it by writing custom Java code
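The first three capabilities listed above (loading on a cache miss, writing changes back, and hydrating at startup) can be sketched in plain Java. This is an illustration of the read-through and write-through pattern only, under stated assumptions: `BackingStore`, `ReadThroughCache`, and `InMemoryStore` are hypothetical names, and `BackingStore` is a simplified stand-in for Hazelcast's MapLoader/MapStore interfaces, not their actual definition.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified stand-in for a MapLoader/MapStore-style adapter.
interface BackingStore<K, V> {
    V load(K key);               // fetch one record, used on a cache miss
    Set<K> loadAllKeys();        // keys to pre-load when the cache starts
    void store(K key, V value);  // push a change back to the data source
}

// Minimal read-through / write-through cache over a BackingStore.
class ReadThroughCache<K, V> {
    private final Map<K, V> cache = new HashMap<>();
    private final BackingStore<K, V> store;
    int misses = 0; // counts lookups that had to go to the data source

    ReadThroughCache(BackingStore<K, V> store) {
        this.store = store;
    }

    // Hydrate the cache up front so early reads do not all miss.
    void hydrate() {
        for (K key : store.loadAllKeys()) {
            cache.put(key, store.load(key));
        }
    }

    // Read-through: on a miss, load from the data source and remember the result.
    V get(K key) {
        V value = cache.get(key);
        if (value == null) {
            misses++;
            value = store.load(key);
            if (value != null) {
                cache.put(key, value);
            }
        }
        return value;
    }

    // Write-through: update the data source first, then the cache.
    void put(K key, V value) {
        store.store(key, value);
        cache.put(key, value);
    }
}

// In-memory map pretending to be a slow external database.
class InMemoryStore implements BackingStore<String, String> {
    final Map<String, String> rows = new HashMap<>();

    public String load(String key) { return rows.get(key); }
    public Set<String> loadAllKeys() { return rows.keySet(); }
    public void store(String key, String value) { rows.put(key, value); }
}

class ReadThroughCacheDemo {
    public static void main(String[] args) {
        InMemoryStore db = new InMemoryStore();
        db.rows.put("user:1", "Ada");

        ReadThroughCache<String, String> cache = new ReadThroughCache<>(db);
        System.out.println(cache.get("user:1")); // first read goes to the data source
        System.out.println(cache.get("user:1")); // second read is served from the cache
        cache.put("user:2", "Grace");            // written to both cache and data source
        System.out.println(db.rows.get("user:2"));
    }
}
```

With Hazelcast, the cache itself is a distributed map and you implement only the adapter side; pre-processing or enrichment would go inside the adapter's load logic.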
Related Resources
To learn more about pipelines, see the following resources:
To learn more about MapLoader/MapStore, see the following resources:
- MongoDB example (community)