Ingesting Data from External Sources
Explore the Hazelcast components for ingesting data from both on-premise systems and cloud deployments.
What is Data Ingestion
Data ingestion is the process of importing data from external systems, such as databases, files, or messaging systems.
With Hazelcast, you can ingest data in a variety of formats from on-premise systems as well as cloud deployments.
Available Components
Choose from the following components for ingesting data, depending on your use case:
- Pipeline: Import data from external systems using their native protocols and use Hazelcast data structures as a sink. Built-in Hazelcast connectors and pre-built Kafka Connect Source connectors make this task easier. You can also use the SINK INTO or INSERT INTO SQL statements to ingest data straight into Hazelcast maps.
- MapStore: Import a subset of a larger dataset into memory, with the option to write it back to the data source later. Writing back to the data source allows you to keep your original dataset synchronized with data changes made on the cluster.
| | Pipeline | MapStore |
|---|---|---|
| Can import data from multiple data sources | Yes | No |
| Can export data into multiple data sinks | Yes | No |
| Ingestion can be canceled or restarted | Yes | No |
| Supported Hazelcast data structures | Any data structure with an available sink connector, such as map, JCache, list, or reliable topic | Map |
| Can keep data synchronized with the data source | No | Yes |
| Supports streaming real-time data | Yes | No |
| Connectivity | Built-in connectors, pre-built Kafka Connect Source connectors, or custom connectors that you write yourself | Either use the pre-built generic MapLoader/MapStore or implement Java interfaces to build your own custom connector |
| Supported format of data sources | Any data format that is supported by the available connectors | Key-value pairs, or data formats that you map to key-value pairs using custom Java code |
| Method for pre-processing or enriching data | Out-of-the-box APIs | Java interface for writing your own custom implementations |
When to Use a Pipeline
This section discusses the scenarios in which you might want to use a pipeline to ingest data into Hazelcast.
Out-Of-The-Box Connectivity
You can compose pipelines from the provided building blocks using either SQL or the Java SDK.
Hazelcast comes with many built-in connectors, including:
- Apache Kafka
- Amazon Kinesis
- Amazon S3
- Azure Blob Storage
- Filesystem
- Google Cloud Storage
- HDFS
- JMS
- JDBC data sources
You can also use pre-built Kafka Connect Source connectors to import a stream of data into a pipeline from a non-Kafka external system. Kafka Connect Source connectors are available for many popular platforms, including Neo4j and Couchbase. There is also a Kafka Connect Source connector for integrating JDBC data sources.
See the full list of available connectors. If a connector is not available for your data source or data sink, you can write your own.
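For example, a minimal Java sketch of a pipeline that streams records from a Kafka topic into a Hazelcast map might look like the following. The broker address, topic name, and map name are placeholders, and the example assumes the Kafka connector module is on the classpath and the Jet engine is enabled on the member:

```java
import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.jet.kafka.KafkaSources;
import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.Sinks;

import java.util.Properties;

public class KafkaToMap {
    public static void main(String[] args) {
        // Kafka consumer settings; the broker address is a placeholder.
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.setProperty("auto.offset.reset", "earliest");

        // Read each Kafka record and write it into the "trades" map as a key-value entry.
        Pipeline pipeline = Pipeline.create();
        pipeline.readFrom(KafkaSources.<String, String>kafka(props, "trades"))
                .withoutTimestamps()
                .writeTo(Sinks.map("trades"));

        // Start an embedded member with the Jet engine enabled and submit the streaming job.
        Config config = new Config();
        config.getJetConfig().setEnabled(true);
        HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
        hz.getJet().newJob(pipeline);
    }
}
```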
Composable ETL
With pipelines, you can extract, transform, load, and combine data from multiple data sources without the need for third-party tools or middleware. Hazelcast executes pipelines in a robust and highly performant manner.
For examples of how to use ETL pipelines, see Extract Transform Load (ETL).
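As an illustration, the following sketch extracts rows from a hypothetical users table over JDBC, transforms them by normalizing the email address, and loads them into a map. The connection string, query, and map name are assumptions, and the appropriate JDBC driver must be on the classpath:

```java
import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.Sinks;
import com.hazelcast.jet.pipeline.Sources;

import java.util.Locale;

public class JdbcEtl {
    public static void main(String[] args) {
        Pipeline pipeline = Pipeline.create();
        pipeline
                // Extract: read each row of a hypothetical "users" table.
                .readFrom(Sources.jdbc(
                        "jdbc:postgresql://localhost:5432/app", // placeholder connection string
                        "SELECT id, email FROM users",
                        rs -> new String[]{rs.getString("id"), rs.getString("email")}))
                // Transform: normalize the email address before loading it.
                .map(row -> new String[]{row[0], row[1].toLowerCase(Locale.ROOT)})
                // Load: write each row into the "users" map, keyed by id.
                .writeTo(Sinks.map("users", row -> row[0], row -> row[1]));

        Config config = new Config();
        config.getJetConfig().setEnabled(true);
        HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
        hz.getJet().newJob(pipeline).join(); // batch job: join() waits for completion
    }
}
```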
Multiple Data Sources
If your data is stored in one or multiple data sources that are often updated, it’s best to stream that data into Hazelcast using a pipeline. This way, you can always be sure that you’re processing recent data.
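For instance, a pipeline can merge several streaming sources into a single sink. The sketch below assumes two hypothetical Kafka topics, orders and returns, and writes the combined stream into one map:

```java
import com.hazelcast.jet.kafka.KafkaSources;
import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.Sinks;
import com.hazelcast.jet.pipeline.StreamStage;

import java.util.Map;
import java.util.Properties;

public class MergeSources {
    // Combines two frequently updated topics into a single map, assuming the
    // "orders" and "returns" topics exist and kafkaProps holds consumer settings.
    static Pipeline build(Properties kafkaProps) {
        Pipeline pipeline = Pipeline.create();

        StreamStage<Map.Entry<String, String>> orders = pipeline
                .readFrom(KafkaSources.<String, String>kafka(kafkaProps, "orders"))
                .withoutTimestamps();

        StreamStage<Map.Entry<String, String>> returns = pipeline
                .readFrom(KafkaSources.<String, String>kafka(kafkaProps, "returns"))
                .withoutTimestamps();

        // Merge both streams and write every event into one map.
        orders.merge(returns).writeTo(Sinks.map("order_events"));
        return pipeline;
    }
}
```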
When to Use a MapStore
This section discusses the scenarios in which you might want to use a MapStore to ingest data into Hazelcast.
Read and Write-Through Caching
If your data is stored in a data source that is too slow to query, it’s best to cache that data in Hazelcast. Such a data source is usually a database or other key-value store. MapStore is a tool for keeping a Hazelcast cache in sync with the data source.
With MapStore, you can do the following:
- Fetch missing records from the data source in reaction to cache misses.
- Push cache changes back to the original data store.
- Hydrate the cache upon startup to prevent poor performance caused by many cache misses.
- Pre-process or enrich data in real time before ingesting it by writing custom Java code.
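As a sketch of the first two capabilities, the following hypothetical MapStore loads a missing person record from a relational table on a cache miss and writes map updates back to it. The table name, columns, and connection string are assumptions (the MERGE statement uses H2 syntax), and extending MapStoreAdapter keeps the example short by inheriting no-op defaults for the remaining methods:

```java
import com.hazelcast.map.MapStoreAdapter;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// A minimal write-through store backed by a hypothetical "person" table.
public class PersonMapStore extends MapStoreAdapter<Long, String> {

    private static final String JDBC_URL = "jdbc:h2:mem:person-db"; // hypothetical connection string

    @Override
    public String load(Long key) {
        // Called on a cache miss: fetch the missing record from the database.
        try (Connection conn = DriverManager.getConnection(JDBC_URL);
             PreparedStatement stmt = conn.prepareStatement("SELECT name FROM person WHERE id = ?")) {
            stmt.setLong(1, key);
            try (ResultSet rs = stmt.executeQuery()) {
                return rs.next() ? rs.getString(1) : null;
            }
        } catch (SQLException e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public void store(Long key, String value) {
        // Called when the map entry changes: push the change back to the database.
        try (Connection conn = DriverManager.getConnection(JDBC_URL);
             PreparedStatement stmt = conn.prepareStatement(
                     "MERGE INTO person (id, name) KEY (id) VALUES (?, ?)")) { // H2 upsert syntax
            stmt.setLong(1, key);
            stmt.setString(2, value);
            stmt.executeUpdate();
        } catch (SQLException e) {
            throw new RuntimeException(e);
        }
    }
}
```

To attach a store like this to a map, enable it in the map’s configuration; a write delay of 0 gives write-through behavior, while a positive delay gives write-behind. The map and package names below are placeholders:

```java
import com.hazelcast.config.Config;
import com.hazelcast.config.MapConfig;
import com.hazelcast.config.MapStoreConfig;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class PersonMapConfig {
    public static void main(String[] args) {
        MapStoreConfig storeConfig = new MapStoreConfig()
                .setEnabled(true)
                .setClassName("com.example.PersonMapStore") // hypothetical package
                .setWriteDelaySeconds(0); // 0 = write-through, > 0 = write-behind

        Config config = new Config();
        config.addMapConfig(new MapConfig("persons").setMapStoreConfig(storeConfig));

        HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
        // A get() on a missing key now calls PersonMapStore.load(),
        // and a put() also calls PersonMapStore.store().
        hz.getMap("persons").put(1L, "Alice");
    }
}
```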
Read-Through Caching Only
If you only need to fetch missing records from the data source in reaction to cache misses, consider using the generic MapLoader.
To ingest and cache data in Hazelcast as part of a one-time operation, use a pipeline or the SINK INTO or INSERT INTO SQL statements instead.
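For example, a one-time ingestion with SINK INTO might look like the following sketch, which creates a SQL mapping for a hypothetical cities map and upserts two rows into it from the Java SQL API:

```java
import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.sql.SqlService;

public class OneTimeSqlIngest {
    public static void main(String[] args) {
        // Embedded member with the Jet engine enabled, which the SQL engine relies on.
        Config config = new Config();
        config.getJetConfig().setEnabled(true);
        HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);

        SqlService sql = hz.getSql();

        // Declare the key and value formats so SQL can address the map.
        sql.execute("CREATE MAPPING cities TYPE IMap "
                + "OPTIONS ('keyFormat'='int', 'valueFormat'='varchar')");

        // SINK INTO upserts the rows straight into the Hazelcast map.
        sql.execute("SINK INTO cities VALUES (1, 'London'), (2, 'Ankara')");

        hz.shutdown();
    }
}
```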