About Data Pipelines

Learn about the core concepts of data pipelines and how you can build your own with the Hazelcast Jet engine.

What is a Data Pipeline?

A data pipeline (also called a pipeline) is a series of steps for processing data, consisting of three elements:

  • One or more sources: Where you take your data from.

  • Processing stages: What you do to your data.

  • At least one sink: Where you send the results of the last processing stage.

Figure: A pipeline flowing from a source to a sink.

Pipelines allow you to process data that’s stored in one location and send the result to another, such as from a data lake to an analytics database or into a payment processing system. You can also use the same source and sink, so that the pipeline only processes the data in place.

Depending on the data source, pipelines can be used for the following use cases:

  • Stream processing: Processes an endless stream of data such as events to deliver results as the data is generated.

  • Batch processing: Processes a finite amount of static data for repetitive tasks such as daily reports.
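
The difference is easiest to see in code. Below is a minimal sketch using the Jet API's built-in test sources (the item values and event rate are illustrative): a batch pipeline reads a finite list, while a streaming pipeline reads an endless stream of generated events.

    import com.hazelcast.jet.pipeline.Pipeline;
    import com.hazelcast.jet.pipeline.Sinks;
    import com.hazelcast.jet.pipeline.test.TestSources;

    public class BatchVsStream {
        public static void main(String[] args) {
            // Batch processing: a finite, static data set.
            Pipeline batch = Pipeline.create();
            batch.readFrom(TestSources.items(1, 2, 3, 4))
                 .writeTo(Sinks.logger());

            // Stream processing: an endless source that keeps emitting
            // test events (about 10 per second) until the job is cancelled.
            Pipeline stream = Pipeline.create();
            stream.readFrom(TestSources.itemStream(10))
                  .withoutTimestamps()
                  .writeTo(Sinks.logger());
        }
    }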

What is the Jet Engine?

The Jet engine is a batch and stream processing system that allows Hazelcast members to do both stateless and stateful computations over large amounts of data with consistent low latency.

Beyond that, you can build a multistage cascade of groupBy operations, process infinite out-of-order data streams using event time-based windows, fork your data stream to reuse the same intermediate result in more than one way, and build a pipeline where an I/O task distributes the processing of the data it reads across all CPU cores. In short, you get all the advantages that a full-blown DAG computation engine offers.
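
For example, the hedged sketch below counts events from a test stream in one-second tumbling windows. It assigns ingestion timestamps for simplicity, although the same stage also supports true event time.

    import com.hazelcast.jet.aggregate.AggregateOperations;
    import com.hazelcast.jet.pipeline.Pipeline;
    import com.hazelcast.jet.pipeline.Sinks;
    import com.hazelcast.jet.pipeline.WindowDefinition;
    import com.hazelcast.jet.pipeline.test.TestSources;

    public class WindowedCount {
        public static void main(String[] args) {
            Pipeline p = Pipeline.create();
            p.readFrom(TestSources.itemStream(100))      // ~100 test events per second
             .withIngestionTimestamps()                  // assign timestamps on ingestion
             .window(WindowDefinition.tumbling(1_000))   // 1-second tumbling windows
             .aggregate(AggregateOperations.counting())  // count the events in each window
             .writeTo(Sinks.logger());
        }
    }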

Pipeline Workflow

A pipeline is a reusable object that can be passed around and submitted to the cluster several times.

The general workflow of a data pipeline includes the following steps:

  1. Read data from sources.

  2. Send data to at least one sink.

    A pipeline without any sinks is not valid.

To process or enrich data in between reading it from a source and sending it to a sink, you can use transforms.
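
Put together, a minimal pipeline in the Jet API looks like the sketch below (the test source and the uppercase transform are illustrative): it reads from a source, transforms each item, and writes the result to a sink.

    import com.hazelcast.jet.pipeline.Pipeline;
    import com.hazelcast.jet.pipeline.Sinks;
    import com.hazelcast.jet.pipeline.test.TestSources;

    public class PipelineWorkflow {
        public static void main(String[] args) {
            Pipeline p = Pipeline.create();
            p.readFrom(TestSources.items("a", "b", "c"))  // 1. read data from a source
             .map(String::toUpperCase)                    //    transform it in between
             .writeTo(Sinks.logger());                    // 2. send the result to a sink
        }
    }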

Extracting and Ingesting Data

Hazelcast provides a variety of connectors for streaming data into pipelines and storing the results in sinks, such as Hazelcast data structures, Java Message Service (JMS), JDBC systems, Apache Kafka, the Hadoop Distributed File System (HDFS), and TCP sockets.

If a connector doesn’t already exist for a source or sink, Hazelcast provides a convenience API so you can easily build a custom one.

For details, see Sources and Sinks.
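
As an illustration, the hedged sketch below uses the built-in map connectors to copy entries from one Hazelcast map to another; the map names and entry types are placeholders.

    import com.hazelcast.jet.pipeline.Pipeline;
    import com.hazelcast.jet.pipeline.Sinks;
    import com.hazelcast.jet.pipeline.Sources;

    public class MapToMap {
        public static void main(String[] args) {
            // Read every entry from the map "input" and write it to the map "output".
            Pipeline p = Pipeline.create();
            p.readFrom(Sources.<Integer, String>map("input"))
             .writeTo(Sinks.map("output"));
        }
    }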

Running Data Pipelines in Hazelcast

Hazelcast members come with a high-performance execution engine called the Jet engine. Using this engine, pipelines can process hundreds of thousands of records per second with millisecond latencies on a single member.

To have the Jet engine run a pipeline, you must submit the pipeline to a member. At that point, the pipeline becomes a job.

When a member receives a job, it creates a computational model in the form of a Directed Acyclic Graph (DAG). Using this model, the computation is split into tasks that are distributed among all members in the cluster, making use of all available CPU cores.

This model allows Hazelcast to process data faster because the tasks can be replicated and executed in parallel across the cluster.
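
A hedged sketch of submitting a pipeline as a job, assuming a member embedded in the same JVM:

    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.jet.Job;
    import com.hazelcast.jet.pipeline.Pipeline;
    import com.hazelcast.jet.pipeline.Sinks;
    import com.hazelcast.jet.pipeline.test.TestSources;

    public class SubmitJob {
        public static void main(String[] args) {
            // Start (or join) a Hazelcast member in this JVM.
            HazelcastInstance hz = Hazelcast.bootstrappedInstance();

            Pipeline p = Pipeline.create();
            p.readFrom(TestSources.items(1, 2, 3))
             .writeTo(Sinks.logger());

            // Submitting the pipeline turns it into a job, which the Jet engine
            // plans as a DAG and executes across the cluster.
            Job job = hz.getJet().newJob(p);
            job.join(); // wait for the batch job to complete
        }
    }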

In Hazelcast, pipelines can be defined using either SQL or the Jet API.

SQL has the following limitations:

  • No window functions.

    You cannot group or aggregate results in streaming queries.

  • Fewer sources and sinks.

    You can use only a subset of the available sources and sinks in the Jet API.

  • Limited support for joins.

    If you want to use joins, the data source on the right of the join must be a map.
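
Within those limits, you can submit SQL pipelines from Java through the SQL service. The following is a minimal sketch in which the map name, key and value formats, and query are all illustrative:

    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.sql.SqlResult;
    import com.hazelcast.sql.SqlRow;

    public class SqlPipelineSketch {
        public static void main(String[] args) {
            HazelcastInstance hz = Hazelcast.bootstrappedInstance();

            // Expose a map to SQL. The map name and formats are placeholders.
            hz.getSql().execute(
                "CREATE OR REPLACE MAPPING cities TYPE IMap "
              + "OPTIONS ('keyFormat' = 'int', 'valueFormat' = 'varchar')");

            hz.getMap("cities").put(1, "Tokyo");

            // Run a simple batch query over the map.
            try (SqlResult result = hz.getSql().execute("SELECT __key, this FROM cities")) {
                for (SqlRow row : result) {
                    System.out.println(row.getObject(0) + " -> " + row.getObject(1));
                }
            }
        }
    }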

Fault-Tolerance and Scalability

To protect jobs from member failures, the Jet engine uses snapshots, which are regularly taken and saved in multiple replicas for resilience. In the event of a failure, a job is restarted from the most recent snapshot, delaying the job for only a few seconds rather than starting it again from the beginning.
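
For example, a hedged sketch of a streaming job configured with an exactly-once processing guarantee and a 10-second snapshot interval (both values are illustrative):

    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.jet.config.JobConfig;
    import com.hazelcast.jet.config.ProcessingGuarantee;
    import com.hazelcast.jet.pipeline.Pipeline;
    import com.hazelcast.jet.pipeline.Sinks;
    import com.hazelcast.jet.pipeline.test.TestSources;

    public class FaultTolerantJob {
        public static void main(String[] args) {
            HazelcastInstance hz = Hazelcast.bootstrappedInstance();

            Pipeline p = Pipeline.create();
            p.readFrom(TestSources.itemStream(50))
             .withoutTimestamps()
             .writeTo(Sinks.logger());

            // Take a snapshot every 10 seconds and restore exactly-once
            // state if a member fails.
            JobConfig config = new JobConfig()
                    .setProcessingGuarantee(ProcessingGuarantee.EXACTLY_ONCE)
                    .setSnapshotIntervalMillis(10_000);

            hz.getJet().newJob(p, config);
        }
    }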

Hazelcast clusters are also elastic, so you can increase your processing power at will and scale dynamically to handle load spikes. You can add new members to the cluster with zero downtime to linearly increase the processing throughput.

Learn More

To get started with the Jet API, see Get Started with Stream Processing (Embedded).

To get started with SQL pipelines, see Get Started with SQL Pipelines.

For more information about fault tolerance, see Fault Tolerance.

For details about saving your own snapshots, see Updating Jobs.

For more general information about data pipelines and their architectures, see our glossary.