Error Handling Strategies for Jobs
Jobs may not run as expected due to I/O errors in sources and sinks, or coding and input data errors in your pipelines. To handle these errors you can use one of the following strategies.
Sources and sinks are often the points of contact with external data stores. Their various types have specific characteristics which enable error handling strategies only applicable to them.
For example, the Change Data Capture sources can attempt to reconnect automatically whenever they lose connection to the databases they monitor, so for intermittent network failures, their owner jobs don’t need to fail.
To see the failure handling options of a given source or sink, consult its Javadoc.
For in-memory data structures, a large portion of error handling, such as communications failures with the remote cluster, is handled by the Hazelcast cluster. For further details and configuration options, see Handling Failures.
The primary way of dealing with errors in pipelines is to make all jobs independent of each other. In a cluster, there is an arbitrary number of independent jobs running in parallel. Hazelcast ensures that these jobs do not interact in any way, and one’s failure does not lead to any consequences for the others.
The concern of error handling becomes: what happens to a job once it encounters a problem. By default, Hazelcast fails a job that has thrown an error, but what one can do with such a failed job afterward is the interesting part.
For many streaming jobs, specifically the ones which don’t have any processing guarantee configured, the pipeline definition and the job config are the only parts we can identify as state, and those are immutable.
One option for dealing with failure in immutable-state jobs is simply restarting them (once the cause of the failure has been addressed). Restarted streaming jobs lacking mutable state can just resume processing the input data flow from the current point in time.
Batch jobs don’t strictly fall into this immutable-state category, but the generic, reliable way of dealing with their error in Hazelcast is also restarting them from the beginning and having them completely reprocess their input.
Streaming jobs with mutable state, those with a processing guarantee set, achieve fault tolerance by periodically saving recovery snapshots. When such a job fails, not only does its execution stop but also its snapshots get deleted. This makes it impossible to resume it without loss.
A job in the suspended state has preserved its snapshot, and you can resume it without loss once you have addressed the root cause of the failure.
In the open-source version of Hazelcast, this scenario is limited to fixing the input data by some external means and then simply resuming the job.
The Enterprise version of Hazelcast has the added option of job upgrades. In that case you can:
export the latest snapshot
update the pipeline, if needed, for example, to cope with unexpected data
resubmit a new job based on the exported snapshot and the updated pipeline
One caveat of the suspend-on-failure feature is that the latest snapshot is not a "failure snapshot". Hazelcast can’t take a full snapshot right at the moment of the failure, because the Jet engine can produce accurate snapshots only when in a healthy state. Instead, Hazelcast simply keeps the latest periodic snapshot it created. Even so, the recovery procedure preserves the at-least-once guarantee.