Recovery from a Partial or Total Failure

Under normal circumstances, Hazelcast members recover on their own, as in the following scenarios:

Automatic split-brain resolution
Hazelcast allows stuck or unreachable members to come back within the configured tolerance levels; see the Configuring for Fault Tolerance section.

However, in the rare case where Hazelcast declares a member unreachable because it fails to respond while the rest of the cluster is still running, consider the following steps for recovery:

Collect Hazelcast logs from all members, active and unresponsive.
Collect Hazelcast client logs or application logs from all clients.
If the cluster is running and one or more members were ejected from the cluster because they were stuck, take a heap dump and a thread dump of each stuck member.
After collecting all of the necessary artifacts, shut down the ailing members by calling shutdown hooks; see the Shutting Down the Cluster section.
After shutdown, start the ailing members and wait for them to join the cluster. Once they have joined, Hazelcast rebalances the data across the members.
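
The thread-dump step above can be sketched with the JDK's standard management API. This is a minimal illustration, not a Hazelcast facility: the class name ThreadDumpCollector, its method names, and the default file name member-thread-dump.txt are assumptions made for this example. It also assumes the code runs inside the member's JVM and that the JVM can still schedule the calling thread. For a member that is truly unresponsive, an external JDK tool such as jstack or jcmd is usually the more practical route.

```java
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper (not part of Hazelcast): captures a thread dump of the
// JVM it runs in, so the dump can be collected before the member is shut down.
public class ThreadDumpCollector {

    // Builds a plain-text dump of all live threads with their full stack traces.
    public static String captureThreadDump() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // Request lock details only where the JVM supports them; asking for
        // unsupported details would throw UnsupportedOperationException.
        boolean monitors = threads.isObjectMonitorUsageSupported();
        boolean synchronizers = threads.isSynchronizerUsageSupported();

        StringBuilder dump = new StringBuilder();
        for (ThreadInfo info : threads.dumpAllThreads(monitors, synchronizers)) {
            dump.append('"').append(info.getThreadName()).append('"')
                .append(" id=").append(info.getThreadId())
                .append(" state=").append(info.getThreadState())
                .append('\n');
            // Iterate the frames manually: ThreadInfo.toString() truncates the stack.
            for (StackTraceElement frame : info.getStackTrace()) {
                dump.append("    at ").append(frame).append('\n');
            }
            dump.append('\n');
        }
        return dump.toString();
    }

    // Writes the dump to the given file and returns the path that was written.
    public static Path writeThreadDump(Path target) throws IOException {
        return Files.write(target, captureThreadDump().getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // The default file name is an assumption made for this sketch.
        Path target = Paths.get(args.length > 0 ? args[0] : "member-thread-dump.txt");
        System.out.println("Thread dump written to " + writeThreadDump(target).toAbsolutePath());
    }
}
```

Storing the resulting file alongside the member and client logs keeps all artifacts for one incident together before the ailing member is shut down and restarted.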
