Recovery from a Partial or Total Failure
Under normal circumstances, Hazelcast members recover on their own, as in the following scenarios:
- Automatic split-brain resolution
- Hazelcast allowing stuck/unreachable members to come back within configured tolerance levels; see the Configuring for Fault Tolerance section.
However, in the rare case when Hazelcast declares a member unreachable because it fails to respond, while the rest of the cluster is still running, consider the following steps for recovery:
- Collect Hazelcast logs from all members, both active and unresponsive.
- Collect Hazelcast client logs or application logs from all clients.
- If the cluster is running and one or more members were ejected from the cluster because they were stuck, take a heap dump and a thread dump of each stuck member.
- After collecting all of the necessary artifacts, shut down the ailing members by calling their shutdown hooks; see the Shutting Down the Cluster section.
- After shutdown, restart the ailing members and wait for them to join the cluster. Once they have joined, Hazelcast rebalances the data across the members.
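The thread dump and heap dump mentioned in the steps above can be captured with standard JDK facilities. The following is a minimal sketch, not a Hazelcast API: the class name `MemberDiagnostics` and its methods are hypothetical, and the heap dump part assumes a HotSpot-based JVM. It only works if it runs inside the member's JVM and that JVM can still execute code; for a member that is completely stuck, external JDK tools such as `jstack` or `jcmd` are the usual alternative.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper for illustration only; not part of Hazelcast.
final class MemberDiagnostics {

    /** Returns a plain-text dump of every live thread in the current JVM. */
    static String threadDump() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // Request lock details only where this JVM supports them.
        ThreadInfo[] infos = threads.dumpAllThreads(
                threads.isObjectMonitorUsageSupported(),
                threads.isSynchronizerUsageSupported());
        StringBuilder out = new StringBuilder();
        for (ThreadInfo info : infos) {
            out.append('"').append(info.getThreadName()).append("\" ")
               .append(info.getThreadState()).append('\n');
            if (info.getLockName() != null) {
                // The lock this thread is blocked or waiting on, if any.
                out.append("    waiting on ").append(info.getLockName()).append('\n');
            }
            for (StackTraceElement frame : info.getStackTrace()) {
                out.append("    at ").append(frame).append('\n');
            }
            out.append('\n');
        }
        return out.toString();
    }

    /**
     * Writes a heap dump of live objects to the given file.
     * Assumes a HotSpot JVM; the file must not already exist, and
     * recent JDKs require the name to end with ".hprof".
     */
    static void heapDump(String file) throws IOException {
        HotSpotDiagnosticMXBean diagnostics =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        diagnostics.dumpHeap(file, true);
    }

    public static void main(String[] args) throws IOException {
        System.out.print(threadDump());
        if (args.length > 0) {
            // Optional: pass a target path to also write a heap dump.
            heapDump(args[0]);
        }
    }
}
```

Store the resulting files together with the member and client logs collected earlier, so that they can be examined after the member has been restarted.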