Recovery from a Partial or Total Failure

Under normal circumstances, Hazelcast members recover from failures on their own.

However, in the rare case where Hazelcast declares a member unreachable because it fails to respond while the rest of the cluster is still running, consider the following steps for recovery:

  • Collect Hazelcast logs from all members, active and unresponsive.

  • Collect Hazelcast client logs or application logs from all clients.

  • If the cluster is running and one or more members were ejected from the cluster because they were stuck, take a heap dump and a thread dump of each stuck member.

  • After collecting all of the necessary artifacts, shut down the ailing members by invoking their shutdown hooks; see the Shutting Down the Cluster section.

  • After shutdown, start the ailing members and wait for them to join the cluster. Once they have joined successfully, Hazelcast rebalances the data across the members.
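
The dump-collection step above can be sketched with the standard JDK management beans. This is a minimal illustration, not part of Hazelcast: the class name DiagnosticsSketch, its method names, and the output file names are hypothetical, and the heap dump relies on the HotSpot-specific HotSpotDiagnosticMXBean, so it assumes a HotSpot-based JVM. It also assumes the member's JVM can still execute code. For a member that is truly stuck, the external JDK tools jstack and jmap, pointed at the member's process ID, are usually the more practical way to get the same artifacts.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: collects diagnostic artifacts from the JVM it runs in.
public class DiagnosticsSketch {

    /** Returns a textual dump of all live threads with their full stack traces. */
    public static String threadDump() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        StringBuilder out = new StringBuilder();
        // Request lock details only where the JVM supports them.
        ThreadInfo[] infos = threads.dumpAllThreads(
                threads.isObjectMonitorUsageSupported(),
                threads.isSynchronizerUsageSupported());
        for (ThreadInfo info : infos) {
            out.append('"').append(info.getThreadName()).append('"')
               .append(" state=").append(info.getThreadState()).append('\n');
            // Iterate frames manually; ThreadInfo.toString() truncates long stacks.
            for (StackTraceElement frame : info.getStackTrace()) {
                out.append("    at ").append(frame).append('\n');
            }
            out.append('\n');
        }
        return out.toString();
    }

    /** Writes the thread dump to a timestamped file in dir and returns its path. */
    public static Path writeThreadDump(Path dir) {
        try {
            Files.createDirectories(dir);
            Path file = dir.resolve("thread-dump-" + System.currentTimeMillis() + ".txt");
            Files.write(file, threadDump().getBytes(StandardCharsets.UTF_8));
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes a heap dump of live objects to dir (HotSpot JVMs only). */
    public static Path writeHeapDump(Path dir) {
        try {
            Files.createDirectories(dir);
            // The target file must not already exist, hence the timestamp.
            Path file = dir.resolve("heap-" + System.currentTimeMillis() + ".hprof");
            HotSpotDiagnosticMXBean hotspot =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            hotspot.dumpHeap(file.toString(), true); // true = dump only reachable objects
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Usage: java DiagnosticsSketch <output-dir> [heap]
    public static void main(String[] args) {
        Path dir = Paths.get(args.length > 0 ? args[0] : "diagnostics");
        System.out.println("Thread dump: " + writeThreadDump(dir));
        if (args.length > 1 && "heap".equals(args[1])) {
            System.out.println("Heap dump: " + writeHeapDump(dir));
        }
    }
}
```

Timestamped file names keep repeated dumps from overwriting each other, which helps when several thread dumps are taken a few seconds apart to see whether a thread is making progress.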