
Handling Member Failures with Persistence Enabled

Cluster members that have persisted data on disk handle member failures differently, depending on whether members fail during a cluster restart or if a single member leaves the cluster. You can configure how your clusters handle each scenario.

Glossary

  • data load timeout: Number of seconds that the cluster allows for members to finish restoring data from their local persistence store. See the data-load-timeout-seconds configuration option.

  • master member: The oldest member in a cluster.

  • partition table: Table that stores members' partition replica assignments and the partition table version.

  • persistence store: The files that contain persisted data.

  • validation timeout: Number of seconds that the cluster allows for members to rejoin and send their partition table to the master member. See the validation-timeout-seconds configuration option.
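
Both timeouts are set in the persistence configuration. The following YAML is a minimal sketch; the values shown match those mentioned later in this section (120 seconds for validation, 900 seconds for data load):

hazelcast:
  persistence:
    enabled: true
    # Time allowed for rejoining members to send their partition tables to the master member
    validation-timeout-seconds: 120
    # Time allowed for members to finish restoring data from their local persistence stores
    data-load-timeout-seconds: 900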

What Happens When a Single Member Leaves the Cluster

By default, if a cluster detects that a missing member is restarting and attempting to rejoin, the cluster’s master member will ask the rejoining member to send its partition table for validation.

The rejoining member loads its persisted data from disk and rejoins the cluster only if the following conditions are met:

  • The master member receives the partition table within 120 seconds (the validation timeout).

  • The master member validates that the partition table was correct at the time that the rejoining member left the cluster.

  • The rejoining member loads its persisted data from disk within 900 seconds (the data load timeout).

If these conditions are not met, the rejoining member deletes its persistence store, generates a new UUID, and rejoins the cluster. The rest of the cluster then tries to recover the missing member’s data from backups and repartitions it.

Flowchart that shows the process a cluster follows when a single member fails and leaves the cluster

If you do not want rejoining members to automatically delete their persistence stores when their partition tables are invalid, set the auto-remove-stale-data option to false.
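
A minimal YAML sketch, assuming the option sits under the persistence block like the other persistence options in this section:

hazelcast:
  persistence:
    enabled: true
    # Keep the persistence store even if the rejoining member's partition table is found to be stale
    auto-remove-stale-data: false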

The master member considers a rejoining member’s data stale only in exceptional circumstances. In most cases, the member reloads its data from disk and then synchronizes it with other cluster members as needed.

If you have lots of persisted data and you are concerned about how long it may take for your cluster to repartition after a member fails to rejoin, you can delay repartitioning.

Do not configure a delay if your cluster also stores in-memory data that is not persisted. Clusters do not repartition their data if a member rejoins within the delay. As a result, any data that is not persisted will be lost if the member restarts within the delay, including backups.

What Happens When the Whole Cluster Restarts

By default, a restarting cluster waits until the following conditions are met:

  • The master member receives all members' partition tables within the validation timeout.

  • The master member validates that all members' partition tables are correct.

  • All members load their persisted data from disk within the data load timeout.

The cluster fails to start if any of these conditions are not met.

Flowchart that shows the process a cluster follows when one or more members fail to start after a full cluster restart

To handle cases where the master member rejects a rejoining member because of invalid partitions, you can use force start to delete all data from every member’s persistence store and restart the cluster.

To allow the cluster to restart without all cluster members, you can configure the cluster to use partial start.

If you have lots of persisted data and you are concerned about how long it may take for your cluster to repartition after a member fails to rejoin, you can delay repartitioning.

Do not configure a delay if your cluster also stores in-memory data that is not persisted. Clusters do not repartition their data if a member rejoins within the delay. As a result, any data that is not persisted will be lost if the member restarts within the delay, including backups.

Force Start

Force start is a procedure where all cluster members delete their persistence stores and generate new UUIDs. By default, clusters are configured to wait until all members have restarted after a whole cluster restart. However, during a cluster restart, some members may crash permanently and be unable to recover from the failure. In that case, the cluster cannot start until the running members delete their persistence stores and generate new UUIDs.

Force start deletes all data in your cluster members' persistence stores.

The following is a valid scenario for using force start:

  • You have a cluster consisting of members A and B, which is initially stable.

  • The cluster gracefully shuts down.

  • Member A restarts, but member B does not.

  • Member A waits for member B to join, which never happens.

  • You now have the option to force start the cluster without member B.

  • The cluster discards all data and starts empty.

To trigger a force start, use one of the following options:

To allow a cluster to restart without all members, see partial start.

Partial Start

Partial start is a procedure where a cluster starts without all of its members. Data belonging to the missing members is assumed to be lost, and Hazelcast tries to recover it from the restored backups. For example, if you have configured a minimum of two backups for all maps, losing at most two members is safe against data loss. If more than two members are missing, or if maps have fewer than two backups, data loss is expected.
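
The number of backups is a standard map setting, not specific to persistence. A minimal YAML sketch:

hazelcast:
  map:
    default:
      # Two synchronous backup copies: up to two members can be lost without losing map data
      backup-count: 2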

To enable partial start, configure one of the following restart strategies in the cluster-data-recovery-policy option (a configuration sketch follows the list):

  • PARTIAL_RECOVERY_MOST_RECENT: Starts the cluster with the members that have the most up-to-date partition table and successfully loaded their persisted data. All other members leave the cluster and force start themselves. If no members load their persisted data, the cluster start fails.

  • PARTIAL_RECOVERY_MOST_COMPLETE: Starts the cluster with the largest group of members that have the same partition table version and successfully loaded their persisted data. All other members leave the cluster and force start themselves. If no members load their persisted data, the cluster start fails.
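
A minimal YAML sketch that enables partial start with the most-recent policy, assuming the cluster-data-recovery-policy option sits under the persistence block like the other persistence options in this section:

hazelcast:
  persistence:
    enabled: true
    # Start with the members that have the most up-to-date partition table;
    # use PARTIAL_RECOVERY_MOST_COMPLETE to prefer the largest group with the same partition table version
    cluster-data-recovery-policy: PARTIAL_RECOVERY_MOST_RECENT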

When partial start is enabled and some members are unable to restart successfully, a partial start can be performed automatically by Hazelcast or triggered manually.

After the validation timeout has expired, Hazelcast automatically performs a partial start with the members that have either the most recent or the most complete partition table (depending on the configured policy) and that loaded their persisted data within the data load timeout.

To trigger a manual partial start, use one of the following options before the validation-timeout-seconds duration expires:

Delaying Repartitioning

You can make a cluster wait for a period of time before repartitioning after one or more members fail to rejoin. When a cluster stores lots of persisted data, it may take a long time to repartition the data after a member leaves the cluster. But, you may expect members to shut down and restart quickly, in which case the cluster doesn’t need to repartition the data as soon as a member leaves. You can delay repartitioning for as long as you expect members to rejoin the cluster.

For example, you may want to delay repartitioning for the following scenarios:

  • You’re carrying out a planned shutdown.

  • You’re running a cluster on Kubernetes and expect members to be restarted quickly.

To delay repartitioning during a single member failure, configure a rebalance delay, using the rebalance-delay-seconds option.

If your cluster also stores in-memory data that is not persisted, do not configure a rebalance delay. Clusters do not repartition in-memory data if a member rejoins within the delay. As a result, any data that is not persisted will be lost if the member restarts within the delay, including backups.

XML:

<hazelcast>
  <persistence enabled="true">
    <rebalance-delay-seconds>240</rebalance-delay-seconds>
  </persistence>
</hazelcast>

YAML:

hazelcast:
  persistence:
    enabled: true
    rebalance-delay-seconds: 240

Java:

Config config = new Config();

PersistenceConfig persistenceConfig = new PersistenceConfig()
    .setEnabled(true)
    .setRebalanceDelaySeconds(240);

config.setPersistenceConfig(persistenceConfig);

Consider the following scenario:

  • A cluster consists of members A, B, and C with persistence enabled.

  • Member B is killed.

  • Member B restarts.

If member B restarts within the rebalance delay, all its persisted data will be restored from disk, and the cluster will not repartition its data. Any in-memory data in member B’s partitions will be lost, and member B will still be listed as the owner of those partitions. So, even if the cluster has backups of in-memory data in maps, requests for that data will go to member B (unless the members have backup reads enabled).

If members have backup reads enabled, some in-memory data may appear to have been retained. However, the backups will eventually be synchronized with the primary replica on member B.
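
Backup reads are enabled per data structure. A minimal YAML sketch using the standard read-backup-data map option (a general map setting, not specific to persistence):

hazelcast:
  map:
    default:
      # Allow members to serve reads from their local backup replicas
      read-backup-data: true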

While the member is down, operations on partitions that are owned by that member will be retried until they either time out or the member restarts and executes the requested operation. As a result, this option is best when you prefer a latency spike to migrating data over the network.

If member B does not restart within the rebalance delay, the cluster recovers member B’s data from the backups and repartitions the data among the remaining members (members A and C in this case). If member B is later restarted, it recovers its persisted data from disk and brings it up to date with data from members A and C. If Merkle trees are enabled on the relevant data structures, members use them to request only the missing persisted data. For details about how members use Merkle trees, see Synchronizing Persisted Data Faster.

Synchronizing Persisted Data Faster

When a failed member rejoins the cluster, it populates its in-memory stores with data from disk that may be stale. If you have lots of persisted data as well as in-memory data that you don’t want to lose, you can configure your data structures to generate a Merkle tree. The Merkle tree stores the state of persisted data in a way that other cluster members can quickly read, compare with their own, and check the delta for what is missing. This way, after a restart, the member can send its Merkle tree to the cluster and request only the missing data, reducing the amount of data sent over the network.

On map and JCache data structures, you can configure the following options to enable the Merkle tree.

merkle-tree.enabled

Description: Whether a Merkle tree is generated for the data structure.

Default: enabled

Example:

XML:

<hazelcast>
  <map name="test-map">
    <data-persistence enabled="true"/>
    <merkle-tree enabled="true"/>
  </map>
</hazelcast>

YAML:

hazelcast:
  map:
    test-map:
      data-persistence:
        enabled: true
      merkle-tree:
        enabled: true

Java:

Config config = new Config();

MapConfig mapConfig = config.getMapConfig("test-map");
mapConfig.getDataPersistenceConfig().setEnabled(true);
mapConfig.getMerkleTreeConfig().setEnabled(true);

config.addMapConfig(mapConfig);

merkle-tree.depth

Description: The depth of the Merkle tree. The deeper the tree, the more accurate the difference detection, but the more space is needed to store the Merkle tree in memory.

Default: 10

Example:

XML:

<hazelcast>
  <map name="test-map">
    <data-persistence enabled="true"/>
    <merkle-tree enabled="true">
      <depth>12</depth>
    </merkle-tree>
  </map>
</hazelcast>

YAML:

hazelcast:
  map:
    test-map:
      data-persistence:
        enabled: true
      merkle-tree:
        enabled: true
        depth: 12

Java:

Config config = new Config();

MapConfig mapConfig = config.getMapConfig("test-map");
mapConfig.getDataPersistenceConfig().setEnabled(true);
mapConfig.getMerkleTreeConfig().setEnabled(true);
mapConfig.getMerkleTreeConfig().setDepth(12);

config.addMapConfig(mapConfig);