Making CP Members Fault Tolerant

To use this feature, you need an Enterprise license.

CP Subsystem Persistence enables CP members to recover from crash scenarios. This capability significantly improves the overall reliability of CP Subsystem.

When it is enabled via CPSubsystemConfig.setPersistenceEnabled(boolean), CP members persist their local state to stable storage. When you restart the crashed CP members, they restore their local state and resume working as if they have never crashed. CP Subsystem Persistence enables you to handle single or multiple CP member crashes, or even whole cluster crashes and guarantees that committed operations are not lost after recovery. In other words, CP member crashes and restarts do not create any consistency problem. As long as majority of CP members are available after recovery, CP Subsystem remains operational.

Please see the CP Subsystem Configuration section for the configuration options of CP Subsystem Persistence.

When CP Subsystem Persistence is enabled, all Hazelcast cluster members create a sub-directory under the base persistence directory which is specified via CPSubsystemConfig.getBaseDir(). This means that AP Hazelcast members, which are the ones not marked as CP members during the CP discovery process, create their persistence directories as well. Those members persist only the information that they are not CP members. This is done because when a Hazelcast member starts with CP Subsystem Persistence enabled, it checks if there is a CP persistence directory belonging to itself. If it founds one, it skips the CP discovery process and initializes its CP member identity from the persisted data. If it was an AP member before shutdown or crash, it restores this information and starts as an AP member. Otherwise, it could think that the CP discovery process has not been executed and trigger it, which would break CP Subsystem.

In light of this information, if you have both CP and AP members in your cluster when CP Subsystem Persistence is enabled, and if you want to perform a cluster-wide restart, you need to ensure that AP members are also restarted with their CP persistence directories.

You can check the code sample below to see how CP Subsystem Persistence works in general. In this code sample, we configure CP Subsystem with 3 CP members and also enable CP Subsystem Persistence. We start 3 Hazelcast members with this configuration and update a CP IAtomicLong instance. Each member creates a sub-directory for itself inside the default base CP Subsystem Persistence directory and stores its local CP state there. Then, we terminate two of these members as if they crash and restart only 1 of them back. When we fetch the same IAtomicLong instance from the restarted members and get its current value, we see that it returns the update that we made before terminating these members. Please note that we make sure that we have the majority of CP members alive to keep CP Subsystem available after restart.

        Config config = new Config();
        NetworkConfig networkConfig = config.getNetworkConfig();
        JoinConfig join = networkConfig.getJoin();
        TcpIpConfig tcpIpConfig = join.getTcpIpConfig();

        HazelcastInstance instance1 = Hazelcast.newHazelcastInstance(config);
        HazelcastInstance instance2 = Hazelcast.newHazelcastInstance(config);
        HazelcastInstance instance3 = Hazelcast.newHazelcastInstance(config);

        IAtomicLong counter = instance1.getCPSubsystem().getAtomicLong("counter");


        instance1 = Hazelcast.newHazelcastInstance(config);

        counter = instance1.getCPSubsystem().getAtomicLong("counter");

        long val = counter.get();
        assert val == 1L;

CP Subsystem Persistence Behavior During CP Subsystem Reset

If the majority of CP members are permanently lost, CP Subsystem becomes unavailable. There is no solution to recover from this failure case with strong consistency guarantee. CP Subsystem Management API contains a method to delete all CP Subsystem state on the remaining CP members and start from scratch. CPSubsystemManagementService.reset() wipes and resets the whole CP Subsystem state and initializes it as if the Hazelcast cluster is starting up for the first time. This method deletes the persisted CP member states as well.

Using CP Subsystem Persistence with AP Persistence

Hazelcast also offers Persistence (AP Persistence) for protecting data in maps and Jcache data structures from planned cluster shutdowns and cluster-wide crashes.

Data stored in AP Persistence may lose some of the acknowledged updates on AP data structures, based on how you configure the fsync behavior for your persisted AP data structures.

If you store AP and CP data in a single Hazelcast cluster and use AP Persistence and CP Subsystem Persistence, Hazelcast member restarts or cluster restarts can fail because of the AP Persistence recovery semantics, even if the CP Subsystem Persistence recovery procedure is successful, or vice-versa.