Making CP Members Fault Tolerant
|To use this feature, you need an Enterprise license.|
CP Subsystem Persistence enables CP members to recover from crash scenarios. This capability significantly improves the overall reliability of CP Subsystem.
When it is enabled via CPSubsystemConfig.setPersistenceEnabled(), CP
members persist their local state to stable storage. When you restart the crashed
CP members, they restore their local state and resume working as if they had
never crashed. CP Subsystem Persistence enables you to handle single or
multiple CP member crashes, or even whole cluster crashes, and guarantees that
committed operations are not lost after recovery. In other words, CP member
crashes and restarts do not create any consistency problems. As long as a majority
of CP members are available after recovery, CP Subsystem remains operational.
Please see the CP Subsystem Configuration section for the configuration options of CP Subsystem Persistence.
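The majority rule above can be illustrated with a small sketch in plain Java. It does not use any Hazelcast API. The class and method names (CpMajority, majority, isOperational) are hypothetical and exist only for this illustration. It assumes the usual definition of a majority as more than half of the configured CP members.

```java
// Illustrative only: plain Java, no Hazelcast API involved.
// CpMajority and its methods are hypothetical names for this sketch.
final class CpMajority {

    // Smallest number of CP members that forms a majority (more than half).
    static int majority(int cpMemberCount) {
        if (cpMemberCount < 1) {
            throw new IllegalArgumentException("cpMemberCount must be positive");
        }
        return cpMemberCount / 2 + 1;
    }

    // True if enough CP members are available after recovery
    // for CP Subsystem to remain operational.
    static boolean isOperational(int cpMemberCount, int availableMembers) {
        return availableMembers >= majority(cpMemberCount);
    }

    public static void main(String[] args) {
        // 3 CP members, two crash, one is restarted: 2 of 3 are available.
        System.out.println(isOperational(3, 2)); // prints "true"
        // 3 CP members, two crash, none is restarted: 1 of 3 is available.
        System.out.println(isOperational(3, 1)); // prints "false"
    }
}
```

With 3 CP members, at least 2 must be available after recovery. With 5 CP members, at least 3 must be available.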
When CP Subsystem Persistence is enabled, all Hazelcast cluster members create
a sub-directory under the base persistence directory which is specified via
CPSubsystemConfig.getBaseDir(). This means that AP Hazelcast members, which
are the ones not marked as CP members during the CP discovery process, create
their persistence directories as well. Those members persist only
the information that they are not CP members. This is done because when
a Hazelcast member starts with CP Subsystem Persistence enabled, it checks if
there is a CP persistence directory belonging to itself. If it finds one, it
skips the CP discovery process and initializes its CP member identity from
the persisted data. If it was an AP member before shutdown or crash, it
restores this information and starts as an AP member. Otherwise, it could think
that the CP discovery process has not been executed and trigger it, which would
break CP Subsystem.
|In light of this information, if you have both CP and AP members in your cluster when CP Subsystem Persistence is enabled, and if you want to perform a cluster-wide restart, you need to ensure that AP members are also restarted with their CP persistence directories.|
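The startup check described above can be modeled as a small decision function. This is only a conceptual sketch in plain Java and not Hazelcast's actual implementation. All type and method names (CpStartupModel, PersistedRole, StartupAction, decide) are hypothetical. An empty Optional stands for a member that finds no CP persistence directory belonging to itself.

```java
import java.util.Optional;

// Conceptual model only: these types are hypothetical and are not Hazelcast classes.
final class CpStartupModel {

    // What a member finds in its own CP persistence directory, if one exists.
    enum PersistedRole { CP_MEMBER, AP_MEMBER }

    // What the member does on startup.
    enum StartupAction { RESTORE_CP_IDENTITY, START_AS_AP_MEMBER, RUN_CP_DISCOVERY }

    static StartupAction decide(Optional<PersistedRole> persisted) {
        if (!persisted.isPresent()) {
            // No directory found: the member assumes that the CP discovery
            // process has not been executed yet and takes part in it.
            return StartupAction.RUN_CP_DISCOVERY;
        }
        // Directory found: skip discovery and restore the persisted role.
        return persisted.get() == PersistedRole.CP_MEMBER
                ? StartupAction.RESTORE_CP_IDENTITY
                : StartupAction.START_AS_AP_MEMBER;
    }

    public static void main(String[] args) {
        System.out.println(decide(Optional.of(PersistedRole.CP_MEMBER)));
        System.out.println(decide(Optional.of(PersistedRole.AP_MEMBER)));
        // An AP member restarted without its directory falls into this case,
        // which is the situation the note above warns about.
        System.out.println(decide(Optional.empty()));
    }
}
```

The last case shows why AP members must keep their CP persistence directories across a cluster-wide restart: without the directory, a former AP member cannot tell that the CP discovery process has already completed.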
The code sample below shows how CP Subsystem Persistence works in general.
In this code sample, we configure CP Subsystem with 3 CP members
and also enable CP Subsystem Persistence. We start 3 Hazelcast members with
this configuration and update a CP IAtomicLong instance. Each member creates
a sub-directory for itself inside the default base CP Subsystem Persistence
directory and stores its local CP state there. Then, we terminate two of these
members as if they had crashed and restart only one of them. When we fetch
the IAtomicLong instance from the restarted member and get its current
value, we see that it returns the update that we made before terminating these
members. Note that we make sure that a majority of CP members is
alive to keep CP Subsystem available after the restart.
// Configure a TCP/IP cluster on localhost.
Config config = new Config();
config.setLicenseKey("your-license-key");
NetworkConfig networkConfig = config.getNetworkConfig();
JoinConfig join = networkConfig.getJoin();
TcpIpConfig tcpIpConfig = join.getTcpIpConfig();
tcpIpConfig.setEnabled(true);
tcpIpConfig.addMember("127.0.0.1");

// Configure 3 CP members and enable CP Subsystem Persistence.
config.getCPSubsystemConfig().setCPMemberCount(3).setPersistenceEnabled(true);

// Start 3 members with the same configuration.
HazelcastInstance instance1 = Hazelcast.newHazelcastInstance(config);
HazelcastInstance instance2 = Hazelcast.newHazelcastInstance(config);
HazelcastInstance instance3 = Hazelcast.newHazelcastInstance(config);

// Update a CP IAtomicLong; its value is 1 after the increment.
IAtomicLong counter = instance1.getCPSubsystem().getAtomicLong("counter");
counter.set(0);
counter.incrementAndGet();

// Terminate two members as if they crashed.
instance1.getLifecycleService().terminate();
instance2.getLifecycleService().terminate();

// Restart one member so that 2 of 3 CP members (a majority) are alive.
instance1 = Hazelcast.newHazelcastInstance(config);

// The committed update survives the crash and restart.
counter = instance1.getCPSubsystem().getAtomicLong("counter");
long val = counter.get();
assert val == 1L;
If a majority of CP members is permanently lost, CP Subsystem becomes
unavailable. There is no way to recover from this failure case with a strong
consistency guarantee. The CP Subsystem Management API contains a method to delete
all CP Subsystem state on the remaining CP members and start from scratch.
CPSubsystemManagementService.reset() wipes and resets the whole CP
Subsystem state and initializes it as if the Hazelcast cluster is starting up
for the first time. This method deletes the persisted CP member states as well.
Hazelcast also offers Persistence (AP Persistence) for protecting data in maps and JCache data structures from planned cluster shutdowns and cluster-wide crashes.
AP Persistence may lose some of the acknowledged updates on
AP data structures, depending on how you configure the fsync behavior for your
persisted AP data structures.
If you store AP and CP data in a single Hazelcast cluster and use AP Persistence and CP Subsystem Persistence, Hazelcast member restarts or cluster restarts can fail because of the AP Persistence recovery semantics, even if the CP Subsystem Persistence recovery procedure is successful, or vice-versa.