CP Subsystem Persistence
Hazelcast IMDG Pro and Enterprise Feature
CP Subsystem Persistence Overview
CP Subsystem Persistence enables CP members to recover from crash scenarios.
This capability significantly improves the overall reliability of CP Subsystem.
When it is enabled via CPSubsystemConfig.setPersistenceEnabled(boolean)
, CP
members persist their local state to stable storage. When you restart the crashed
CP members, they restore their local state and resume working as if they have
never crashed. CP Subsystem Persistence enables you to handle single or
multiple CP member crashes, or even whole cluster crashes and guarantees that
committed operations are not lost after recovery. In other words, CP member
crashes and restarts do not create any consistency problem. As long as majority
of CP members are available after recovery, CP Subsystem remains operational.
Please see the CP Subsystem Configuration section for the configuration options of CP Subsystem Persistence.
When CP Subsystem Persistence is enabled, all Hazelcast cluster members create
a sub-directory under the base persistence directory which is specified via
CPSubsystemConfig.getBaseDir()
. This means that AP Hazelcast members, which
are the ones not marked as CP members during the CP discovery process, create
their persistence directories as well. Those members persist only
the information that they are not CP members. This is done because when
a Hazelcast member starts with CP Subsystem Persistence enabled, it checks if
there is a CP persistence directory belonging to itself. If it founds one, it
skips the CP discovery process and initializes its CP member identity from
the persisted data. If it was an AP member before shutdown or crash, it
restores this information and starts as an AP member. Otherwise, it could think
that the CP discovery process has not been executed and trigger it, which would
break CP Subsystem.
In light of this information, if you have both CP and AP members in your cluster when CP Subsystem Persistence is enabled, and if you want to perform a cluster-wide restart, you need to ensure that AP members are also restarted with their CP persistence directories. |
You can check the code sample below to see how CP Subsystem Persistence works
in general. In this code sample, we configure CP Subsystem with 3 CP members
and also enable CP Subsystem Persistence. We start 3 Hazelcast members with
this configuration and update a CP IAtomicLong
instance. Each member creates
a sub-directory for itself inside the default base CP Subsystem Persistence
directory and stores its local CP state there. Then, we terminate two of these
members as if they crash and restart only 1 of them back. When we fetch
the same IAtomicLong
instance from the restarted members and get its current
value, we see that it returns the update that we made before terminating these
members. Please note that we make sure that we have the majority of CP members
alive to keep CP Subsystem available after restart.
Config config = new Config();
config.setLicenseKey("your-license-key");
NetworkConfig networkConfig = config.getNetworkConfig();
JoinConfig join = networkConfig.getJoin();
TcpIpConfig tcpIpConfig = join.getTcpIpConfig();
tcpIpConfig.setEnabled(true);
tcpIpConfig.addMember("127.0.0.1");
// config.getCPSubsystemConfig().setCPMemberCount(3).setPersistenceEnabled(true);
HazelcastInstance instance1 = Hazelcast.newHazelcastInstance(config);
HazelcastInstance instance2 = Hazelcast.newHazelcastInstance(config);
HazelcastInstance instance3 = Hazelcast.newHazelcastInstance(config);
IAtomicLong counter = instance1.getCPSubsystem().getAtomicLong("counter");
counter.set(0);
counter.incrementAndGet();
instance1.getLifecycleService().terminate();
instance2.getLifecycleService().terminate();
instance1 = Hazelcast.newHazelcastInstance(config);
counter = instance1.getCPSubsystem().getAtomicLong("counter");
long val = counter.get();
assert val == 1L;
CP Subsystem Persistence Behavior During CP Subsystem Reset
If the majority of CP members are permanently lost, CP Subsystem becomes
unavailable. There is no solution to recover from this failure case with strong
consistency guarantee. CP Subsystem Management API contains a method to delete
all CP Subsystem state on the remaining CP members and start from scratch.
CPSubsystemManagementService.reset()
wipes and resets the whole CP
Subsystem state and initializes it as if the Hazelcast cluster is starting up
for the first time. This method deletes the persisted CP member states as well.
Interaction with Hot Restart Persistence
Hazelcast offers another persistence capability which is called
Hot Restart Persistence. Hot Restart Persistence
is used for restarting a cluster with large AP data after a planned cluster
shutdown or a whole cluster-wide crash. Please note that CP Subsystem
Persistence and Hot Restart Persistence are separate features with different
behaviors and reliability guarantees. For instance, CP Subsystem Persistence
guarantees that committed operations will be restored and the linearizability
semantics of the CP Subsystem data structures will be preserved on restarts.
However, Hot Restart Persistence may lose some of the acknowledged updates on
AP data structures, based on how you configure the fsync
behavior for your
persisted AP data structures. Moreover, if you store AP and CP data in a single
Hazelcast cluster and use both of the persistence features, Hazelcast member
restarts or cluster restarts can fail because of the Hot Restart Persistence
recovery semantics, even if the CP Subsystem Persistence recovery procedure is
successful, or vice-versa.