CP Subsystem
The CP Subsystem is a component of a Hazelcast cluster that builds a strongly consistent layer for a set of distributed data structures. It withstands server and client failures as well as network partitions.
CP Subsystem APIs can be used for the following:
- Implementing distributed coordination use cases, such as leader election
- Distributed locking
- Synchronization
- Metadata management
Glossary
Term | Definition
---|---
caller | A Hazelcast member or client that interacts with the CP Subsystem through APIs.
CP Data Structures
Data structures in the CP Subsystem always maintain linearizability and prefer consistency over availability during network partitions. These data structures are CP with respect to the CAP theorem.
CP Subsystem supports the following Hazelcast data structures: AtomicLong, AtomicReference, CountDownLatch, FencedLock, Semaphore, and CPMap (Enterprise Edition).
CP Members
Because CP data structures do not maintain large states, not all members of a Hazelcast cluster need to take part in the CP Subsystem. The members that do take part in the CP Subsystem are called CP members.
To enable the CP Subsystem, you need to configure the number of CP members in the cluster. When a cluster is configured with `N` CP members, the first `N` members to start form the CP Subsystem.
After the CP Subsystem is initialized, more CP members can be added at runtime, and the number of active CP members can exceed the configured CP member count. The number of CP members can also be smaller than the total size of the Hazelcast cluster. For instance, you can run 5 CP members in a Hazelcast cluster of 20 members.
CP members can also contain data for the AP data structures, such as map and set.
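As a minimal sketch of enabling the CP Subsystem programmatically (the class name and member count are illustrative; you can also configure the same setting declaratively):

```java
import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class EnableCpSubsystem {
    public static void main(String[] args) {
        Config config = new Config();
        // With a CP member count of 3, the first 3 members to start form the CP Subsystem.
        config.getCPSubsystemConfig().setCPMemberCount(3);

        HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
        // This member can still hold data for AP data structures such as map and set.
    }
}
```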
When a member becomes a CP member, it generates an additional UUID that other CP members can use to identify it. You will see this CP UUID in the following places:
- Requests to REST endpoints in the CP group
- Responses from REST endpoints in the CP group
- Member logs
- Management Center
Unsafe Mode
When a cluster is not configured with CP members, the CP Subsystem is disabled and any CP data structures operate in unsafe mode.
Hazelcast uses unsafe mode by default.
In this mode, the following restrictions apply:
- CP data structures use partitioning and lazy replication mechanisms instead of the Raft consensus mechanism.
- CP Subsystem Management APIs are not available.
- Split-brain protection is not supported.
Unsafe mode provides weaker consistency guarantees than the CP Subsystem. For example, when you increment an atomic long in unsafe mode, you may receive a success response even though the increment can be lost if a member fails.
As a result, unsafe mode is not recommended for use cases that require strong consistency; it is suitable only for development and testing.
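As a rough illustration of this weaker guarantee, a sketch assuming the Java API (the structure name is arbitrary):

```java
import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.cp.IAtomicLong;

public class UnsafeModeExample {
    public static void main(String[] args) {
        // No CP member count is configured, so the CP Subsystem is disabled
        // and CP data structure proxies operate in unsafe mode.
        HazelcastInstance hz = Hazelcast.newHazelcastInstance(new Config());

        IAtomicLong counter = hz.getCPSubsystem().getAtomicLong("hits");
        counter.incrementAndGet(); // acknowledged, but may be lost if the owning member fails
    }
}
```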
CP Groups
CP data structures are stored in CP groups. The CP Subsystem includes two CP groups by default:
- `METADATA`: An internal CP group, which is responsible for managing CP members and CP groups. This group is initialized during cluster startup when the CP Subsystem is enabled.
- `DEFAULT`: The default group that CP data structures are stored in when no specific group name is given when they are created. See CP Data Structures.
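A brief sketch of group placement, assuming the Java API and the `name@group` naming convention (the structure and group names are illustrative):

```java
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.cp.CPSubsystem;
import com.hazelcast.cp.IAtomicLong;

public class CpGroupPlacement {
    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        CPSubsystem cp = hz.getCPSubsystem();

        // Stored in the DEFAULT CP group because no group name is given.
        IAtomicLong defaultCounter = cp.getAtomicLong("counter");

        // Stored in a custom CP group named "orders", created on first use.
        IAtomicLong orderCounter = cp.getAtomicLong("counter@orders");
    }
}
```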
Majority and Minority
Operations in the CP Subsystem are committed and executed only after they are successfully replicated to the majority of CP members in a CP group. As a result, a CP group must maintain a majority of CP members to continue working.
For example, when the configured CP member count is 7, the majority is 4 members and the minority is 3 members; with 6 of the 7 CP members accessible, the majority is still available and the CP group continues working.
To configure the number of CP members that participate in a group, use the `group-size` option.
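For example, a configuration sketch using the programmatic counterpart of `group-size` (the member counts are illustrative):

```java
import com.hazelcast.config.Config;
import com.hazelcast.config.cp.CPSubsystemConfig;

public class GroupSizeConfig {
    public static void main(String[] args) {
        Config config = new Config();
        CPSubsystemConfig cpConfig = config.getCPSubsystemConfig();
        cpConfig.setCPMemberCount(5); // 5 members join the CP Subsystem
        cpConfig.setGroupSize(3);     // each CP group is formed from 3 of those members
        // With a group size of 3, a group's majority is 2, so it tolerates the loss of 1 member.
    }
}
```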
Leaders and Followers
Each CP group elects its own Raft leader, which runs the Raft consensus algorithm. All other CP members in the group then become followers.
A Raft leader is responsible for handling incoming requests from callers and replicating those requests to follower members. To maintain the authority of the Raft leader and to help lagging CP group members to make progress, each CP group uses internal heartbeats.
Each CP member can participate in more than one CP group. However, if a CP member is the leader of too many CP groups compared to other CP members, it can turn into a bottleneck. Therefore, the CP Subsystem runs a periodic background task to try to ensure that each CP member is the Raft leader of roughly the same number of CP groups. For example, if a cluster has three CP members and three CP groups, each CP member becomes the Raft leader of exactly one CP group. If one more CP group is created, one of the CP members becomes the Raft leader of two CP groups.
You can also give CP members a priority rating to affect their chances of becoming a leader. See Configuring Leadership Priority.
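A hedged configuration sketch, assuming the programmatic counterpart of the leadership priority setting is `CPSubsystemConfig.setCPMemberPriority` (see Configuring Leadership Priority for the authoritative option names and semantics):

```java
import com.hazelcast.config.Config;

public class LeadershipPriority {
    public static void main(String[] args) {
        Config config = new Config();
        config.getCPSubsystemConfig().setCPMemberCount(3);
        // Assumption: members with higher priority are favored as Raft leaders,
        // so a lower value makes this member more likely to stay a follower.
        config.getCPSubsystemConfig().setCPMemberPriority(-1);
    }
}
```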
CP Sessions
For CP data structures that involve resource ownership management, such as fenced locks or semaphores, sessions are required to keep track of the liveness of callers.
A caller initially creates a session before sending its first request to the CP group, such as to acquire a fenced lock. After creating a session on the CP group, the caller stores its session ID locally and sends it alongside its session-based operations. A single session is used for all lock and semaphore proxies of the caller.
When a CP group receives a session-based operation, it checks the validity of the session using the session ID information available in the operation. A session is valid if it is still open in the CP group.
An operation with a valid session ID is accepted as a new session heartbeat.
To keep its session alive, a caller commits a periodic heartbeat to the CP group in the background.
A session is closed when the caller does not touch it for a configurable duration. In this case, the caller is assumed to have crashed and all of its resources are released automatically.
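A sketch of the related session configuration and a session-creating call, assuming the Java API (the values and names are illustrative):

```java
import com.hazelcast.config.Config;
import com.hazelcast.config.cp.CPSubsystemConfig;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.cp.lock.FencedLock;

public class CpSessionExample {
    public static void main(String[] args) {
        Config config = new Config();
        CPSubsystemConfig cpConfig = config.getCPSubsystemConfig();
        cpConfig.setCPMemberCount(3);
        cpConfig.setSessionTimeToLiveSeconds(300);      // close a session after 300s without a heartbeat
        cpConfig.setSessionHeartbeatIntervalSeconds(5); // callers send a heartbeat every 5s

        HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
        FencedLock lock = hz.getCPSubsystem().getLock("orders-lock");
        lock.lock();   // the first session-based operation creates the caller's session
        try {
            // critical section
        } finally {
            lock.unlock();
        }
    }
}
```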
CP Member Discovery
When CP members start, they initiate a discovery process to find each other. Other Hazelcast members skip this process.
The CP discovery process runs out of the box without requiring any custom configuration for different environments. It is completed when each CP member initializes its local CP member list and commits it to the `METADATA` CP group. A soon-to-be CP member terminates itself if any of the following conditions occur before the CP discovery process is completed:
- Any Hazelcast member leaves the cluster.
- The local Hazelcast member commits a CP member list which is different from other members' committed CP member lists.
- The local Hazelcast member fails to commit its discovered CP member list for any reason.
When the CP Subsystem is reset, the CP discovery process is triggered again. In this case, however, Hazelcast members are not terminated when the discovery process fails, because they are likely to contain data for AP data structures and their termination could cause data loss. Hence, you need to observe the cluster and check whether the CP discovery process completes successfully after a CP Subsystem reset. See the CP Subsystem Management APIs section for more details.
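To observe the discovery process programmatically, a sketch assuming the CP Subsystem management service API (timeout value is illustrative):

```java
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.cp.CPSubsystemManagementService;

import java.util.concurrent.TimeUnit;

public class AwaitCpDiscovery {
    public static void main(String[] args) throws InterruptedException {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        CPSubsystemManagementService mgmt =
                hz.getCPSubsystem().getCPSubsystemManagementService();

        // Block for up to 2 minutes until the CP discovery process completes.
        if (mgmt.awaitUntilDiscoveryCompleted(2, TimeUnit.MINUTES)) {
            System.out.println("CP members: " + mgmt.getCPMembers().toCompletableFuture().join());
        } else {
            System.out.println("CP discovery did not complete within the timeout");
        }
    }
}
```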
Fault Tolerance
By default, the CP Subsystem works only in memory without persisting any state to disk. This means that a crashed CP member is not able to rejoin the cluster by restoring its previous state. Therefore, crashed CP members increase the risk of gradually losing the majorities of CP groups and, eventually, of losing the CP Subsystem entirely. To prevent such situations, crashed CP members can be removed from the CP Subsystem and replaced in CP groups with other available CP members. This flexibility provides a good degree of fault tolerance at runtime.
Persistence
Enterprise Edition
By default, the CP Subsystem works in memory without persisting any state to disk. As a result, a crashed CP member cannot recover by reloading its previous state. Therefore, crashed CP members may lead to gradually losing the majorities of CP groups and, eventually, to the total loss of availability of the CP Subsystem. To prevent such situations, you can enable CP Subsystem Persistence in the member configuration so that CP members persist their local CP state to stable storage.
CP Subsystem Persistence enables CP members to recover from member or cluster-wide crashes. As long as a majority of CP members are available after the recovery, the CP Subsystem remains operational, and guarantees that no committed operations are lost after recovery. When you restart a majority of CP members, they restore their local state and resume working as if they had never crashed.
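A minimal configuration sketch, assuming the programmatic API (the base directory path is an arbitrary example):

```java
import com.hazelcast.config.Config;
import com.hazelcast.config.cp.CPSubsystemConfig;

import java.io.File;

public class CpPersistenceConfig {
    public static void main(String[] args) {
        Config config = new Config();
        CPSubsystemConfig cpConfig = config.getCPSubsystemConfig();
        cpConfig.setCPMemberCount(3);
        cpConfig.setPersistenceEnabled(true);
        // Each CP member persists its local CP state under this directory.
        cpConfig.setBaseDir(new File("/var/lib/hazelcast/cp-data"));
    }
}
```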
Example Scenarios
The following scenarios describe the behavior of the CP Subsystem when a CP member crashes permanently, that is, when it crashes while CP Subsystem Persistence is disabled, or when it crashes while CP Subsystem Persistence is enabled but its CP data cannot be recovered:
- If a CP member leaves the Hazelcast cluster, it is not automatically removed from the CP Subsystem, because the CP Subsystem cannot determine whether that member has actually crashed or is just disconnected from the cluster. Therefore, absent CP members are still counted in majority calculations and pose a danger to the availability of the CP Subsystem. If you know for sure that an absent CP member has crashed, you can remove that CP member from the CP Subsystem (see the sketch after this list).
- There might be a small window of unavailability after a CP member crash, even if the majority of CP members are still online. For instance, if a crashed CP member is the Raft leader of some CP groups, those CP groups run a new leader election round to elect a new leader among the remaining CP group members. CP Subsystem API calls that internally hit those CP groups are retried until the groups have new Raft leaders. If a failed CP member has the Raft follower role, the disruption is minimal, since the Raft leaders are still able to replicate and commit operations with the majority of their CP group members.
- If a crashed CP member is restarted after it has been removed from the CP Subsystem, its behavior depends on whether CP Subsystem Persistence is enabled. If it is enabled, the restarted CP member is not able to restore its CP data from disk, because after rejoining the cluster it notices that it is no longer a CP member; it therefore fails its startup and prints an error message. The only thing to do in this case is to manually delete its CP Persistence directory, since its data is no longer useful. If CP Subsystem Persistence is disabled, the failed CP member remembers nothing about its previous CP identity and restarts as a new AP member.
- A CP member can encounter a network issue and disconnect from the cluster. If you remove this CP member from the CP Subsystem even though it is actually alive but only disconnected, you should terminate it to prevent any accidental communication with the other CP members in the CP Subsystem.
- If a network partition occurs, the behavior of the CP Subsystem depends on how the CP members are divided across the sides of the partition and on which sides Hazelcast clients are connected to. Each CP group remains available on the side that contains the majority of its CP members. If a Raft leader falls into the minority side, its CP group elects a new Raft leader on the other side, and callers that talk to the majority side continue to make successful API calls on the CP Subsystem. Callers that talk to the minority side, however, fail with operation timeouts. When the network problem is resolved, CP members reconnect to each other and CP groups resume normal operation.
- The CP Subsystem can tolerate the failure of a minority of CP members (fewer than `N / 2 + 1`) and remain available. If `N / 2 + 1` or more CP members crash, the CP Subsystem loses its availability. If CP Subsystem Persistence is enabled and a majority of CP members come back online by successfully restarting some of the failed CP members, the CP Subsystem regains its availability. Otherwise, the CP Subsystem has lost its majority irrevocably, and the only solution is to wipe out the whole CP Subsystem state by performing a force-reset (see the sketch after this list).
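As mentioned in the list above, a known-crashed CP member can be removed, and a force-reset is the last resort after an irrecoverable majority loss. A hedged sketch using the management service API (the CP UUID is supplied from your member logs or Management Center):

```java
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.cp.CPSubsystemManagementService;

import java.util.UUID;

public class RemoveCrashedCpMember {
    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        CPSubsystemManagementService mgmt =
                hz.getCPSubsystem().getCPSubsystemManagementService();

        // CP UUID of the crashed member, taken from the member logs or Management Center.
        UUID crashedCpUuid = UUID.fromString(args[0]);
        mgmt.removeCPMember(crashedCpUuid).toCompletableFuture().join();

        // Last resort only: wipe the whole CP Subsystem state after an irrecoverable majority loss.
        // mgmt.reset().toCompletableFuture().join();
    }
}
```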
When the CP member count is greater than the CP group size, CP groups are formed by selecting a subset of CP members. In this case, each CP group can have a different set of CP members and therefore different fault tolerance and availability conditions. The following list discusses the CP Subsystem's additional fault tolerance capabilities for this configuration:
- When the majority of a CP group other than the `METADATA` group permanently crashes, that CP group cannot make progress anymore, even though other CP groups in the CP Subsystem are running fine. Not even a new CP member can join this CP group, because membership changes also go through the Raft consensus algorithm. For this reason, the only option is to force-destroy this CP group (see the sketch after this list).

CP groups that have lost their majority must be force-destroyed immediately, because they can block the `METADATA` CP group from performing membership changes on the CP Subsystem.

- If the majority of the `METADATA` CP group permanently crashes, it is equivalent to the permanent crash of the majority of CP members of the whole CP Subsystem, even though other CP groups are running fine. Existing CP groups continue serving incoming requests, but since the `METADATA` CP group is no longer available, no management tasks can be performed on the CP Subsystem. For instance, a new CP group cannot be created. In this case, the only solution is to wipe out the whole CP Subsystem state by performing a force-reset. See CP Subsystem Management.
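A sketch of force-destroying a CP group that has lost its majority, assuming the management service API (the group name is illustrative):

```java
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class ForceDestroyCpGroup {
    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        // Force-destroy a non-METADATA CP group whose majority has permanently crashed.
        hz.getCPSubsystem()
          .getCPSubsystemManagementService()
          .forceDestroyCPGroup("orders")
          .toCompletableFuture()
          .join();
    }
}
```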
Kubernetes
We strongly encourage using Hazelcast Platform Operator for deployments into Kubernetes. If you choose to use Helm, use the official hazelcast/hazelcast-enterprise Helm Chart and configure within the limitations described in this section.
Deployment of CP within Kubernetes is supported from Hazelcast Enterprise 5.5 and covers the following scenarios when using Hazelcast Platform Operator or our hazelcast/hazelcast-enterprise Helm Chart:
- Deployment: see Deploying in Kubernetes
- Pause: scaling of pods to `0`
- Resume: scaling of pods back to the same number of pods defined at the point of deployment
- Rolling update
- Spurious pod restarts
We support 3-, 5-, and 7-member CP deployments under the constraints discussed in this section.
The method by which deployment, pause, resume and rolling update are performed will vary according to the way that CP was deployed. See Deploying in Kubernetes for more information.
We recommend setting `data-load-timeout-seconds` to a value that spans the duration from when the first pod is running to when the last pod is running and has completed its CP initialisation procedure. This is particularly important if you intend to perform resume scenarios. Currently, the only way to determine when a CP member has completed its initialisation is to consult the logs. Therefore, we recommend the following to determine a reasonable value for `data-load-timeout-seconds` (a configuration sketch follows the list):
- Load CP with an amount of data that is representative of your production use case.
- Pause the cluster.
- Resume the cluster and determine the duration in seconds between when the first pod in the `StatefulSet` is running and when the last pod in the `StatefulSet` is running and has output an `INFO` level log message that matches the pattern `CP restore completed…in`, as described below.
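Once you have measured that window, a configuration sketch for the programmatic counterpart of `data-load-timeout-seconds` (the 300-second value is an assumption; use your measured duration):

```java
import com.hazelcast.config.Config;
import com.hazelcast.config.cp.CPSubsystemConfig;

public class DataLoadTimeoutConfig {
    public static void main(String[] args) {
        Config config = new Config();
        CPSubsystemConfig cpConfig = config.getCPSubsystemConfig();
        cpConfig.setPersistenceEnabled(true);
        // Should cover the measured duration between the first pod running and the
        // last pod logging the "CP restore completed…in" message.
        cpConfig.setDataLoadTimeoutSeconds(300);
    }
}
```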
If you are using a log aggregation service and want to filter key startup events within CP, you can use the `INFO` level patterns emitted by `CPPersistenceServiceImpl` as detailed below.
Phrase | Example Match | Description
---|---|---
 | | Point at which the entire CP restoration process started.
 | | Point at which the entire CP restoration process completed, including notifying other CP members that the member has rejoined and the loading of its persisted data.
 | | Point at which a particular CP Group's data started loading.
 | | Point at which a particular CP Group's data completed loading.