Hazelcast’s Replication Algorithm
The discussion here generally applies to any system that maintains multiple copies of
a data set. It applies to Hazelcast as well. In the context of CAP principle, Hazelcast offers
AP and CP functionality with different data structure implementations.
Data structures exposed under
HazelcastInstance API are all AP data structures.
Hazelcast also contains a CP subsystem, built on the Raft consensus algorithm and
HazelcastInstance.getCPSubsytem() which provides CP data structures and APIs.
|The replication algorithm and consistency model explained below apply to AP data structures only. For CP subsystem and CP data structures, see the CP Subsystem section.|
For AP data structures, Hazelcast employs the combination of primary-copy and configurable lazy replication techniques. As briefly described in the Data Partitioning section, each data entry is mapped to a single Hazelcast partition and put into replicas of that partition. One of the replicas is elected as the primary replica, which is responsible for performing operations on that partition. When you read or write a map entry, you transparently talk to the Hazelcast member to which primary replica of the corresponding partition is assigned. By this way, each request hits the most up-to-date version of a particular data entry in a stable cluster. Backup replicas stay in standby mode until the primary replica fails. Upon failure of the primary replica, one of the backup replicas is promoted to the primary role.
With lazy replication, when the primary replica receives an update operation for a key, it executes the update locally and propagates it to backup replicas. It marks each update with a logical timestamp so that backups apply them in the correct order and converge to the same state with the primary. Backup replicas can be used to scale reads (see the Enabling Backup Reads section) with no strong consistency but monotonic reads guarantee.
Hazelcast offers features such as SplitBrainProtection, ILock and AtomicLong. In the journey of being a highly elastic, dynamic and easy to use product, Hazelcast tries to provide best-effort consistency guarantees without being a complete CP solution. Therefore, we recommend these features to be used for efficiency purposes in general, instead of correctness. For instance, they can be used to prevent to run a resource-extensive computation multiple times, which would not create any correctness problem if runs more than once. See the Best-Effort Consistency and Network Partitioning sections for more information.
Hazelcast’s replication technique enables Hazelcast clusters to offer high throughput. However, due to temporary situations in the system, such as network interruption, backup replicas can miss some updates and diverge from the primary. Backup replicas can also hit VM or long GC pauses, and fall behind the primary, which is a situation called as replication lag. If a Hazelcast partition primary replica member crashes while there is a replication lag between itself and the backups, strong consistency of the data can be lost.
Please note that CP systems can have similar problems as well. However, in a CP system, once a replica performs an update locally (i.e., commits the update), the underlying consensus algorithm guarantees durability of the update for the rest of the execution.
On the other hand, in AP systems like Hazelcast, a replica can perform an update locally, even if the update is not to be performed on other replicas. This is a fair trade-off to reduce amount of coordination among replicas and maintain high throughput & high availability of the system. These systems employ additional measurements to maintain consistency in a best-effort manner. In this regard, Hazelcast tries to minimize the effect of such scenarios using an active anti-entropy solution as follows:
Each Hazelcast member runs a periodic task in the background.
For each primary replica it is assigned, it creates a summary information and sends it to the backups.
Then, each backup member compares the summary information with its own data to see if it is up-to-date with the primary.
If a backup member detects a missing update, it triggers the synchronization process with the primary.