Delta WAN Synchronization

Hazelcast clusters connected over WAN can go out-of-sync because of various reasons such as member failures and concurrent updates. To overcome the out-of-sync issue, Hazelcast has the default WAN synchronization feature, through which the maps in different clusters are synced by transferring all entries from the source to the target cluster. This may be not efficient since some of the entries have remained unchanged on both clusters and do not require to be transferred. Also, for the entries to be transferred, they need to be copied to on-heap on the source cluster. This may cause spikes in the heap usage, especially if using large off-heap stores.

Besides the default WAN synchronization, Hazelcast provides Delta WAN Synchronization which uses Merkle tree for the same purpose. It is a data structure used for efficient comparison of the difference in the contents of large data structures. The precision of this comparison is defined by Merkle tree’s depth. Merkle tree hash exchanges can detect inconsistencies in the map data and synchronize only the different entries when using WAN synchronization, instead of sending all the map entries.

Currently, Delta WAN Synchronization is implemented only for Hazelcast IMap. It will also be implemented for ICache in the future releases.

Requirements

To be able to use Delta WAN synchronization, the following must be met:

Source and target cluster versions must be at least Hazelcast 3.11.
Both clusters must have the same number of partitions.
Both clusters must use the same partitioning strategy.
Both clusters must have the Merkle tree structure enabled.

Using Delta WAN Synchronization

To be able to use Delta WAN synchronization for a Hazelcast data structure:

Configure the WAN synchronization mechanism for your WAN publisher so that it uses the Merkle tree: If configuring declaratively, you can use the consistency-check-strategy sub-element of the wan-sync element. If configuring programmatically, you can use the setter of the WanSyncConfig object.
Bind that WAN synchronization configuration to the data structure (currently IMap): Simply set the WAN replication reference of your map to the name of the WAN replication configuration which uses the Merkle tree.

Following is a declarative configuration example of the above:

<hazelcast>
    ...
    <map name="myMap">
        <wan-replication-ref name="wanReplicationScheme">
        </wan-replication-ref>
    </map>
    <wan-replication name="wanReplicationScheme">
        <wan-publisher group-name="groupName">
            <class-name>...</class-name>
            <wan-sync>
                <consistency-check-strategy>MERKLE_TREES</consistency-check-strategy>
            </wan-sync>
        </wan-publisher>
    </wan-replication>
    ...
</hazelcast>

Here, the element consistency-check-strategy sets the strategy for checking the consistency of data between the source and target clusters. You must initiate the WAN synchronization (via Management Center or REST API as explained in Synchronizing WAN clusters) to let this strategy reconcile the inconsistencies. The element consistency-check-strategy has currently two values:

NONE: Means that there are no consistency checks. This is the default value.
MERKLE_TREES: Means that WAN synchronization uses Merkle tree structure.

Configuring Delta WAN Synchronization

You can configure Delta WAN Synchronization declaratively using the merkle-tree element or programmatically using the MerkleTreeConfig object.

Following is a declarative configuration example showing how to enable Delta WAN Synchronization, bind it to a Hazelcast data structure (an IMap in the below case) and specify its depth.

<hazelcast>
    ...
    <merkle-tree enabled="true">
        <mapName>someMap</mapName>
        <depth>5</depth>
    </merkle-tree>
    ...
</hazelcast>

Here are the descriptions of sub-elements and attributes:

enabled: Specifies whether the Merkle tree structure is enabled. Its default value is true.
mapName: Specifies the name of the map for which the Merkle tree structure is used.
depth: Specifies the depth of Merkle tree. Valid values are between 2 and 27 (exclusive). Its default value is 10.
- A larger depth means that a data synchronization mechanism is able to pinpoint a smaller subset of the data structure (e.g., IMap) contents in which a change has occurred. This causes the synchronization mechanism to be more efficient. However, keep in mind that a large depth means that the Merkle tree will consume more memory. As the comparison mechanism is iterative, a larger depth also prolongs the comparison duration. Therefore, it is recommended not to have large tree depths if the latency of the comparison operation is high.
- A smaller depth means that the Merkle tree is shallower and the data synchronization mechanism transfers larger chunks of the data structure (e.g., IMap) in which a possible change has happened. As you can imagine, a shallower Merkle tree will consume less memory.

Following is a declarative example including the Merkle tree configuration.

<hazelcast>
    ...
    <map name="myMap">
        <wan-replication-ref name="wanReplicationScheme">
            ...
        </wan-replication-ref>
    </map>
    <merkle-tree enabled="true">
        <mapName>myMap</mapName>
        <depth>10</depth>
    </merkle-tree>
    <wan-replication name="wanReplicationScheme">
        <wan-publisher group-name="groupName">
            <class-name>...</class-name>
            <wan-sync>
                <consistency-check-strategy>MERKLE_TREES</consistency-check-strategy>
            </wan-sync>
        </wan-publisher>
    </wan-replication>
    ...
</hazelcast>

If you do not specifically configure the merkle-tree in your Hazelcast configuration, Hazelcast uses the default Merkle tree structure values (i.e., it is enabled by default and its default depth is 10) when there is a WAN publisher using the Merkle tree (i.e., consistency-check-strategy for a WAN replication configuration is set as MERKLE_TREES and there is a data structure using that WAN replication configuration).

Merkle trees are created for each partition holding IMap data. Therefore, increasing the partition count also increases the efficiency of the Delta WAN Synchronization.

The Process

Synchronizing the maps based on Merkle trees consists of two phases:

Consistency check: Process of exchanging and comparing the hashes stored in the Merkle tree structures in the source and target clusters. The check starts with the root node and continues recursively with the children with different hash codes. Both sides send the children of the nodes that the other side sent, hence the comparison is done by depth/2 steps. After this check, the tree leaves holding different entries are identified.
Synchronization: Process of transferring the entries belong to the leaves identified by the consistency check from the source to target cluster. On the target cluster the configured merge policy is applied for each entry that is in both the source and target clusters.

If you only need the differences between the clusters, you can trigger the consistency check without performing synchronization.

Memory Consumption

Since Merkle trees are built for each partition and each map, the memory overhead of the trees with high entry count and deep trees can be significant. The trees are maintained on-heap, therefore - besides the memory consumption - garbage collection could be another concern. Make sure the configuration is tested with realistic data size before deployed in production.

The table below shows a few examples for what the memory overhead could be.

Table 1. Merkle trees memory overhead for a member
Entries Stored	Partitions Owned	Entries per Leaf	Depth	Memory Overhead
1M	271	7	10	57 MB
1M	271	1	13	97 MB
10M	271	72	10	412 MB
10M	271	9	13	453 MB
10M	5009	4	10	577 MB
10M	5009	1	12	900 MB
25M	5009	10	10	1986 MB
25M	5009	1	13	2740 MB

Defining the Depth

The efficiency of the Delta WAN Synchronization (WAN synchronization based on Merkle trees) is determined by the average number of entries per the tree leaves that is proportionate to the number of entries in the map. The bigger this average the more entries are getting synchronized for the same difference. Raising the depth decreases this average at the cost of increasing the memory overhead.

This average can be calculated for a map as avgEntriesPerLeaf = mapEntryCount / totalLeafCount, where totalLeafCount = partitionCount * 2^depth-1. The ideal value is 1, however this may come at significant memory overhead as shown in the table above.

In order to specify the tree depth, a trade-off between memory consumption and effectiveness might be needed.

Even if the map is huge and the Merkle trees are configured to be relatively shallow, the Merkle tree based synchronization may be leveraged if only a small subset of the whole map is expected to be synchronized. The table below illustrates the efficiency of the Merkle tree based synchronization compared to the default synchronization mechanism.

Table 2. Efficiency examples
Map entry count	Depth	Memory consumption	Avg entries / leaf	Difference count	Entries synced	Efficiency
10M	11	685 MB	2	5M	10M	0%
10M	12	900 MB	1	5M	5M	100%
10M	10	577 MB	4	1M	4M	150%
10M	8	497 MB	16	10K	160K	6150%
10M	12	900 MB	1	10K	10K	99900%

The Difference count column shows the number of the entries different in the source and the target clusters. This is the minimum number of the entries that need to be synchronized to make the clusters consistent. The Entries synced column shows how many entries are synchronized in the given case, calculated as Entries synced = Difference count * Avg entries / leaf.

As shown in the last two rows, the Merkle tree based synchronization transfers significantly less entries than what the default mechanism does even with 8 deep trees. The efficiency with depth 12 is even better but consumes much more memory.

The averages in the table are calculated with 5009 partitions.

The average entries per leaf number above assumes perfect distribution of the entries amongst the leaves. Since this is typically not true in real-life scenarios the efficiency can be slightly worse. The statistics section below describes how to get the actual average for the leaves involved in the synchronization.

REST API

To be able to use the REST calls related to synchronization, you need to enable the WAN REST endpoint group. See Using the REST Endpoint Groups section for details.

The two phases of the Merkle tree based synchronization can be triggered by a REST call, as it can be done with the default synchronization.

The URL for the consistency check REST call:

http://member_ip:port/hazelcast/rest/mancenter/wan/consistencyCheck/map

The URL for the synchronization REST call - the same as it is for the default synchronization:

http://member_ip:port/hazelcast/rest/mancenter/wan/sync/map

You need to add parameters to the request in both cases in the following order separated by "&";

Name of the WAN replication configuration
Target group name
Map name to be synchronized

You can also use the following URL in your REST call if you want to synchronize all the maps in source and target cluster: http://member_ip:port/hazelcast/rest/mancenter/wan/sync/allmaps

Consistency check can be triggered only for one map.

Statistics

The consistency check and the synchronization both write statistics into the diagnostics subsystem and send it to the Management Center. The following reported fields can be used to reason about the efficiency of the configuration.

Consistency check reports the number of the:

Merkle tree nodes checked
Merkle tree nodes found to be different
and entries needed to be synchronized to make the clusters consistent.

Synchronization reports the:

duration of the synchronization
number of the entries synchronized
average number of the entries per tree leaves in the synchronized leaves.