
Persistence

Hazelcast Enterprise Feature

Allow individual members and whole clusters to recover faster by persisting map entries and JCache data on disk. Members can use persisted data to recover during restarts.

Persistence Overview

Persistence enables your cluster to get up and running faster after a restart. A restart can be caused by a planned shutdown (including rolling upgrades), a sudden cluster-wide crash, or a single member failure.

Persistence supports optional data encryption. See the Encryption at Rest section for more information.

Persistence Types

The Persistence feature is supported for the following restart types:

  • Restart after a planned shutdown:

    • The cluster is shut down completely and restarted with the exact same previous setup and data.

      You can shut down the cluster completely using the HazelcastInstance.getCluster().shutdown() method or you can manually change the cluster state to PASSIVE and then shut down each member one by one. When you send the command to shut down the cluster, the members that are not in the PASSIVE state temporarily change their states to PASSIVE. Then, each member shuts itself down by calling the HazelcastInstance.shutdown() method.

      The difference between explicitly changing the cluster state to PASSIVE before shutdown and shutting the cluster down directly via HazelcastInstance.getCluster().shutdown() is that, in the latter case, the cluster restarts in the latest state it had before the shutdown. That means if the cluster was ACTIVE before the shutdown, its state automatically becomes ACTIVE again after the restart is completed (see the sketch after this list).

    • Rolling restart: The cluster is restarted intentionally member by member. For example, this could be done to install an operating system patch or new hardware.

      To be able to shut down the cluster member by member as part of a planned restart, each member in the cluster should be in the FROZEN or PASSIVE state. After the cluster state is changed to FROZEN or PASSIVE, you can manually shut down each member by calling the method HazelcastInstance.shutdown(). When that member is restarted, it rejoins the running cluster. After all members are restarted, the cluster state can be changed back to ACTIVE.

  • Restart after a cluster crash: The cluster is restarted after all its members crashed at the same time due to a power outage, networking interruptions, etc.

  • Restart after a single member crash: A single member is restarted after a crash.
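
The following is a minimal sketch of the two planned-shutdown approaches described in the first bullet above, assuming a running HazelcastInstance named instance:

// Option 1: shut down the whole cluster in one call. Members switch to PASSIVE
// internally, and after restart the cluster returns to its previous state.
instance.getCluster().shutdown();

// Option 2 (alternative): change the state explicitly, then shut members down
// one by one, for example as part of a rolling restart.
instance.getCluster().changeClusterState(ClusterState.PASSIVE);
instance.shutdown();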

Cluster-Wide Restart Process

During the restart process, each member waits to load data until all the members in the partition table are started. During this process, no operations are allowed. Once all cluster members are started, Hazelcast changes the cluster state to PASSIVE and starts to load data. When all data is loaded, Hazelcast changes the cluster state to its previous known state before shutdown and starts to accept the operations which are allowed by the restored cluster state.

If a member fails to either start, join the cluster in time (within the timeout), or load its data, then that member is terminated immediately. After the problems causing the failure are fixed, that member can be restarted. If the cluster start cannot be completed in time, then all members fail to start. See the Configuring Persistence section for defining timeouts.

In the case of a restart after a cluster crash, the Persistence feature detects that it was not a clean shutdown and Hazelcast tries to restart the cluster with the last saved data, following the process explained above. In some cases, specifically when the cluster crashes during an ongoing partition migration, it is currently not possible to restore the last saved state.

Recovering a Single Member after a Restart

When a single member restarts, its persisted data may become stale compared to the rest of the active cluster. To handle stale data, you have the following options:

  • Stop the rest of the cluster from migrating data until a certain amount of time has passed by setting the hazelcast.partition.rebalance.after.member.left.delay.seconds cluster property.

    Do not use this option if your cluster also stores in-memory data. This option stops the cluster from migrating in-memory data. As a result any data that is not persisted will be lost if the member restarts, including backups.
  • Enable the member to synchronize its persisted map data faster while also allowing the cluster to continue migrations as usual.

    Support for JCache data will be available in the full 5.0 release.

Delaying Migrations

If you have lots of persisted data and you are concerned about how long it may take for your cluster to migrate data after a member crashes, you can configure a rebalance delay to stop members from migrating persisted data too soon.

Do not use this option if your cluster also stores in-memory data. This option stops the cluster from migrating in-memory data. As a result any data that is not persisted will be lost if the member restarts, including backups.
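
As a minimal sketch, the property can be set programmatically before the member starts; the 600-second value below is only illustrative:

Config config = new Config();
// Keep the partitions owned by a crashed member unassigned for 10 minutes,
// giving the member time to restart and reload its persisted data.
config.setProperty("hazelcast.partition.rebalance.after.member.left.delay.seconds", "600");
HazelcastInstance instance = Hazelcast.newHazelcastInstance(config);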

Assume the following:

  • A cluster consisting of members A, B and C with Persistence enabled.

  • Member B is killed.

  • Member B restarts.

If member B restarts within the timeframe configured in the hazelcast.partition.rebalance.after.member.left.delay.seconds cluster property, all its persisted data will be restored from disk and no migrations will take place.

While the member is down, operations on data owned by that member are retried until they either time out or the member restarts and executes the requested operation. As a result, this option is best if you prefer a latency spike over migrating data across the network.

If a member restarts after the timeframe configured in the hazelcast.partition.rebalance.after.member.left.delay.seconds cluster property, the cluster recovers member B’s data from backups and redistributes the data among the remaining members (members A and C in this case). When member B is later restarted, it recovers its potentially stale persisted data from disk and brings it up-to-date with data from members A and C. If Merkle trees are enabled on any maps, migrations use those to request only missing persisted data. For details about how members use Merkle trees, see Synchronizing Persisted Data Faster.

Synchronizing Persisted Data Faster

If you have lots of persisted map entries as well as in-memory data that you don’t want to lose, you can configure your maps with Persistence and a Merkle tree.

The Merkle tree stores the state of persisted data in a way that other cluster members can quickly read and check what is missing. This way, after a restart, the member can send the Merkle tree to the cluster and request only the missing persisted data, reducing the amount of data sent over the network.
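
The following is a minimal programmatic sketch of enabling both Persistence and a Merkle tree on a map; the map name and depth value are illustrative:

Config config = new Config();
config.getPersistenceConfig().setEnabled(true);

MapConfig mapConfig = config.getMapConfig("test-map");
// Persist the map entries to disk.
mapConfig.getDataPersistenceConfig().setEnabled(true);
// Maintain a Merkle tree so a restarted member can request only the missing entries.
mapConfig.getMerkleTreeConfig().setEnabled(true).setDepth(12);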

The way in which a cluster handles a single member failure depends on whether the cluster detects that the member crashed:

  • If the cluster detects that a member has crashed and is restarting, the cluster’s master will allow the member to send its partition table for validation. If the cluster master validates that the partition table was current at the time the member crashed, the member continues to load its persisted data from disk.

  • If the cluster cannot determine if the member crashed and automatic removal of stale data (auto-remove-stale-data) is enabled, the crashed member automatically deletes its Persistence directory inside the base directory (base-dir) and starts as a fresh, empty member. The cluster assigns some partitions to it, unrelated to the partitions it owned before going down.

  • Otherwise, the crashed member aborts the initialization and shuts down. To be able to join the cluster, the Persistence directory previously used by the crashed member must be deleted manually. You can do so using a force start.

Force Start

A member can crash permanently and then be unable to recover from the failure. In that case, the restart process cannot be completed since some of the members either do not start or fail to load their own data. You can then force the cluster to clean its persisted data and make a fresh start. This process is called force start.

Assume the following scenario, which is a valid case for using force start:

  • You have a cluster consisting of members A and B which is initially stable.

  • Cluster transitions into FROZEN or PASSIVE state.

  • Cluster gracefully shuts down.

  • Member A restarts, but member B does not.

  • Member A uses its Persistence data to initiate the Persistence procedure.

  • Since it knows the cluster originally contained member B as well, it waits for it to join.

  • This never happens.

  • Now you have the choice to Force Start the cluster without member B.

  • Cluster discards all Persistence data and starts empty.

You can trigger the force start process using the Management Center, REST API and cluster management scripts.

Please note that force start is a destructive process which results in the deletion of all persisted data.

See the Persistence functionality of the Management Center section to learn how to perform a force start using the Management Center.

Partial Start

When one or more members fail to start, have incorrect (stale or corrupted) Persistence data, or fail to load their Persistence data, the cluster becomes incomplete and the restart mechanism cannot proceed. One solution is to use force start and make a fresh start with the existing members. Another solution is to perform a partial start.

Partial start means that the cluster starts with an incomplete member set. Data belonging to those missing members is assumed lost and Hazelcast tries to recover the missing data using the restored backups. For example, if you have at least two backups configured for all maps and caches, then a partial start with up to two missing members is safe against data loss. If there are more than two missing members, or there are maps/caches with fewer than two backups, then data loss is expected.

Partial start is controlled by the cluster-data-recovery-policy configuration parameter and is not allowed by default. To enable partial start, set this parameter to either PARTIAL_RECOVERY_MOST_RECENT or PARTIAL_RECOVERY_MOST_COMPLETE. See the Configuring Persistence section for details.
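
For illustration, a minimal programmatic sketch of enabling one of these policies, assuming an existing Config instance named config:

PersistenceConfig persistenceConfig = config.getPersistenceConfig()
        .setEnabled(true)
        // Allow the cluster to start without members that fail to restore their data.
        .setClusterDataRecoveryPolicy(
                PersistenceClusterDataRecoveryPolicy.PARTIAL_RECOVERY_MOST_RECENT);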

When partial start is enabled, Hazelcast can perform the partial start automatically or manually if some members are unable to restart successfully. Partial start proceeds automatically when some members fail to start and join the cluster within validation-timeout-seconds. After the validation-timeout-seconds duration has passed, Persistence performs a partial start with the members present in the cluster. A partial start can also be requested manually, before the validation-timeout-seconds duration passes, using the Management Center, REST API or cluster management scripts.

The other situation that can lead to a partial start is a failure during the data load phase. When Hazelcast learns the data load result of all members which have passed the validation step, it automatically performs a partial start with the ones which have successfully restored their Persistence data. Note that a partial start does not require every member to succeed in the data load step. It completes the process when it learns the data load result of every member and at least one member has successfully restored its Persistence data. Similarly, if it cannot learn the data load result of all members within the data-load-timeout-seconds duration, it proceeds with the ones which have already completed the data load process.

The members that perform the partial start are selected from the live members according to the cluster-data-recovery-policy configuration. Members which are not selected by the cluster-data-recovery-policy are called excluded members, and they are instructed to perform a force start. Excluded members are allowed to join the cluster only after they clean their Persistence data and make a completely fresh start. This is a fully automatic process: for instance, if you start the missing members after the partial start is completed, they clean their Persistence data and join the cluster.

Please note that partial start is a destructive process. Once it is completed, it cannot be repeated with a new configuration. For this reason, you may need to perform the partial start process manually. The automatic behavior of partial start relies on the validation-timeout-seconds and data-load-timeout-seconds configuration values. If you need to control the process manually, set validation-timeout-seconds and data-load-timeout-seconds to very large values so that Hazelcast cannot make progress on the timeouts automatically. Then, the overall process can be managed manually via the aforementioned methods, i.e., the Management Center, REST API and cluster management scripts.

Configuring Persistence

You can configure the Persistence feature programmatically or declaratively. There are two steps of configuration:

  1. Enabling and configuring the Persistence feature globally in your Hazelcast configuration: This is done using the configuration element persistence. See the Global Persistence Configuration section below.

  2. Enabling and configuring the Hazelcast data structures to use the Persistence feature: This is done using the configuration element persistence. See the Per Data Structure Persistence Configuration section below.

Global Persistence Configuration

This is where you configure the Persistence feature itself using the persistence element. The following are the descriptions of its attribute and sub-elements:

  • enabled: Attribute of the persistence element which specifies whether the feature is globally enabled in your Hazelcast configuration. Set this attribute to true if you want any of your data structures to use the Persistence feature.

  • base-dir: Specifies the parent directory where the Persistence data is stored. The default value for base-dir is persistence. You can use the default value, or you can specify another directory, but it is mandatory that the base-dir element has a value. This directory is created automatically if it does not exist.

    base-dir is used as the parent directory, and a unique Persistence directory is created inside base-dir for each Hazelcast member which uses the same base-dir. That means, base-dir can be shared among multiple Hazelcast members safely. This is especially useful for cloud environments where the members generally use a shared filesystem.

    When a Hazelcast member starts, it tries to acquire the ownership of the first available Persistence directory inside the base-dir. If base-dir is empty or if the starting member fails to acquire the ownership of any directory (happens when all the directories are already acquired by other Hazelcast members), then it creates its own fresh directory.

    In previous versions, base-dir was used by only a single Hazelcast member. If such an existing base-dir is configured for a Hazelcast member, Persistence starts in legacy mode and base-dir is used by only that single member, without creating a unique sub-directory. Other members trying to use that base-dir fail during startup.
  • backup-dir: Specifies the directory under which Persistence snapshots (Hot Backups) are stored. See the Hot Backup section for more information.

  • parallelism: Level of parallelism in Persistence. There are this many I/O threads, each writing in parallel to its own files. During the Persistence procedure, this many I/O threads read the files and this many rebuilder threads rebuild the Persistence metadata. The default value for this property is 1, which is a good default in most but not all cases. You should measure the raw I/O throughput of your infrastructure and test with different values of parallelism. In some cases, such as dedicated hardware, higher parallelism can yield higher Persistence throughput. In other cases, such as running on EC2, it can yield diminishing returns: more thread scheduling, more contention on I/O and less efficient garbage collection.

  • validation-timeout-seconds: Validation timeout for the Persistence process when validating the cluster members expected to join and the partition table on the whole cluster.

  • data-load-timeout-seconds: Data load timeout for the Persistence process. All members in the cluster should finish restoring their local data before this timeout.

  • cluster-data-recovery-policy: Specifies the data recovery policy that is respected during the Persistence cluster start. Valid values are:

    • FULL_RECOVERY_ONLY: Starts the cluster only when all expected members are present and correct. Otherwise, it fails. This is the default value.

    • PARTIAL_RECOVERY_MOST_RECENT: Starts the cluster with the members which have most up-to-date partition table and successfully restored their data. All other members leave the cluster and force start themselves. If no member restores its data successfully, cluster start fails.

    • PARTIAL_RECOVERY_MOST_COMPLETE: Starts the cluster with the largest group of members which have the same partition table version and successfully restored their data. All other members leave the cluster and force start themselves. If no member restores its data successfully, cluster start fails.

  • auto-remove-stale-data: Enables automatic removal of stale Persistence data. When a member terminates or crashes while the cluster state is ACTIVE, the remaining members redistribute the data among themselves and the data persisted on the terminated member’s storage becomes stale. That member cannot rejoin the cluster without removing its Persistence data. When auto-removal of stale Persistence data is enabled, the stale data is automatically removed when that member restarts, and the member joins the cluster as a completely new member. Otherwise, the Persistence data must be removed manually.

  • encryption-at-rest: Configures encryption on the Persistence data level. See the Encryption at Rest section for more information.

Per Data Structure Persistence Configuration

This is where you configure the data structures of your choice so that they can use the Persistence feature. This is done using the persistence configuration element. As explained in the introduction, the Persistence feature is currently supported by the Hazelcast map data structure and the JCache implementation (map and cache), each of which has the persistence configuration element. The following are the descriptions of this element’s attribute and sub-element:

  • enabled: Attribute of the persistence element which specifies whether the Persistence feature is enabled for the related data structure. Its default value is false.

  • fsync: Turning on fsync guarantees that data is persisted to the disk device when a write operation returns a successful response to the caller. By default, fsync is turned off (false), which means data is persisted to the disk device eventually, instead of on every disk write. This generally provides better performance.

As well as these options, maps also have the following elements:

  • merkle-tree

    • depth: The depth of the Merkle tree. The deeper the tree, the more accurate the difference detection and the more space is needed to store the Merkle tree in memory.

Persistence Configuration Examples

The following are example configurations for a Hazelcast map and JCache implementation.

Declarative Configuration:

An example configuration is shown below.

  • XML

  • YAML

<hazelcast>
    ...
    <persistence enabled="true">
        <base-dir>/mnt/persistence</base-dir>
        <backup-dir>/mnt/hot-backup</backup-dir>
        <validation-timeout-seconds>120</validation-timeout-seconds>
        <data-load-timeout-seconds>900</data-load-timeout-seconds>
        <cluster-data-recovery-policy>FULL_RECOVERY_ONLY</cluster-data-recovery-policy>
    </persistence>
    ...
    <map name="test-map">
        <merkle-tree enabled="true" >
            <depth>12</depth>
        </merkle-tree>
        <persistence enabled="true">
            <fsync>false</fsync>
        </persistence>
    </map>
    ...
    <cache name="test-cache">
        <persistence enabled="true">
            <fsync>false</fsync>
        </persistence>
    </cache>
    ...
</hazelcast>
hazelcast:
  persistence:
    enabled: true
    base-dir: /mnt/persistence
    backup-dir: /mnt/hot-backup
    validation-timeout-seconds: 120
    data-load-timeout-seconds: 900
    cluster-data-recovery-policy: FULL_RECOVERY_ONLY
  map:
    test-map:
      merkle-tree:
        enabled: true
        depth: 12
      persistence:
        enabled: true
        fsync: false
  cache:
    test-cache:
      persistence:
        enabled: true
        fsync: false

Programmatic Configuration:

The programmatic equivalent of the above declarative configuration is shown below.

        Config config = new Config();
        PersistenceConfig persistenceConfig = new PersistenceConfig()
                .setEnabled(true)
                .setBaseDir(new File("/mnt/persistence"))
                .setBackupDir(new File("/mnt/hot-backup"))
                .setParallelism(1)
                .setValidationTimeoutSeconds(120)
                .setDataLoadTimeoutSeconds(900)
                .setClusterDataRecoveryPolicy(PersistenceClusterDataRecoveryPolicy.FULL_RECOVERY_ONLY)
                .setAutoRemoveStaleData(true);
        config.setPersistenceConfig(persistenceConfig);

        MapConfig mapConfig = config.getMapConfig("test-map");
        mapConfig.getMerkleTreeConfig().setEnabled(true).setDepth(12);
        mapConfig.getDataPersistenceConfig().setEnabled(true);

        CacheSimpleConfig cacheConfig = config.getCacheConfig("test-cache");
        cacheConfig.getDataPersistenceConfig().setEnabled(true);

Configuring Persistence Store on Intel® Optane™ DC Persistent Memory

Hazelcast can be configured to use Intel® Optane™ DC Persistent Memory as the Persistence directory. For this, you need to perform the following steps:

  1. Configure the Persistent Memory as a File System

  2. Configure the Persistence Store to Use Persistent Memory

Using persistent memory can drastically improve Persistence times. The steps are described in detail in the following sections.

Configuring the Persistent Memory as a File System

If the persistent memory DIMMs (dual in-line memory modules) are already configured and mounted as a file system, you can skip the instructions given in this section and directly go to the next section.

The persistent memory DIMMs can operate in two modes: Memory Mode or App Direct. See here for their descriptions. To use the DIMMs with Persistence, they should be configured in App Direct mode so that you can mount them as a file system.

The following configuration tools must be installed on your system:

  • ipmctl

  • ndctl

The following are the steps:

  1. First, check the current setup of the system:

    [root@localhost builder]# ipmctl show -socket
    
     SocketID | MappedMemoryLimit | TotalMappedMemory
    ==================================================
     0x0000   | 4096.0 GiB        | 95.0 GiB
     0x0001   | 4096.0 GiB        | 852.0 GiB

    The output shown above provides the CPU sockets of the system. You can print the DIMMs of each socket by using its ID, as shown below.

    [root@localhost builder]# ipmctl show -dimm -socket 0x0000
    
     DimmID | Capacity  | HealthState | ActionRequired | LockState | FWVersion
    ==============================================================================
     0x0011 | 126.4 GiB | Healthy     | 0              | Disabled  | 01.00.00.4877
     0x0021 | 126.4 GiB | Healthy     | 0              | Disabled  | 01.00.00.4877
     0x0001 | 126.4 GiB | Healthy     | 0              | Disabled  | 01.00.00.4877
     0x0111 | 126.4 GiB | Healthy     | 0              | Disabled  | 01.00.00.4877
     0x0121 | 126.4 GiB | Healthy     | 0              | Disabled  | 01.00.00.4877
     0x0101 | 126.4 GiB | Healthy     | 0              | Disabled  | 01.00.00.4877

    You can also see the current configuration of the system, as shown below:

    [root@localhost builder]# ipmctl show -region
    
     SocketID | ISetID             | PersistentMemoryType | Capacity  | FreeCapacity | HealthState
    ===============================================================================================
     0x0001   | 0xb5b67f48a7c32ccc | AppDirect            | 756.0 GiB | 0.0 GiB      | Healthy

    The above example output shows that the DIMMs of the socket with the SocketID 0x0000 are not in use. So, let’s configure 0x0000 for Persistence following the steps below.

  2. Use the following command for the socket 0x0000:

    [root@localhost builder]# ipmctl create -goal -socket 0x0000 PersistentMemoryType=AppDirect
    
    The following configuration will be applied:
     SocketID | DimmID | MemorySize | AppDirect1Size | AppDirect2Size
    ==================================================================
     0x0000   | 0x0011 | 0.0 GiB    | 126.0 GiB      | 0.0 GiB
     0x0000   | 0x0021 | 0.0 GiB    | 126.0 GiB      | 0.0 GiB
     0x0000   | 0x0001 | 0.0 GiB    | 126.0 GiB      | 0.0 GiB
     0x0000   | 0x0111 | 0.0 GiB    | 126.0 GiB      | 0.0 GiB
     0x0000   | 0x0121 | 0.0 GiB    | 126.0 GiB      | 0.0 GiB
     0x0000   | 0x0101 | 0.0 GiB    | 126.0 GiB      | 0.0 GiB
    Do you want to continue? [y/n] y
    
    Created following region configuration goal
     SocketID | DimmID | MemorySize | AppDirect1Size | AppDirect2Size
    ==================================================================
     0x0000   | 0x0011 | 0.0 GiB    | 126.0 GiB      | 0.0 GiB
     0x0000   | 0x0021 | 0.0 GiB    | 126.0 GiB      | 0.0 GiB
     0x0000   | 0x0001 | 0.0 GiB    | 126.0 GiB      | 0.0 GiB
     0x0000   | 0x0111 | 0.0 GiB    | 126.0 GiB      | 0.0 GiB
     0x0000   | 0x0121 | 0.0 GiB    | 126.0 GiB      | 0.0 GiB
     0x0000   | 0x0101 | 0.0 GiB    | 126.0 GiB      | 0.0 GiB
    A reboot is required to process new memory allocation goals.
  3. Reboot your system. After the reboot, check the regions and namespaces in the system as shown below:

    [root@localhost builder]# ndctl list --regions --human -N
    [
      {
        "dev":"region1",
        "size":"756.00 GiB (811.75 GB)",
        "available_size":0,
        "max_available_extent":0,
        "type":"pmem",
        "iset_id":"0xb5b67f48a7c32ccc",
        "persistence_domain":"memory_controller",
        "namespaces":[
          {
            "dev":"namespace1.0",
            "mode":"fsdax",
            "map":"dev",
            "size":"744.19 GiB (799.06 GB)",
            "uuid":"65121d0e-a8a0-40f1-aed5-8a8ada13b6c7",
            "blockdev":"pmem1"
          }
        ]
      },
      {
        "dev":"region0",
        "size":"756.00 GiB (811.75 GB)",
        "available_size":"756.00 GiB (811.75 GB)",
        "max_available_extent":"756.00 GiB (811.75 GB)",
        "type":"pmem",
        "iset_id":"0x63f47f485dd02ccc",
        "persistence_domain":"memory_controller"
      }
    ]

    You can see “region0” has been created with the DIMMs of the socket (ID = 0x0000) in the above output.

  4. Now, create a namespace for “region0” as shown below:

    [root@localhost builder]# ndctl create-namespace --mode fsdax --region region0
    {
      "dev":"namespace0.0",
      "mode":"fsdax",
      "map":"dev",
      "size":"744.19 GiB (799.06 GB)",
      "uuid":"87449768-1cc7-4c1b-b138-ea79bc4ee68e",
      "raw_uuid":"6756ef99-744f-4467-90f7-591c0ae162ec",
      "sector_size":512,
      "blockdev":"pmem0",
      "numa_node":0
    }
  5. You should be able to see the device as shown below:

    [root@localhost builder]# ll /dev/pmem0
    brw-rw----. 1 root disk 259, 0 Mar 4 02:35 /dev/pmem0
  6. Format the partition with ext4 file system using the following command:

    [root@localhost builder]# mkfs.ext4 /dev/pmem0
  7. Create a mount point and mount the new filesystem to that mount point using the following commands:

    [root@localhost builder]# mkdir /mnt/pmem0
    [root@localhost builder]# mount -o dax /dev/pmem0 /mnt/pmem0
Configuring the Persistence Store to Use Persistent Memory

After you have completed the steps explained in the previous section, you can create a directory under /mnt/pmem0 and configure Hazelcast to use it as the Persistence directory. See the Configuring Persistence section for how to do this.

As an example, let’s create a directory named persistence under /mnt/pmem0:

[root@localhost builder]# mkdir /mnt/pmem0/persistence

To use this as the Persistence directory, the configuration should look as follows:

<persistence enabled="true">
    <base-dir>/mnt/pmem0/persistence</base-dir>
    <parallelism>12</parallelism>
</persistence>

You can set parallelism to 8 or 12 for the best performance.

Moving/Copying Persistence Data

After the Hazelcast member owning the Persistence data is shut down, the Persistence base-dir can be copied/moved to a different server (which may have a different IP address and/or a different number of CPU cores), and the Hazelcast member can be restarted using the existing Persistence data on that new server. Having a new IP address does not affect Persistence, since it does not rely on the IP address of the server but instead uses the member UUID as a unique identifier.

This flexibility provides the following abilities:

  • Replacing one or more faulty servers with new ones easily, without touching the rest of the cluster.

  • Using Persistence in cloud environments easily. Cloud providers sometimes do not preserve IP addresses on restart or after shutdown. It is also possible to start up the whole cluster on a different set of machines.

  • Copying production data to the test environment, so that a more functional test cluster can be set up.

Unfortunately, having a different number of CPU cores is not as straightforward. By default, the number of Hazelcast partition threads is derived from the number of CPU cores, e.g., number of partition threads = number of CPU cores. When a Hazelcast member is started on a server with a different CPU core count, the number of Hazelcast partition threads changes and that makes Persistence fail during startup. The solution is to explicitly set the number of Hazelcast partition threads (the hazelcast.operation.thread.count system property) and the Persistence parallelism configuration, and to use the same values on the new server. For setting system properties, see the System Properties appendix.
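
The following is a minimal sketch of pinning both values so that the persisted data can be moved between servers with different core counts; the values 16 and 2 are illustrative:

Config config = new Config();
// Fix the partition thread count instead of deriving it from the CPU core count.
config.setProperty("hazelcast.operation.thread.count", "16");
// Use the same Persistence parallelism on the old and the new server.
config.getPersistenceConfig().setEnabled(true).setParallelism(2);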

Persistence Design Details

Hazelcast’s Persistence uses the log-structured storage approach. The following is a top-level design description:

  • The only kind of update operation on persistent data is appending.

  • What is appended are facts about events that happened to the data model represented by the store; either a new value was assigned to a key or a key was removed.

  • Each record associated with a key makes stale the previous record that was associated with that key.

  • Stale records contribute to the amount of garbage present in the persistent storage.

  • Measures are taken to remove garbage from the storage.

This kind of design focuses almost all the system’s complexity into the garbage collection (GC) process, stripping down the client’s operation to the bare necessity of guaranteeing persistent behavior: a simple file append operation. Consequently, the latency of operations is close to the theoretical minimum in almost all cases. Complications arise only during prolonged periods of maximum load; this is where the details of the GC process begin to matter.

Concurrent, Incremental, Generational GC

In order to maintain the lowest possible footprint in the update operation latency, the following properties are built into the garbage collection process:

  • A dedicated thread performs the GC. In Hazelcast terms, this thread is called the Collector and the application thread is called the Mutator.

  • On each update there is metadata to be maintained; this is done asynchronously by the Collector thread. The Mutator enqueues update events to the Collector’s work queue.

  • The Collector keeps draining its work queue at all times, including the time it goes through the GC cycle. Updates are taken into account at each stage in the GC cycle, preventing the copying of already dead records into compacted files.

  • All GC-induced I/O competes for the same resources as the Mutator’s update operations. Therefore, measures are taken to minimize the impact of I/O done during GC:

    • data is never read from files, but from RAM

    • a heuristic scheme is employed which minimizes the number of bytes written to the disk for each kilobyte of the reclaimed garbage

    • measures are also taken to achieve a good interleaving of Collector and Mutator operations, minimizing latency outliers perceived by the Mutator

I/O Minimization Scheme

The success of this scheme is subject to a bet on the Weak Generational Garbage Hypothesis, which states that a new record entering the system is likely to become garbage soon. In other words, a key updated now is more likely than average to be updated again soon.

The scheme was taken from the seminal Sprite LFS paper, Rosenblum, Ousterhout, The Design and Implementation of a Log-Structured File System. The following is an outline of the paper:

  • Data is not written to one huge file, but to many files of moderate size (8 MB) called "chunks".

  • Garbage is collected incrementally, i.e. by choosing several chunks, then copying all their live data to new chunks, then deleting the old ones.

  • I/O is minimized using a collection technique which results in a bimodal distribution of chunks with respect to their garbage content: most files are either almost all live data or they are all garbage.

  • The technique consists of two main principles:

    • Chunks are selected based on their Cost-Benefit factor (see below).

    • Records are sorted by age before copying to new chunks.

Cost-Benefit Factor

The Cost-Benefit factor of a chunk consists of two components multiplied together:

  1. The ratio of benefit (amount of garbage that can be collected) to I/O cost (amount of live data to be written).

  2. The age of the data in the chunk, measured as the age of the youngest record it contains.

The essence is in the second component: given an equal amount of garbage in all chunks, it makes the young ones less attractive to the Collector. Assuming the generational garbage hypothesis, this allows the young chunks to quickly accumulate more garbage. On the flip side, it also ensures that even files with little garbage are eventually garbage collected. This removes garbage which would otherwise linger on, thinly spread across many chunk files.
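
As a rough illustration of how these two components combine (not the exact internal formula), the factor can be sketched as follows; the names are purely illustrative:

// Illustrative only: the benefit/cost ratio (reclaimable garbage vs. live data
// that must be rewritten) weighted by the age of the chunk's youngest record.
static double costBenefitFactor(long garbageBytes, long liveBytes, long youngestRecordAge) {
    return ((double) garbageBytes / liveBytes) * youngestRecordAge;
}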

Sorting records by age groups the young records together in a single chunk and does the same for older records. Therefore chunks either tend to keep their data live for a longer time, or quickly become full of garbage.

Persistence Performance Considerations

In this section you can find performance test summaries which are results of benchmark tests performed with a single Hazelcast member running on a physical server and on AWS R3.

Performance on a Physical Server

We have tested a member which has an IMap with High-Density Data Store. The data size is changed for each test, ranging from 10 GB to 500 GB (each map entry has a value of 1 KB).

The tests investigate the write and read performance of Persistence and are performed on HP ProLiant servers with RHEL 7 operating system using Hazelcast Simulator.

The following are the specifications of the server hardware used for the test:

  • CPU: 2x Intel® Xeon® CPU E5-2687W v3 @ 3.10GHz – with 10 cores per processor. Total 20 cores, 40 with hyper threading enabled.

  • Memory: 768GB 2133 MHz memory 24x HP 32GB 4Rx4 PC4-2133P-L Kit

The following are the storage media used for the test:

  • A hot-pluggable 2.5 inch HDD with 1 TB capacity and 10K RPM.

  • An SSD, Light Endurance PCIe Workload Accelerator.

The below table shows the test results.

[Image: table of Persistence performance test results on the physical server]

Performance on AWS R3

We have tested a member which has an IMap with High-Density Data Store:

  • This map has 40 million distinct keys, each map entry is 1 KB.

  • High-Density Memory Store is 59 GiB, of which 19% is metadata.

  • Persistence is configured with fsync turned off.

  • Data size reloaded on restart is 38 GB.

The tests investigate the write and read performance of Persistence and are performed on R3.2xlarge and R3.4xlarge EC2 instances using Hazelcast Simulator.

The following are the AWS storage types used for the test:

  • Elastic Block Storage (EBS) General Purpose SSD (GP2)

  • Elastic Block Storage with Provisioned IOPS (IO1) (Provisioned 10,000 IOPS on a 340 GiB volume, enabled EBS-optimized on instance)

  • SSD-backed instance store

The below table shows the test results.

[Image: table of Persistence performance test results on AWS R3]

Hot Backup

During Persistence operations you can take a snapshot of the Persistence Store at a certain point in time. This is useful when you wish to bring up a new cluster with the same data or parts of the data. The new cluster can then be used to share load with the original cluster, to perform testing, QA or reproduce an issue on production data.

Simple file copying of a currently running cluster does not suffice and can produce inconsistent snapshots with problems such as resurrection of deleted values or missing values.

Configuring Hot Backup

To create snapshots you must first configure the Persistence backup directory. You can configure the directory programmatically or declaratively using the following configuration element:

  • backup-dir: This element is included in the persistence element and denotes the destination under which backups are stored. If this element is not defined, hot backup is disabled. If a directory is defined which does not exist, it is created on the first backup. To avoid clashing data on multiple backups, each backup has a unique sequence ID which determines the name of the directory that contains all the Persistence data. This unique directory is created as a subdirectory of the configured backup-dir.

The following are the example configurations for Hot backup.

Declarative Configuration:

An example configuration is shown below.

  • XML

  • YAML

<hazelcast>
    ...
    <persistence enabled="true">
        <backup-dir>/mnt/hot-backup</backup-dir>
	...
    </persistence>
    ...
</hazelcast>
hazelcast:
  persistence:
    enabled: true
    backup-dir: /mnt/hot-backup

Programmatic Configuration:

The programmatic equivalent of the above declarative configuration is shown below.

PersistenceConfig persistenceConfig = new PersistenceConfig();
persistenceConfig.setBackupDir(new File("/mnt/hot-backup"));
...
config.setPersistenceConfig(persistenceConfig);

Using Hot Backup

Once configured, you can initiate a new backup via the API or from the Management Center. The backup is started transactionally and cluster-wide. This means that either all or none of the members start the same backup. The member which receives the backup request determines a new backup sequence ID and sends that information to all members. If all members respond that no other backup is currently in progress and that no other backup request has already been made, then the coordinating member commands the other members to start the actual backup process. Each member creates a directory under the configured backup-dir with the name backup-<backupSeq> and starts copying the data from the original store.

The backup process is initiated nearly instantaneously on all members. Note that since there is no limitation as to when the backup process is initiated, it may be initiated during membership changes, partition table changes or normal data updates. Some of these operations may not have completed fully yet, which means that some members will back up some data while other members will back up a previous version of the same data. This is usually resolved by the anti-entropy mechanism on the new cluster, which reconciles different versions of the same data. Please check the Achieving High Consistency of Backup Data section for more information.

The duration of the backup process and the disk space usage depend heavily on what is supported by the system and on the configuration. Please check the Achieving High Performance of Backup Process section for more information on achieving better resource usage of the backup process.

Following is an example of how to trigger the Hot Backup via API:

HotRestartService service = instance.getCluster().getHotRestartService();
service.backup();

The backupSeq is generated by the hot backup process, but you can define your own backup sequences as shown below:

HotRestartService service = instance.getCluster().getHotRestartService();
long backupSeq = ...
service.backup(backupSeq);

Keep in mind that the backup fails if any member contains a backup directory with the name backup-<backupSeq>, where backupSeq is the given sequence.

Starting the Cluster From a Hot Backup

As mentioned in the previous section, hot backup process creates subdirectories named backup-<backupSeq> under the configured hot backup directory (backup-dir). When starting your cluster with data from a hot backup, you need to set the base directory (base-dir) to the desired backup subdirectory.

For example:

  • XML

  • YAML

<hazelcast>
    ...
    <persistence enabled="true">
        <backup-dir>/mnt/hot-backup</backup-dir>
	...
    </persistence>
    ...
</hazelcast>
hazelcast:
  persistence:
    enabled: true
    backup-dir: /mnt/hot-backup

To start a cluster with data from a backup in a subdirectory named /mnt/hot-backup/backup-2018Oct24, here is the configuration you should have for the base-dir:

  • XML

  • YAML

<hazelcast>
    ...
    <persistence enabled="true">
        <base-dir>/mnt/hot-backup/backup-2018Oct24</base-dir>
        <parallelism>1</parallelism>
    </persistence>
    ...
    <map name="test-map">
        <persistence enabled="true">
            <fsync>false</fsync>
        </persistence>
    </map>
    ...
</hazelcast>
hazelcast:
  persistence:
    enabled: true
    base-dir: /mnt/hot-backup/backup-2018Oct24
    parallelism: 1
  map:
    test-map:
      persistence:
        enabled: true
        fsync: false

Achieving High Consistency of Backup Data

The backup is initiated nearly simultaneously on all members, but you can still encounter some inconsistencies in the data. This is because some members might have received updated values from executed operations while others have not yet, because the system could be undergoing partition and membership changes, or because some transactions have not yet been committed.

To achieve high consistency of data on all members, the cluster should be put into the PASSIVE state for the duration of the call to the backup method. See the Cluster Member States section for information on how to do this. The cluster does not need to stay in the PASSIVE state for the entire duration of the backup process, though. Because of the design, only partition metadata is copied synchronously during the invocation of the backup method. Once the backup method has returned, all cluster metadata is copied and the exact partition data which needs to be copied is marked. After that, the backup process continues asynchronously and you can return the cluster to the ACTIVE state and resume operations.
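
The following is a minimal sketch of this sequence via the API, assuming a running HazelcastInstance named instance:

Cluster cluster = instance.getCluster();
HotRestartService service = cluster.getHotRestartService();

// Pause mutating operations only for the synchronous part of the backup.
cluster.changeClusterState(ClusterState.PASSIVE);
service.backup();
// Partition data is copied asynchronously, so the cluster can resume operations.
cluster.changeClusterState(ClusterState.ACTIVE);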

Achieving High Performance of Backup Process

Because of the design of the Persistence Store, hard links can be used to create backups/snapshots of the store. The hot backup process uses hard links whenever possible because they provide big performance benefits and because the backups share disk usage.

The performance benefit comes from the fact that Persistence file contents are not being duplicated (thus using disk and I/O resources) but rather a new file name is created for the same contents on disk (another pointer to the same inode). Since all backups and stores share the same inode, disk usage drops.

The bigger the percentage of stable data in the Persistence Store (data not undergoing changes), the more files each backup shares with the operational Persistence Store and the less disk space it uses. For the hot backup to use hard links, you must be running Hazelcast members on JDK 7 or higher and must satisfy all requirements for the Files.createLink() method to be supported.

The backup process initially attempts to create a new hard link and if that fails for any reason it continues by copying the data. Subsequent backups also attempt to use hard links.

Backup Process Progress and Completion

Only cluster and distributed object metadata is copied synchronously during the invocation of the backup method. The rest of the Persistence Store containing partition data is copied asynchronously after the method call has ended. You can track the progress by API or view it from the Management Center.

An example of how to track the progress via API is shown below:

HotRestartService service = instance.getCluster().getHotRestartService();
BackupTaskStatus status = service.getBackupTaskStatus();
...

The returned object contains the local member’s backup status:

  • the backup state (NOT_STARTED, IN_PROGRESS, FAILURE, SUCCESS)

  • the completed count

  • the total count

The completed and total count can provide you with a way to track the percentage of copied data. Currently, the counts correspond to the number of copied and total local member Persistence Stores (defined by PersistenceConfig.setParallelism()), but this can change at a later point to provide greater resolution.
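
For example, a rough progress percentage can be derived from the returned counts; the getCompleted() and getTotal() accessor names below are assumptions for this sketch:

HotRestartService service = instance.getCluster().getHotRestartService();
BackupTaskStatus status = service.getBackupTaskStatus();
// Completed vs. total local Persistence Stores, expressed as a percentage.
double progress = status.getTotal() == 0
        ? 0.0
        : 100.0 * status.getCompleted() / status.getTotal();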

Besides tracking the Persistence status via the API, you can view the status in the Management Center and you can inspect the on-disk files for each member. Each member creates an inprogress file in each of the copied Persistence Stores, which means that the backup is currently in progress. When the backup task completes the backup operation, this file is removed. If an error occurs during the backup task, the inprogress file is renamed to failure and contains a stack trace of the exception.

Backup Task Interruption and Cancellation

Once the backup method call has returned and asynchronous copying of the partition data has started, the backup task can be interrupted. This is helpful in situations where the backup task has started at an inconvenient time. For instance, the backup task could be automated and accidentally triggered during high load on the Hazelcast instances, causing the performance of the Hazelcast instances to drop.

The backup task mainly uses disk I/O, consumes little CPU and generally does not last long (although you should test it in your environment to determine the exact impact). Nevertheless, you can abort the backup tasks on all members via a cluster-wide interrupt operation. This operation can be triggered programmatically or from the Management Center.

An example of programmatic interruption is shown below:

HotRestartService service = instance.getCluster().getHotRestartService();
service.interruptBackupTask();
...

This method sends an interrupt to all members. The interrupt is ignored if the backup task is currently not in progress so you can safely call this method even though it has previously been called or when some members have already completed their local backup tasks.

You can also interrupt the local member backup task as shown below:

HotRestartService service = instance.getCluster().getHotRestartService();
service.interruptLocalBackupTask();
...

The backup task stops as soon as possible. It does not remove the disk contents of the backup directory, meaning that you must remove them manually.

Encryption at Rest

Records stored in the Persistence Store may contain sensitive information, which may be present in the keys, in the values, or in both. In Persistence terms, Encryption at Rest is concerned with encryption at the chunk file level. Since complete chunk files are encrypted, all data stored in the Persistence Store is protected when Encryption at Rest is enabled.

Data persisted in the Persistence Store is encrypted using symmetric encryption. The implementation is based on Java Cryptography Architecture (JCA). The encryption scheme uses two levels of encryption keys: auto-generated Persistence Store-level encryption keys (one per configured parallelism) that are used to encrypt the chunk files and a master encryption key that is used to encrypt the store-specific encryption keys. The master encryption key is sourced from an external system called Secure Store and, in contrast to the Persistence Store-level encryption keys, it is not persisted anywhere within the Persistence Store.

When Persistence with Encryption at Rest is first enabled on a member, the member contacts the Secure Store during startup and retrieves the master encryption key. Then it generates the Persistence Store-level encryption keys for the parallel Stores and stores them (encrypted using the master key) under the Persistence Store’s directory. Subsequent writes to the Persistence chunk files are encrypted using the Store-level encryption keys. During the restart procedure, the member retrieves the master encryption key from the Secure Store, decrypts the Store-level encryption keys and uses them to decrypt the chunk files.

Master key rotation is supported. If the master encryption key changes in the Secure Store, the Persistence subsystem will detect it and retrieve the new master encryption key. During this process, it will also re-encrypt the Persistence Store-level encryption keys using the new master encryption key.

The Configuring a Secure Store section provides information about the supported Secure Store types.

Configuring Encryption at Rest

Encryption at Rest can be enabled and configured programmatically or declaratively using the encryption-at-rest sub-element of persistence. The encryption-at-rest element has the following attributes and sub-elements:

  • enabled: Attribute that specifies whether Encryption at Rest is enabled; false by default.

  • algorithm: Specifies the symmetric cipher to use (such as AES/CBC/PKCS5Padding).

  • salt: The encryption salt.

  • key-size: The size of the auto-generated Persistence Store-level encryption key.

  • secure-store: Specifies the Secure Store to use for the retrieval of master encryption keys. See the Configuring a Secure Store section for more details.

The following are the example configurations for Encryption at Rest.

Declarative Configuration:

An example configuration is shown below.

  • XML

  • YAML

<hazelcast>
    ...
    <persistence enabled="true">
        ...
        <encryption-at-rest enabled="true">
            <algorithm>AES/CBC/PKCS5Padding</algorithm>
            <salt>thesalt</salt>
            <key-size>128</key-size>
            <secure-store>...</secure-store>
        </encryption-at-rest>
        ...
    </persistence>
    ...
</hazelcast>
hazelcast:
  persistence:
    enabled: true
    encryption-at-rest:
      enabled: true
      algorithm: AES/CBC/PKCS5Padding
      salt: thesalt
      key-size: 128
      secure-store:
         ...

Programmatic Configuration:

The programmatic equivalent of the above declarative configuration is shown below.

        PersistenceConfig persistenceConfig = new PersistenceConfig();
        EncryptionAtRestConfig encryptionAtRestConfig =
                persistenceConfig.getEncryptionAtRestConfig();
        encryptionAtRestConfig.setEnabled(true)
                .setAlgorithm("AES/CBC/PKCS5Padding")
                .setSalt("thesalt")
                .setKeySize(128)
                .setSecureStoreConfig(secureStore());

Configuring a Secure Store

A Secure Store represents a (secure) source of master encryption keys and is required for using Encryption at Rest.

Hazelcast Enterprise provides Secure Store implementations for the Java KeyStore and for HashiCorp Vault.

Java KeyStore Secure Store

The Java KeyStore Secure Store provides integration with the Java KeyStore. It can be configured programmatically or declaratively using the keystore sub-element of secure-store. The keystore element has the following sub-elements:

  • path: The path to the KeyStore file.

  • type: The type of the KeyStore (PKCS12, JCEKS, etc.).

  • password: The KeyStore password.

  • current-key-alias: The alias for the current encryption key entry (optional).

  • polling-interval: The polling interval (in seconds) for checking for changes in the KeyStore. Disabled by default.

Sensitive configuration properties such as password should be protected using encryption replacers.

The Java KeyStore Secure Store treats all KeyStore.SecretKeyEntry entries stored in the KeyStore as encryption keys. It expects that these entries use the same protection password as the KeyStore itself. Entries of other types (private key entries, certificate entries) are ignored. If current-key-alias is set, the corresponding entry is treated as the current encryption key; otherwise, the entry whose alias is highest in alphabetical order is used. The remaining entries represent historical versions of the encryption key.

An example declarative configuration is shown below:

  • XML

  • YAML

<secure-store>
    <keystore>
        <path>/path/to/keystore.file</path>
        <type>PKCS12</type>
        <password>password</password>
        <current-key-alias>current</current-key-alias>
        <polling-interval>60</polling-interval>
    </keystore>
</secure-store>
      secure-store:
        keystore:
          path: /path/to/keystore.file
          type: PKCS12
          password: password
          current-key-alias: current
          polling-interval: 60

The following is an equivalent programmatic configuration:

        JavaKeyStoreSecureStoreConfig keyStoreConfig =
                new JavaKeyStoreSecureStoreConfig(new File("/path/to/keystore.file"))
                        .setType("PKCS12")
                        .setPassword("password")
                        .setCurrentKeyAlias("current")
                        .setPollingInterval(60);

HashiCorp Vault Secure Store

The HashiCorp Vault Secure Store provides integration with HashiCorp Vault. It can be configured programmatically or declaratively using the vault sub-element of secure-store. The vault element has the following sub-elements:

  • address: The address of the Vault server.

  • secret-path: The secret path under which the encryption keys are stored.

  • token: The Vault authentication token.

  • polling-interval: The polling interval (in seconds) for checking for changes in Vault. Disabled by default.

  • ssl: The TLS/SSL configuration for HTTPS support. See the TLS/SSL section for more information about how to use the ssl element.

Sensitive configuration properties such as token should be protected using encryption replacers.

The HashiCorp Vault Secure Store implementation uses the official REST API to integrate with HashiCorp Vault. Only the KV secrets engine is supported. Both KV V1 and KV V2 can be used, but since only V2 provides secrets versioning, it is the recommended option. With KV V1 (no versioning support), only one version of the encryption key can be kept, whereas with KV V2, the HashiCorp Vault Secure Store is also able to retrieve the historical encryption keys. (Note that the size of the version history is configurable on the Vault side.) Having access to the previous encryption keys may be critical to avoid scenarios where the Persistence data becomes undecryptable because the master encryption key is no longer usable (for instance, when the original master encryption key was rotated out in the Secure Store while the cluster was down).

The encryption key is expected to be stored at the specified secret path and represented as a single key/value pair in the following format:

name=Base64-encoded-data

where name can be an arbitrary string. Multiple key/value pairs under the same secret path are not supported. Here is an example of how such a key/value pair can be stored using the HashiCorp Vault command-line client (under the secret path hz/cluster):

vault kv put hz/cluster value=HEzO124Vz...

With KV V2, a second put to the same secret path creates a new version of the encryption key. With KV V1, it simply overwrites the current encryption key, discarding the old value.

An example declarative configuration is shown below:

  • XML

  • YAML

<secure-store>
    <vault>
        <address>http://localhost:1234</address>
        <secret-path>secret/path</secret-path>
        <token>token</token>
        <polling-interval>60</polling-interval>
        <ssl>...</ssl>
    </vault>
</secure-store>
      secure-store:
        vault:
          address: http://localhost:1234
          secret-path: secret/path
          token: token
          polling-interval: 60
          ssl:
            ...

The following is an equivalent programmatic configuration:

        VaultSecureStoreConfig vaultConfig =
                new VaultSecureStoreConfig("http://localhost:1234", "secret/path", "token")
                        .setPollingInterval(60);
        configureSSL(vaultConfig.getSSLConfig());