Backing Up Persisted Data

You can take snapshots of your persistence store (persisted files) and copy the data onto other clusters without having to shut down your cluster. This process is called a hot backup.

How Members Create Hot Backups

When a member receives a backup request, it becomes the coordinating member and sends a new backup sequence ID to all members.

If all members respond that no other backup is currently in progress and that no other backup request has already been made, then the coordinating member commands the cluster to start the backup process nearly instantaneously on all members.

During this process, each member creates a sequenced backup subdirectory in the configured backup-dir directory with the name backup-<backupSeq>.

To make the backup process more performant, the contents of files in the persistence store are not duplicated. Instead, members create a new file name for the same persisted contents on disk, using hard links. If the hard link fails for any reason, members continue by copying the data, but future backups will still try to use hard links.

Backups are transactional and cluster-wide, so either all or none of the members start the same backup sequence.
For members to use hard links, your JDK must satisfy all requirements of the Files.createLink() method.
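The hard-link-with-copy-fallback behavior described above can be sketched in plain Java. This is a minimal illustration of the technique, not Hazelcast's internal code; the class and method names are made up for the example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class HardLinkFallback {

    // Give the same on-disk contents a new file name via a hard link;
    // if linking fails (e.g. unsupported file system), fall back to a copy.
    static void linkOrCopy(Path link, Path target) throws IOException {
        try {
            Files.createLink(link, target);
        } catch (IOException | UnsupportedOperationException e) {
            Files.copy(target, link, StandardCopyOption.COPY_ATTRIBUTES);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("store");
        Path original = Files.write(dir.resolve("chunk.bin"), new byte[]{1, 2, 3});
        linkOrCopy(dir.resolve("backup-chunk.bin"), original);
        System.out.println(Files.size(dir.resolve("backup-chunk.bin"))); // prints 3
    }
}
```

Because a hard link is just a second name for the same file contents, a successful link consumes no extra disk space for the data itself.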

Configuring a Backup Directory

To back up persisted data, you must first configure the backup directory in the backup-dir option:

  • XML

  • YAML

  • Java

<hazelcast>
    ...
    <persistence enabled="true">
        <backup-dir>/mnt/hot-backup</backup-dir>
        ...
    </persistence>
    ...
</hazelcast>
hazelcast:
  persistence:
    enabled: true
    backup-dir: /mnt/hot-backup
PersistenceConfig persistenceConfig = new PersistenceConfig();
persistenceConfig.setBackupDir(new File("/mnt/hot-backup"));
...
config.setPersistenceConfig(persistenceConfig);

Triggering a Backup Request

To trigger a new backup, you can use one of the following options:

Triggering a Backup Request in Java

  1. Put the cluster in a PASSIVE state.

    Backups may be initiated during membership changes, partition table changes, or normal data updates. As a result, some members may hold outdated versions of data when they start the backup process, and would copy that stale persisted data. Putting your cluster in a PASSIVE state makes the data more consistent across all members.

  2. Trigger a backup.

    PersistenceService service = member.getCluster().getPersistenceService();
    service.backup();

    The sequence number in sequenced backup subdirectories is generated by the hot backup process, but you can define your own sequence numbers as shown below:

    PersistenceService service = member.getCluster().getPersistenceService();
    long backupSeq = ...
    service.backup(backupSeq);
    Backups fail if any member already contains a sequenced backup subdirectory with the same name.
  3. Put your cluster back in an ACTIVE state.

    Once the backup method has returned, all cluster metadata has been copied and the exact partition data that needs to be copied has been marked. After that, the backup process continues asynchronously, and you can return the cluster to the ACTIVE state and resume operations.
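The three steps above can be sketched together as follows. This is a hedged outline, assuming `member` is a running HazelcastInstance with Persistence enabled; the `try`/`finally` is there so the cluster returns to ACTIVE even if the backup request fails.

```java
Cluster cluster = member.getCluster();
cluster.changeClusterState(ClusterState.PASSIVE);      // 1. make data consistent
try {
    PersistenceService service = cluster.getPersistenceService();
    service.backup();                                  // 2. trigger the backup
} finally {
    cluster.changeClusterState(ClusterState.ACTIVE);   // 3. resume operations
}
```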

Monitoring the Backup Process

Only cluster and distributed object metadata is copied synchronously during the invocation of the backup method. The rest of the persistence store is copied asynchronously after the method call has ended. You can track the progress of the backup process using one of the following options:

An example of how to track the progress via the Java API is shown below:

PersistenceService service = member.getCluster().getPersistenceService();
BackupTaskStatus status = service.getBackupTaskStatus();
...

The returned object contains the local member’s backup status:

  • The backup state (NOT_STARTED, IN_PROGRESS, FAILURE, SUCCESS)

  • The completed count

  • The total count

The completed and total counts give you a way to track the percentage of copied data. Currently, these counts refer to the number of local member persistence stores that have been copied and the total number of such stores (defined by PersistenceConfig.setParallelism()), but this may change at a later point to provide greater resolution.
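As a sketch of turning those counts into a display percentage (the getter names `getCompleted()` and `getTotal()` on BackupTaskStatus are assumed here; check your Hazelcast version's API):

```java
public class BackupProgress {

    // Convert completed/total store counts into a whole-number percentage.
    // Guards against division by zero before any stores are reported.
    static int percentComplete(int completed, int total) {
        return total == 0 ? 0 : (int) (100L * completed / total);
    }

    public static void main(String[] args) {
        // e.g. 3 of 4 local persistence stores copied so far
        System.out.println(percentComplete(3, 4) + "%"); // prints 75%
    }
}
```

In application code you would feed `status.getCompleted()` and `status.getTotal()` from the polled BackupTaskStatus into such a helper.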

Besides tracking the Persistence status via the API, you can view the status in the Management Center, and you can inspect the on-disk files of each member. While a backup is in progress, each member creates an inprogress file in each of the copied persistence stores. When the backup task completes the backup operation, this file is removed. If an error occurs during the backup task, the inprogress file is renamed to failure, and that file contains a stack trace of the exception.
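A monitoring script could classify a copied persistence store by those marker files. This is a minimal sketch based on the convention described above (marker file names "inprogress" and "failure"); the helper class is hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class BackupMarkers {

    // Classify a copied persistence store directory by its marker file:
    // "inprogress" while the backup runs, "failure" if it errored,
    // neither once it completed successfully.
    static String backupStateOf(Path storeDir) {
        if (Files.exists(storeDir.resolve("inprogress"))) return "IN_PROGRESS";
        if (Files.exists(storeDir.resolve("failure")))    return "FAILURE";
        return "COMPLETED";
    }

    public static void main(String[] args) throws IOException {
        Path store = Files.createTempDirectory("backup-0");
        Files.createFile(store.resolve("inprogress"));
        System.out.println(backupStateOf(store)); // prints IN_PROGRESS
    }
}
```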

Interrupting and Canceling a Backup

Once the backup method call has returned and asynchronous copying of the partition data has started, the backup task can be interrupted. This is helpful in situations where the backup task has started at an inconvenient time. For instance, the backup task could be automated and it could be accidentally triggered during high load on the Hazelcast instances, causing the performance of the Hazelcast instances to drop.

The backup task mainly performs disk I/O, consumes little CPU, and generally does not run for long (although you should test it in your environment to determine the exact impact). Nevertheless, you can abort the backup tasks on all members via a cluster-wide interrupt operation. This operation can be triggered programmatically or from the Management Center.

An example of programmatic interruption is shown below:

PersistenceService service = member.getCluster().getPersistenceService();
service.interruptBackupTask();
...

This method sends an interrupt to all members. The interrupt is ignored if the backup task is not currently in progress, so you can safely call this method even if it has previously been called or some members have already completed their local backup tasks.

You can also interrupt the local member backup task as shown below:

PersistenceService service = member.getCluster().getPersistenceService();
service.interruptLocalBackupTask();
...

The backup task stops as soon as possible. It does not remove the disk contents of the backup directory, so you must remove them manually.
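Cleaning up a leftover backup subdirectory after an interruption can be done with standard Java NIO. This is a generic recursive-delete sketch, not a Hazelcast API; the class name is made up for the example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

public class BackupCleanup {

    // Delete a leftover backup-<backupSeq> directory and everything in it.
    // Reverse ordering deletes children before their parent directories.
    static void deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) return;
        try (Stream<Path> paths = Files.walk(dir)) {
            paths.sorted(Comparator.reverseOrder())
                 .forEach(p -> p.toFile().delete());
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("backup-42");
        Files.createFile(dir.resolve("leftover.bin"));
        deleteRecursively(dir);
        System.out.println(Files.exists(dir)); // prints false
    }
}
```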

Copying Hot Backup Data onto a Cluster

The backup process creates sequenced subdirectories named backup-<backupSeq> in the configured hot backup directory (backup-dir). To start a cluster with data from a specific backup, you need to set the base directory (base-dir) to the desired backup subdirectory.

For example, if you configure your cluster members with the following, you would copy each existing member’s backup subdirectory to the directory that’s configured in the new member’s base-dir option:

base-dir=/opt/hz/data/
backup-dir=/opt/hz/backups

So, assuming the new members also had the same configured base-dir and backup-dir, you would copy /opt/hz/backups/backup-<backupSeq>/* from the existing member to /opt/hz/data on the new member.
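If you script the copy step in Java rather than with OS tools, a directory-tree copy like the following would do it. This is a generic sketch (the class name and the temp-directory stand-ins for the example paths are hypothetical), not part of the Hazelcast API.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.stream.Stream;

public class BackupRestoreCopy {

    // Copy the contents of an existing member's backup-<backupSeq> directory
    // into a new member's base-dir, preserving the directory layout.
    static void copyTree(Path source, Path target) throws IOException {
        try (Stream<Path> paths = Files.walk(source)) {
            for (Path p : (Iterable<Path>) paths::iterator) {
                Path dest = target.resolve(source.relativize(p).toString());
                if (Files.isDirectory(p)) {
                    Files.createDirectories(dest);
                } else {
                    Files.copy(p, dest, StandardCopyOption.REPLACE_EXISTING);
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path backup = Files.createTempDirectory("backup-1"); // stands in for /opt/hz/backups/backup-<backupSeq>
        Files.createFile(backup.resolve("store.bin"));
        Path baseDir = Files.createTempDirectory("data");    // stands in for /opt/hz/data
        copyTree(backup, baseDir);
        System.out.println(Files.exists(baseDir.resolve("store.bin"))); // prints true
    }
}
```

Run the copy while the new member is stopped, then start it so it loads the restored data from its base-dir.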