
Backing Up the Persistence Store

You can trigger backups of your persistence store using the Java API, REST API, or Management Center. Backing up the persistence store is useful if you want to copy the data to other clusters without having to shut down your cluster.

Before You Begin

To back up data in the persistence store, the cluster must be configured with a directory in the persistence.backup-dir configuration. See Configuring Persistence.

How Members Create Backups

When a member receives a backup request, it becomes the coordinating member and sends a new backup sequence ID to all members.

If all members respond that no other backup is currently in progress and that no other backup request has already been made, then the coordinating member commands the cluster to start the backup process nearly instantaneously on all members.
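The request-and-response round described above works like a two-phase commit. Below is a simplified, single-process model of that idea, not Hazelcast's implementation. It assumes hypothetical `SimulatedMember` and `BackupCoordinator` classes. Each member tracks only whether a backup is in progress and which sequence IDs it has already seen. "No other backup request has already been made" is interpreted here as "this sequence ID has not been seen before".

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical stand-in for a cluster member; not a Hazelcast class. */
final class SimulatedMember {
    private final Set<Long> seenSequences = new HashSet<>();
    private boolean backupInProgress;

    /** Phase 1: vote yes only if no backup is running and the sequence ID is new. */
    boolean prepare(long backupSeq) {
        return !backupInProgress && !seenSequences.contains(backupSeq);
    }

    /** Phase 2: start the agreed backup sequence. */
    void start(long backupSeq) {
        seenSequences.add(backupSeq);
        backupInProgress = true;
    }

    /** Marks the local backup task as finished. */
    void finish() {
        backupInProgress = false;
    }

    boolean isBackupInProgress() {
        return backupInProgress;
    }
}

/** Hypothetical coordinator: starts a backup everywhere or nowhere. */
final class BackupCoordinator {
    static boolean tryBackup(List<SimulatedMember> members, long backupSeq) {
        // Collect every vote before any member starts
        for (SimulatedMember m : members) {
            if (!m.prepare(backupSeq)) {
                return false; // a single "no" means no member starts this sequence
            }
        }
        for (SimulatedMember m : members) {
            m.start(backupSeq);
        }
        return true;
    }
}
```

The model runs in one thread, so nothing can change between the vote and the start. A distributed implementation also has to handle concurrent requests and member failures, which this sketch omits.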

During this process, each member creates a sequenced backup subdirectory in the configured backup-dir directory with the name backup-<backupSeq>.

To make the backup process faster, members do not duplicate the contents of persistence store files. Instead, they use hard links to create a new file name for the same persisted contents on disk. If creating a hard link fails for any reason, the member falls back to copying the data, but future backups still try to use hard links.

Backups are transactional and cluster-wide, so either all or none of the members start the same backup sequence.

For members to use hard links, your JDK must satisfy all requirements of the Files.createLink() method.
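The hard-link-with-fallback behavior can be sketched with the standard `java.nio.file.Files` API. This is an illustration of the technique under stated assumptions, not Hazelcast's backup code. `LinkOrCopy` is a hypothetical helper name.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class LinkOrCopy {
    /**
     * Creates target as a hard link to source. If the JDK or file system
     * cannot create the link, copies the file's bytes instead.
     * Returns true if a hard link was created, false if the file was copied.
     */
    static boolean linkOrCopy(Path source, Path target) throws IOException {
        try {
            // Argument order is (new link, existing file)
            Files.createLink(target, source);
            return true;
        } catch (UnsupportedOperationException | IOException e) {
            // Fallback: a full copy costs more disk space and I/O but always works
            Files.copy(source, target);
            return false;
        }
    }
}
```

A hard link shares the same on-disk data as the original name. It is therefore only a safe backup if the original file is not later modified in place; this sketch does not check that.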

Triggering a Backup

To trigger a new backup, you can use one of the following options:

  • Java API

  • REST API

    Deprecation Notice for the REST API

    The REST API has been deprecated and will be removed as of Hazelcast version 7.0. An improved version of this feature is under development.

  • Management Center

Triggering a Backup in Java

  1. Put the cluster in a PASSIVE state.

    Backups may be initiated during membership changes, partition table changes, or normal data updates. As a result, some members may hold outdated data when they start the backup process and therefore copy stale persisted data. By putting your cluster in a PASSIVE state, you can make data more consistent on all members.

  2. Trigger a backup.

    PersistenceService service = member.getCluster().getPersistenceService();
    service.backup();

    The sequence number in sequenced backup subdirectories is generated by the backup process, but you can define your own sequence numbers as shown below:

    PersistenceService service = member.getCluster().getPersistenceService();
    // Your own sequence number; each member names its subdirectory backup-<backupSeq>
    long backupSeq = ...
    service.backup(backupSeq);

    Backups fail if any member contains a sequenced backup subdirectory with the same name.

  3. Put your cluster back in an ACTIVE state.

    Once the backup method has returned, all cluster metadata is copied and the exact partition data which needs to be copied is marked. After that, the backup process continues asynchronously and you can return the cluster to the ACTIVE state and resume operations.
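When you supply your own sequence numbers in step 2, it can help to check which `backup-<backupSeq>` subdirectories already exist before calling the backup method. The sketch below scans a single local backup-dir and proposes the next unused number. `BackupSequences` is a hypothetical helper. Because backups fail if any member has a matching subdirectory, a check on one member's directory does not guarantee the number is free cluster-wide.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BackupSequences {
    private static final Pattern NAME = Pattern.compile("backup-(\\d+)");

    /** Subdirectory a member would create for this sequence: backup-<backupSeq>. */
    static Path subdirFor(Path backupDir, long backupSeq) {
        return backupDir.resolve("backup-" + backupSeq);
    }

    /** Highest sequence among backup-<n> subdirectories, or empty if there are none. */
    static OptionalLong highestSequence(Path backupDir) throws IOException {
        long max = Long.MIN_VALUE;
        boolean found = false;
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(backupDir)) {
            for (Path entry : entries) {
                Matcher m = NAME.matcher(entry.getFileName().toString());
                // Only directories count; a stray file with a matching name is ignored
                if (Files.isDirectory(entry) && m.matches()) {
                    max = Math.max(max, Long.parseLong(m.group(1)));
                    found = true;
                }
            }
        }
        return found ? OptionalLong.of(max) : OptionalLong.empty();
    }

    /** One past the highest sequence in this local backup-dir, or 0 if it is empty. */
    static long nextFreeSequence(Path backupDir) throws IOException {
        OptionalLong highest = highestSequence(backupDir);
        return highest.isPresent() ? highest.getAsLong() + 1 : 0L;
    }
}
```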

Monitoring the Backup Process

Only cluster and distributed object metadata is copied synchronously during the invocation of the backup method. The rest of the persistence store is copied asynchronously after the method call has ended. You can track the progress of the backup process through the Java API or Management Center, or by inspecting each member's files on disk.

An example of how to track the progress via the Java API is shown below:

PersistenceService service = member.getCluster().getPersistenceService();
// Describes the backup task of the local member only
BackupTaskStatus status = service.getBackupTaskStatus();
...

The returned object contains the local member’s backup status:

  • The backup state (NOT_STARTED, IN_PROGRESS, FAILURE, SUCCESS)

  • The completed count

  • The total count

The completed and total counts let you track the percentage of copied data. Currently, they are the number of copied and total local persistence stores (the number of stores is defined by PersistenceConfig.setParallelism()), but this may change in a later version to provide greater resolution.
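Turning the counts into a percentage is simple arithmetic. The sketch below uses a local `LocalBackupState` enum that mirrors the four states listed above, so it compiles without Hazelcast. It is not the Hazelcast enum. In practice you would pass in the state and counts from the `BackupTaskStatus` returned by `getBackupTaskStatus()`. `BackupProgress` is a hypothetical helper name.

```java
import java.util.Locale;

/** Local mirror of the backup states listed above; not the Hazelcast type. */
enum LocalBackupState { NOT_STARTED, IN_PROGRESS, FAILURE, SUCCESS }

final class BackupProgress {
    /** Percentage of local persistence stores copied; 0 when total is 0. */
    static double percent(int completed, int total) {
        if (total <= 0) {
            return 0.0;
        }
        return 100.0 * completed / total;
    }

    /** A task is finished once it has either failed or succeeded. */
    static boolean isDone(LocalBackupState state) {
        return state == LocalBackupState.FAILURE || state == LocalBackupState.SUCCESS;
    }

    /** Human-readable progress line, e.g. for a log message. */
    static String describe(LocalBackupState state, int completed, int total) {
        return String.format(Locale.ROOT, "%s: %d/%d stores (%.0f%%)",
                state, completed, total, percent(completed, total));
    }
}
```

Because the counts are currently per persistence store, progress moves in coarse steps. With a parallelism of 4, for example, the percentage can only be 0, 25, 50, 75, or 100.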

Besides tracking the backup status through the API, you can view it in Management Center, and you can inspect the on-disk files for each member. Each member creates an inprogress file in each persistence store it copies; the presence of this file means the backup is still in progress. When the backup task completes, this file is removed. If an error occurs during the backup task, the inprogress file is renamed to failure and contains a stack trace of the exception.
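The marker files described above can be checked with a short directory walk. This sketch assumes you point it at one member's `backup-<backupSeq>` subdirectory and that the markers are regular files named exactly `inprogress` and `failure`. `OnDiskBackupStatus` is a hypothetical helper. If neither marker is present, the result is `NO_MARKERS` rather than success, because the files alone cannot distinguish a completed backup from one that never started.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

final class OnDiskBackupStatus {
    enum Status { IN_PROGRESS, FAILURE, NO_MARKERS }

    /** Scans a backup subdirectory; a failure marker takes precedence over inprogress. */
    static Status inspect(Path backupSubdir) throws IOException {
        boolean inProgress = false;
        try (Stream<Path> paths = Files.walk(backupSubdir)) {
            for (Path p : (Iterable<Path>) paths::iterator) {
                if (!Files.isRegularFile(p)) {
                    continue; // markers are files; skip directories
                }
                String name = p.getFileName().toString();
                if (name.equals("failure")) {
                    return Status.FAILURE; // contains the exception's stack trace
                }
                if (name.equals("inprogress")) {
                    inProgress = true;
                }
            }
        }
        return inProgress ? Status.IN_PROGRESS : Status.NO_MARKERS;
    }
}
```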

Interrupting and Canceling a Backup

Once the backup method call has returned and asynchronous copying of the partition data has started, you can interrupt the backup task. This is helpful when the backup task has started at an inconvenient time. For instance, an automated backup task could be triggered accidentally during high load on the Hazelcast instances, degrading their performance.

The backup task mainly uses disk I/O, consumes little CPU, and generally does not take long (although you should test it in your environment to determine the exact impact). Nevertheless, you can abort the backup tasks on all members with a cluster-wide interrupt operation, triggered either programmatically or from Management Center.

An example of programmatic interruption is shown below:

PersistenceService service = member.getCluster().getPersistenceService();
service.interruptBackupTask();
...

This method sends an interrupt to all members. The interrupt is ignored if the backup task is not currently in progress, so you can safely call this method even if it has already been called or some members have already completed their local backup tasks.

You can also interrupt the local member backup task as shown below:

PersistenceService service = member.getCluster().getPersistenceService();
service.interruptLocalBackupTask();
...

The backup task stops as soon as possible, but it does not remove the contents of the backup directory from disk, so you must remove them manually.
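Manual cleanup amounts to deleting the interrupted backup's directory tree on each affected member. Below is a minimal sketch using `java.nio.file`. It assumes you have already identified the correct directory (for example, the `backup-<backupSeq>` subdirectory under backup-dir) and that the backup task is no longer running. `BackupCleanup` is a hypothetical helper.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class BackupCleanup {
    /** Deletes a directory tree; does nothing if the directory is already gone. */
    static void deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(dir)) {
            // Reverse order puts every child before its parent directory
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }
}
```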

Restoring from a Backup

To restore a cluster with data from a specific backup, do the following:

  1. Remove the files in your base-dir directory.

  2. Copy the contents of a sequenced subdirectory in your backup-dir directory to your base-dir directory.

  3. Restart the cluster.
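Steps 1 and 2 above can be scripted per member with the standard file APIs. The sketch below assumes the member is stopped and that you pass it the path of one sequenced backup subdirectory and the member's base-dir. It refuses to run if the backup lies inside base-dir, since emptying base-dir would then delete the backup itself. `BackupRestore` is a hypothetical helper. The same method would apply when seeding a new member's base-dir from an existing member's backup.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class BackupRestore {
    /** Empties baseDir, then copies the contents of backupSubdir into it. */
    static void restore(Path backupSubdir, Path baseDir) throws IOException {
        Path src = backupSubdir.toAbsolutePath().normalize();
        Path dest = baseDir.toAbsolutePath().normalize();
        // Guard: step 1 must not delete the backup we are about to copy
        if (src.startsWith(dest)) {
            throw new IllegalArgumentException("backup subdirectory is inside base-dir: " + src);
        }
        Files.createDirectories(dest);

        // Step 1: remove everything under base-dir (deepest paths first), keep base-dir itself
        List<Path> old;
        try (Stream<Path> walk = Files.walk(dest)) {
            old = walk.filter(p -> !p.equals(dest))
                      .sorted(Comparator.reverseOrder())
                      .collect(Collectors.toList());
        }
        for (Path p : old) {
            Files.delete(p);
        }

        // Step 2: copy the backup tree; Files.walk visits a directory before its entries
        List<Path> sources;
        try (Stream<Path> walk = Files.walk(src)) {
            sources = walk.collect(Collectors.toList());
        }
        for (Path s : sources) {
            Path target = dest.resolve(src.relativize(s).toString());
            if (Files.isDirectory(s)) {
                Files.createDirectories(target);
            } else {
                Files.copy(s, target);
            }
        }
    }
}
```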

To start a new cluster from the backups of an existing cluster, do the following for each member of the existing cluster before starting the new cluster: copy the contents of the member's backup subdirectory to the directory configured as base-dir on a new member.

The cluster on which you restore the backup must have the same number of members as the cluster that created the backups.