Persistence, Backup and Restore

Persistence allows individual members and whole clusters to recover data by persisting map entries, JCache data, and streaming job snapshots on disk. Members can use persisted data to recover from a planned shutdown (including rolling upgrades), a sudden cluster-wide crash, or a single member failure.

This topic assumes that you know about Persistence in Hazelcast. To learn about Persistence, see the Platform documentation.

The Operator supports two volume options for enabling Persistence: PVC and HostPath. We recommend using PVC, so the examples on this page use this option. If you need to use HostPath, see HostPath Support for Persistence.

There are two options for backups: local and external. Local backups are kept in the volume and never moved anywhere, whereas external backups are moved to buckets provided by the user.

For a working example, see this tutorial.

Enabling Persistence

To enable Persistence, apply the following configuration:

apiVersion: hazelcast.com/v1alpha1
kind: Hazelcast
metadata:
  name: hazelcast
spec:
  clusterSize: 3
  repository: 'docker.io/hazelcast/hazelcast-enterprise'
  licenseKeySecret: hazelcast-license-key
  persistence:
    baseDir: "/data/hot-restart/"  (1)
    clusterDataRecoveryPolicy: "FullRecoveryOnly"  (2)
    pvc:
      accessModes: ["ReadWriteOnce"]
      requestStorage: 20Gi  (3)
  agent: (4)
    repository: hazelcast/platform-operator-agent
1 Base directory of the backup data.
2 Cluster recovery policy.
3 Size of the PersistentVolumeClaim (PVC) where Hazelcast data is persisted.
4 Agent responsible for moving data from the local storage to external buckets. The agent configuration is optional. If you enable persistence and do not pass the agent configuration, Hazelcast Platform Operator uses the latest agent version that is compatible with its version.
Make sure to calculate the total disk space that you will use. The total used disk space may be larger than the size of the in-memory data, depending on how many backups you keep; for example, a cluster holding 10 GiB of in-memory data with three retained backups may need roughly 40 GiB of disk space (one active copy plus three backup copies).

Triggering Local Backups

You can take local backups using the HotBackup custom resource. Local backups are kept in the volume and are not moved anywhere.

apiVersion: hazelcast.com/v1alpha1
kind: HotBackup
metadata:
  name: hot-backup
spec:
  hazelcastResourceName: hazelcast
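
As a minimal sketch, assuming you saved this manifest as hot-backup.yaml (a hypothetical file name), trigger the backup by applying it:

# Create the HotBackup resource; the Operator triggers the backup immediately.
kubectl apply -f hot-backup.yaml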

Triggering External Backups

In some cases, keeping the data only in a PVC and restoring it from there is not enough; for example, when you want to move data between two Kubernetes clusters. You can use external storage to make your backups portable.

When persistence is enabled, the Hazelcast Pods start with a sidecar agent that uploads backups to an external bucket provided by the user.

To trigger an external backup, you need to configure a bucket URI and a secret to tell Hazelcast where to store backup data and how to authenticate.

apiVersion: hazelcast.com/v1alpha1
kind: HotBackup
metadata:
  name: hot-backup
spec:
  hazelcastResourceName: hazelcast
  bucketURI: "s3://operator-backup" (1)
  secret: "br-secret-s3" (2)
1 Bucket URI where backup data will be stored.
2 Name of the secret with credentials for accessing the given Bucket URI.
  • AWS

  • GCP

  • Azure

See AWS Session to learn about the authentication procedure.

kubectl create secret generic <secret-name> --from-literal=region=<region> \
	--from-literal=access-key-id=<access-key-id> \
	--from-literal=secret-access-key=<secret-access-key>

See Application Default Credentials to learn about the authentication procedure.

kubectl create secret generic <secret-name> --from-file=google-credentials-path=<service_account_json_file>

See Azure Storage Account Keys to learn about the authentication procedure.

kubectl create secret generic <secret-name> \
	--from-literal=storage-account=<storage-account> \
	--from-literal=storage-key=<storage-key>
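
The bucketURI scheme selects the storage provider. The AWS example above uses s3://; as an assumption to verify against the API Reference, GCP and Azure buckets are typically addressed with gs:// and azblob:// URIs. A sketch for GCP (the secret name br-secret-gcp is hypothetical):

apiVersion: hazelcast.com/v1alpha1
kind: HotBackup
metadata:
  name: hot-backup-gcp
spec:
  hazelcastResourceName: hazelcast
  bucketURI: "gs://operator-backup"  # assumed GCS scheme; use azblob:// for Azure
  secret: "br-secret-gcp"  # hypothetical secret created with the GCP command above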

Scheduling Backups

You can schedule backups using the schedule and hotBackupTemplate fields of the CronHotBackup resource. For more information about the CronHotBackup resource, see the API Reference.

apiVersion: hazelcast.com/v1alpha1
kind: CronHotBackup
metadata:
  name: cron-hot-backup
spec:
  schedule: "* 0-23/6 * * *"
  hotBackupTemplate:
    spec:
      hazelcastResourceName: hazelcast

The schedule field takes a valid cron expression. For example, you can configure the following scheduled backups:

30 10 * * *

At 10:30 AM every day

0 0 1,15,25 * *

On 1st, 15th, and 25th of each month at midnight

@monthly

On the first day of each month at midnight

For a full list of supported expressions, see the library documentation.
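
On each trigger, the CronHotBackup resource creates a new HotBackup resource from the template. As a quick check, you can list both:

# List the scheduled backup definition and the HotBackup resources it created.
kubectl get cronhotbackup cron-hot-backup
kubectl get hotbackup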

Checking the Status of a Backup

To check the status of a local backup, run the following command:

kubectl get hotbackup hot-backup

The status of the backup is displayed in the output.

NAME         STATUS
hot-backup   Success
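
For more detail than the STATUS column, for example a failure message, you can inspect the full resource, or watch until the backup completes:

# Dump the full HotBackup resource, including its status fields.
kubectl get hotbackup hot-backup -o yaml

# Watch the STATUS column update as the backup progresses.
kubectl get hotbackup hot-backup -w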

Restoring from Local Backups

To restore a cluster from local backups, you can directly reapply the Hazelcast resource, which gives the cluster access to the PVCs that contain the persisted data. This will restore the Hazelcast cluster from existing hot-restart folders.

Or, to restore from local backups that you have taken using the HotBackup resource, give the HotBackup resource name in the restore configuration. For the restore to work correctly, make sure the status of the HotBackup resource is Success.

When this restore mechanism is used, the Restore Agent container is deployed with the Hazelcast container in the same Pod. The agent starts as an initContainer before the Hazelcast container.

apiVersion: hazelcast.com/v1alpha1
kind: Hazelcast
metadata:
  name: hazelcast
spec:
  clusterSize: 3
  repository: 'docker.io/hazelcast/hazelcast-enterprise'
  licenseKeySecret: hazelcast-license-key
  persistence:
    baseDir: "/data/hot-restart/"
    clusterDataRecoveryPolicy: "FullRecoveryOnly"
    pvc:
      accessModes: ["ReadWriteOnce"]
      requestStorage: 20Gi
    restore:
      hotBackupResourceName: hot-backup (1)
  agent: (2)
    repository: hazelcast/platform-operator-agent
1 HotBackup resource name used for restore. The backup folder name will be taken from the HotBackup resource.
2 Agent responsible for restoring data from the local storage. The agent configuration is optional. If you give restore under persistence and do not pass the agent configuration, Hazelcast Platform Operator uses the latest agent version that is compatible with its version.
For a restore to be successful, you must provide the restore configuration when you create the Hazelcast custom resource. Otherwise, triggering a restore on currently running members is likely to fail.
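
As a sketch of the full restore flow, assuming the old cluster must be replaced and the manifest above is saved as hazelcast-restore.yaml (a hypothetical file name):

# Delete the old Hazelcast custom resource. The PVCs holding the persisted
# data are typically retained and are reattached by the new cluster.
kubectl delete hazelcast hazelcast

# Recreate the cluster with the restore configuration present from creation.
kubectl apply -f hazelcast-restore.yaml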

Restoring from External Backups

To restore a cluster from external backups, you can either set up the bucket configuration or give the HotBackup resource name that you used to trigger the external backup. In either case, the backup is restored from the external bucket.

When this restore mechanism is used, the Restore Agent container is deployed with the Hazelcast container in the same Pod. The agent starts as an initContainer before the Hazelcast container.

If you have not created the secret, you must do so in the same way as in Triggering External Backups.
  • Bucket Configuration

  • HotBackup resource name

apiVersion: hazelcast.com/v1alpha1
kind: Hazelcast
metadata:
  name: hazelcast
spec:
  clusterSize: 3
  repository: 'docker.io/hazelcast/hazelcast-enterprise'
  licenseKeySecret: hazelcast-license-key
  persistence:
    baseDir: "/data/hot-restart/"
    clusterDataRecoveryPolicy: "FullRecoveryOnly"
    pvc:
      accessModes: ["ReadWriteOnce"]
      requestStorage: 20Gi
    restore:
      bucketConfig:
        bucketURI: "s3://operator-backup?prefix=hazelcast/2022-06-08-17-01-20/" (1)
        secret: br-secret-s3 (2)
  agent: (3)
    repository: hazelcast/platform-operator-agent
1 Bucket URI where backup data will be restored from.
2 Name of the secret with credentials for accessing the given bucket URI.
3 Agent responsible for restoring data from external storage. The agent configuration is optional. If you provide restore under persistence and do not pass the agent configuration, Hazelcast Platform Operator uses the latest agent version that is compatible with its version.
apiVersion: hazelcast.com/v1alpha1
kind: Hazelcast
metadata:
  name: hazelcast
spec:
  clusterSize: 3
  repository: 'docker.io/hazelcast/hazelcast-enterprise'
  licenseKeySecret: hazelcast-license-key
  persistence:
    baseDir: "/data/hot-restart/"
    clusterDataRecoveryPolicy: "FullRecoveryOnly"
    pvc:
      accessModes: ["ReadWriteOnce"]
      requestStorage: 20Gi
    restore:
      hotBackupResourceName: hot-backup (1)
  agent: (2)
    repository: hazelcast/platform-operator-agent
1 HotBackup resource name used for the restore. The bucket URI and secret are taken from the HotBackup resource.
2 Agent responsible for restoring data from external storage. The agent configuration is optional. If you provide restore under persistence and do not pass the agent configuration, Hazelcast Platform Operator uses the latest agent version that is compatible with its version.
For a restore to be successful, you must provide the restore configuration when you create the Hazelcast custom resource. Otherwise, triggering a restore on currently running members is likely to fail.

Configuring Persistence

Data Recovery Timeout

To set a data recovery timeout, use the dataRecoveryTimeout field. The field takes an integer value representing the timeout in seconds, and the Operator uses this value to set the validation-timeout-seconds and data-load-timeout-seconds Hazelcast Persistence options.

apiVersion: hazelcast.com/v1alpha1
kind: Hazelcast
metadata:
  name: hazelcast
spec:
  clusterSize: 3
  repository: 'docker.io/hazelcast/hazelcast-enterprise'
  licenseKeySecret: hazelcast-license-key
  persistence:
    baseDir: "/data/hot-restart/"
    clusterDataRecoveryPolicy: "FullRecoveryOnly"
    dataRecoveryTimeout: 600
    pvc:
      accessModes: ["ReadWriteOnce"]
      requestStorage: 20Gi

Choosing a Cluster Recovery Policy

To decide how a cluster should behave when one or more members cannot rejoin after a cluster-wide restart, you can define one of the following cluster recovery policies. The Operator supports all the policies in the Hazelcast Platform cluster-data-recovery-policy configuration options. For complete descriptions and advice on choosing a policy, see the Platform documentation.

FullRecoveryOnly

Does not allow a partial start of the cluster.

PartialRecoveryMostRecent

Allows a partial start with the members that have the most recent partition table.

PartialRecoveryMostComplete

Allows a partial start with the members that have the most complete partition table.
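
For example, to allow a partial start with the members that have the most recent partition table, set the policy in the persistence configuration shown earlier:

  persistence:
    baseDir: "/data/hot-restart/"
    clusterDataRecoveryPolicy: "PartialRecoveryMostRecent"
    pvc:
      accessModes: ["ReadWriteOnce"]
      requestStorage: 20Gi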

Configuring Force-Start

If you use the FullRecoveryOnly policy, you can configure the Operator to detect failed Hazelcast members and automatically trigger a force-start. The Operator will trigger a force-start only if the cluster is in a PASSIVE state.

The cluster loses all persisted data after a force-start.

Enable autoForceStart:

apiVersion: hazelcast.com/v1alpha1
kind: Hazelcast
metadata:
  name: hazelcast
spec:
  clusterSize: 3
  repository: 'docker.io/hazelcast/hazelcast-enterprise'
  version: '5.1.4-slim'
  licenseKeySecret: hazelcast-license-key
  persistence:
    baseDir: "/data/hot-restart/"
    autoForceStart: true

HostPath Support for Persistence

You can also use HostPath to enable persistence.

HostPath support is discouraged for production environments for the reasons mentioned in the Kubernetes documentation.
HostPath support expects the cluster size to be equal to the number of Kubernetes nodes, with Pods distributed equally across the nodes. You can manage how Pods are distributed among nodes by setting the topologySpreadConstraints field, which is described in Scheduling Hazelcast Pods.

Create the Hazelcast resource with clusterSize equal to the number of Kubernetes nodes, and set appropriate topologySpreadConstraints.

apiVersion: hazelcast.com/v1alpha1
kind: Hazelcast
metadata:
  name: hazelcast
spec:
  clusterSize: 3
  repository: 'docker.io/hazelcast/hazelcast-enterprise'
  licenseKeySecret: hazelcast-license-key
  persistence:
    baseDir: "/data/hot-restart/"
    clusterDataRecoveryPolicy: "FullRecoveryOnly"
    hostPath: "/tmp/hazelcast"
  scheduling:
    topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app.kubernetes.io/instance: hazelcast
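
After the cluster starts, you can verify that the topology spread constraints placed each Pod on a separate node:

# Show the Kubernetes node that each Hazelcast Pod was scheduled on.
kubectl get pods -l app.kubernetes.io/instance=hazelcast -o wide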