A newer version of Platform is available.

View latest

Invocation Lifecycle

When a write is requested with the methods, such as map.put() or queue.offer(), a write operation is submitted to the Hazelcast member that owns the primary replica of the specific partition. Partition of an operation is determined based on a parameter (key of an entry or name of the data structure, etc.) related to that operation depending on the data structure. Target Hazelcast member is figured out by looking up a local partition assignment/ownership table, which is updated on each partition migration and broadcasted to all cluster eventually.

When a Hazelcast member receives a partition specific operation, it executes the operation and propagates it to backup replica(s) with a logical timestamp. Number of backups for each operation depends on the data structure and its configuration. See Threading Model - Operation Threading for threading details.

Two types of backup replication are available: sync and async. Despite what their names imply, both types are still implementations of the lazy (async) replication model. The only difference between sync and async is that, the former makes the caller block until backup updates are applied by backup replicas and acknowledgments are sent back to the caller, but the latter is just fire & forget. Number of sync and async backups are defined in the data structure configurations, and you can use a combination of sync and async backups.

When backup updates are propagated, response of the execution including number of sync backup updates is sent to the caller and after receiving the response, caller waits to receive the specified number of sync backup acknowledgements for a predefined timeout. This timeout is 5 seconds by default and defined by the system property hazelcast.operation.backup.timeout.millis (see System Properties appendix).

A backup update can be missed because of a few reasons, such as a stale partition table information about a backup replica member, network interruption, or a member crash. That’s why sync backup acks require a timeout to give up. Regardless of being a sync or async backup, if a backup update is missed, the periodically running anti-entropy mechanism detects the inconsistency and synchronizes backup replicas with the primary. Also the graceful shutdown procedure ensures that all backup replicas for partitions whose primary replicas are assigned to the shutting down member will be consistent.

In some cases, although the target member of an invocation is assumed to be alive by the failure detector, the target may not execute the operation or send the response back in time. Network splits, long pauses caused by high load, GC or I/O (disk, network) can be listed as a few possible reasons. When an invocation doesn’t receive any response from the member that owns primary replica, then invocation fails with an OperationTimeoutException. This timeout is 2 minutes by default and defined by the system property hazelcast.operation.call.timeout.millis (System Properties appendix). When timeout is passed, result of the invocation will be indeterminate.