Persistence Design Details

Hazelcast’s Persistence uses a log-structured storage approach. The following is a top-level design description:

  • The only kind of update operation on persistent data is appending.

  • What is appended are facts about events that happened to the data model represented by the store; either a new value was assigned to a key or a key was removed.

  • Each record associated with a key makes stale the previous record that was associated with that key.

  • Stale records contribute to the amount of garbage present in the persistent storage.

  • Measures are taken to remove garbage from the storage.

This kind of design focuses almost all the system’s complexity into the garbage collection (GC) process, stripping down the client’s operation to the bare necessity of guaranteeing persistent behavior: a simple file append operation. Consequently, the latency of operations is close to the theoretical minimum in almost all cases. Complications arise only during prolonged periods of maximum load; this is where the details of the GC process begin to matter.

Concurrent, Incremental, Generational GC

In order to maintain the lowest possible footprint in the update operation latency, the following properties are built into the garbage collection process:

  • A dedicated thread performs the GC. In Hazelcast terms, this thread is called the Collector and the application thread is called the Mutator.

  • On each update there is metadata to be maintained; this is done asynchronously by the Collector thread. The Mutator enqueues update events to the Collector’s work queue.

  • The Collector keeps draining its work queue at all times, including the time it goes through the GC cycle. Updates are taken into account at each stage in the GC cycle, preventing the copying of already dead records into compacted files.

  • All GC-induced I/O competes for the same resources as the Mutator’s update operations. Therefore, measures are taken to minimize the impact of I/O done during GC:

    • data is never read from files, but from RAM

    • a heuristic scheme is employed which minimizes the number of bytes written to the disk for each kilobyte of the reclaimed garbage

    • measures are also taken to achieve a good interleaving of Collector and Mutator operations, minimizing latency outliers perceived by the Mutator

I/O Minimization Scheme

The success of this scheme is subject to a bet on the Weak Generational Garbage Hypothesis, which states that a new record entering the system is likely to become garbage soon. In other words, a key updated now is more likely than average to be updated again soon.

The scheme was taken from the seminal Sprite LFS paper, Rosenblum, Ousterhout, The Design and Implementation of a Log-Structured File System. The following is an outline of the paper:

  • Data is not written to one huge file, but to many files of moderate size (8 MB) called "chunks".

  • Garbage is collected incrementally, i.e. by choosing several chunks, then copying all their live data to new chunks, then deleting the old ones.

  • I/O is minimized using a collection technique which results in a bimodal distribution of chunks with respect to their garbage content: most files are either almost all live data or they are all garbage.

  • The technique consists of two main principles:

    • Chunks are selected based on their Cost-Benefit factor (see below).

    • Records are sorted by age before copying to new chunks.

Cost-Benefit Factor

The Cost-Benefit factor of a chunk consists of two components multiplied together:

  1. The ratio of benefit (amount of garbage that can be collected) to I/O cost (amount of live data to be written).

  2. The age of the data in the chunk, measured as the age of the youngest record it contains.

The essence is in the second component: given equal amount of garbage in all chunks, it makes the young ones less attractive to the Collector. Assuming the generational garbage hypothesis, this allows the young chunks to quickly accumulate more garbage. On the flip side, it also ensures that even files with little garbage are eventually garbage collected. This removes garbage which would otherwise linger on, thinly spread across many chunk files.

Sorting records by age groups the young records together in a single chunk and does the same for older records. Therefore the chunks are either tend to keep their data live for a longer time, or quickly become full of garbage.