High-Density Memory Store

By default, data structures in Hazelcast store data on heap in serialized form for highest data compaction. However, these data structures are still subject to Java Garbage Collection (GC). Modern hardware has much more available memory. If you want to make use of that hardware and scale up by specifying higher heap sizes, GC becomes an increasing problem. The application faces long GC pauses that make the application unresponsive. Also, you may get out of memory errors if you fill your whole heap. Garbage collection, which is the automatic process that manages the application’s runtime memory, often forces you into configurations where multiple JVMs with small heaps (sizes of 2-4GB per member) run on a single physical hardware device to avoid garbage collection pauses. This results in oversized clusters to hold the data and leads to performance issues.

In Hazelcast Enterprise Edition, the High-Density Memory Store is Hazelcast’s enterprise in-memory storage solution. It solves garbage collection limitations so that applications can exploit hardware memory more efficiently without the need of oversized clusters. High-Density Memory Store is designed as a pluggable memory manager which enables multiple memory stores for different data structures. These memory stores are all accessible by a common access layer that scales up to massive amounts of the main memory on a single JVM by minimizing the GC pressure. High-Density Memory Store enables predictable application scaling and boosts performance and latency while minimizing garbage collection pauses.

This foundation includes, but is not limited to, storing keys and values next to the heap in a native memory region.

High-Density Memory Store is currently provided for the following Hazelcast features and implementations:

Map
JCache Implementation
Near Cache
Persistence
Java Client, when using the Near Cache for client
Web Session Replications
Hibernate 2nd Level Caching
Paging and Partition Predicates

Configure High-Density Memory Store

To use the High-Density Memory Store, the native memory usage must be enabled using the programmatic or declarative configuration. Also, you can configure its size, memory allocator type, minimum block size, page size and metadata space percentage.

The following are the configuration element descriptions:

capacity: The total native memory capacity to allocate. Its default value is 512 MB.
allocator type: Type of the memory allocator. Available values are as follows:
- STANDARD: This option is used internally by Hazelcast’s POOLED allocator type or for debugging/testing purposes.
  - With this option, the memory is allocated or deallocated using your operating system’s default memory manager.
  - It uses GNU C Library’s standard malloc() and free() methods which are subject to contention on multithreaded/multicore systems.
  - Memory operations may become slower when you perform a lot of small allocations and deallocations.
  - It may cause large memory fragmentations, unless you use a method in the background that emphasizes fragmentation avoidance, such as jemalloc(). Note that a large memory fragmentation can trigger the Linux Out of Memory Killer if there is no swap space enabled in your system. Even if the swap space is enabled, the killer can be again triggered if there is not enough swap space left.
  - If you still want to use the operating system’s default memory management, you can set the allocator type to STANDARD in your native memory configuration.
- POOLED: This is the default option, Hazelcast’s own pooling memory allocator.
  - With this option, memory blocks are managed using internal memory pools.
  - It allocates memory blocks, each of which has a 4MB page size by default, and splits them into chunks or merges them to create larger chunks when required. Sizing of these chunks follows the buddy memory allocation algorithm, i.e., power-of-two sizing.
  - It never frees memory blocks back to the operating system. It marks disposed memory blocks as available to be used later, meaning that these blocks are reusable.
  - Memory allocation and deallocation operations (except the ones requiring larger sizes than the page size) do not interact with the operating system mostly.
  - For memory allocation, it tries to find the requested memory size inside the internal memory pools. If it cannot be found, then it interacts with the operating system.
  - Please note that the buddy memory allocation system may cause internal fragmentation depending on the size of the memory allocation request and configured page sizes. Since memory is allocated in block sizes that are a power-of-two, a request that does not exactly match the available block size will result in allocation of the next largest block, potentially leaving unused space. For example, when using 128 KB page sizes, a 68 KB request will result in the allocation of a 128 KB block, wasting 60 KB.
  - Although the buddy memory allocation system reduces external fragmentation by merging contiguous and free buddy blocks of the same size into larger ones, it may still lead to external fragmentation if the system does not have sufficiently large free buddy blocks to allocate for the request. As a result, larger memory requests might be rejected, even if the total free memory is sufficient.
minimum block size: Minimum size of the blocks in bytes to split and fragment a page block to assign to an allocation request. It is used only by the POOLED memory allocator. Its default value is 16 bytes.
page size: Size of the page in bytes to allocate memory as a block. It is used only by the POOLED memory allocator. Its default value is 1 << 22 = 4194304 Bytes, about 4 MB.
metadata space percentage: Defines the percentage of the configured native memory that is estimated to be used for internal High-Density Memory data structures. It is used only by the POOLED memory allocator. Its default value is 12.5. Hazelcast will allow the actual metadata usage to exceed this defined threshold but will send warning logs. See Metadata.
persistent-memory: See Using the High-Density Memory Store with Persistent Memory Devices.

The following is the programmatic configuration example.

        Capacity capacity = new Capacity(512, MemoryUnit.MEGABYTES);
        NativeMemoryConfig nativeMemoryConfig =
                new NativeMemoryConfig()
                        .setAllocatorType(NativeMemoryConfig.MemoryAllocatorType.POOLED)
                        .setCapacity(capacity)
                        .setEnabled(true)
                        .setMinBlockSize(16)
                        .setPageSize(1 << 20);

The following is the declarative configuration example.

XML
YAML

<hazelcast>
    ...
    <native-memory allocator-type="POOLED" enabled="true">
        <capacity unit="MEGABYTES" value="512"/>
        <min-block-size>16</min-block-size>
        <page-size>4194304</page-size>
        <metadata-space-percentage>12.5</metadata-space-percentage>
        <persistent-memory>
            <directories>
                <directory numa-node="0">/mnt/pmem0</directory>
                <directory numa-node="1">/mnt/pmem1</directory>
            </directories>
        </persistent-memory>
    </native-memory>
    ...
</hazelcast>

hazelcast:
  native-memory:
    enabled: true
    allocator-type: POOLED
    capacity:
      unit: MEGABYTES
      value: 512
    min-block-size: 16
    page-size: 4194304
    metadata-space-percentage: 12.5
    persistent-memory:
        directories:
            - directory: /mnt/pmem0
              numa-node: 0
            - directory: /mnt/pmem1
              numa-node: 1

You can check whether there is enough free physical memory for the requested number of bytes using the system property hazelcast.hidensity.check.freememory. See the System Properties appendix on how to use Hazelcast system properties.

Enable High-Density Memory Store

You need to perform a rolling restart on your cluster to start using High-Density Memory Store in your maps.

For this you need to:

Update your cluster and map configurations:
- Enable the native memory in your cluster.
- Set the in-memory format of your maps on each member as NATIVE.
Perform a rolling restart of your members. See maintain-cluster:restart-cluster.adoc#rolling.

Your maps are now backed with High-Density Memory Store.

Metadata

Metadata refers to the off-heap memory used by Hazelcast internal data structures holding user data and for tracking allocated memory blocks or pages when using the POOLED allocator. Metadata is included when calculating the total committed off-heap memory.

There is no configurable limit for metadata usage — it will expand as needed to accommodate user data. This means it is possible for total committed off-heap memory to exceed the configured limit. This is an important consideration in deployment environments such as Kubernetes, where containers in Pods can be terminated if allocated memory exceeds configured limits.

While metadata usage expands as needed, this is not the case for user data. If you have 256MB native memory configured and 10MB of that is metadata, then you have 246MB available for storing your data. Exceeding that will result in a NativeOutOfMemoryError.

Estimating usage

You can estimate member sizing by benchmarking in a simulated production environment and monitoring the memory.usedMetadata metric.

The majority of metadata usage comes from the space allocated for the underlying data structures holding the entries for IMap and ICache. These data structures are similar to a hashmap, with a buffer expanding by a factor of two each time the number of entries exceeds a load factor of 0.6 on the existing buffer size. Each slot in the buffer costs 16 bytes and the minimum buffer size is 128.

You can define the rough cost of this structure in bytes for n entries with the following Python snippet:

import math as m

def mapCost(n):
    nextPow2 = 2 ** m.ceil(m.log2(n))
    return 16 * max(128, nextPow2 if n < 0.6 * nextPow2 else 2 * nextPow2)

Then for a cluster with p partitions and m members, you can estimate per member metadata usage for an IMap or ICache with b backups and n entries to be:

(p * (b + 1) * mapCost(int(n / p))) / m

Using the HazelcastJsonValue type in an IMap will roughly double the metadata storage costs because structural information will be stored for each entry.

Finally, the cost of tracking allocated memory blocks requires further metadata. Increase the above estimate by about 25% to cover this.

Using the High-Density Memory Store with persistent memory devices

To extend the memory available to the High-Density Memory Store, you can use persistent memory devices. This is a cost-efficient way to provide additional storage for data structures like IMap, ICache, and Near Cache.

Importantly, the High-Density Memory Store uses the memory provided by the persistent memory device as volatile memory. This means that all data stored on the device is lost when a Hazelcast member or cluster is restarted. To recover data from individual members or clusters after planned or unplanned shutdowns, you need to persist data on disk.

To use a persistent memory device, you don’t need to make any changes to your application code. The following example shows you how to configure the dual in-line memory modules (DIMMs) of Intel® Optane™ DC as the High-Density Memory Store in Hazelcast.

Although the example describes the configuration of a dual socket machine, this is not a requirement for using persistent memory devices.

Configuring Intel® Optane™

Deprecation Notice for Intel Optane

Intel® has discontinued support for Intel® Optane™ products.

Intel Optane support will be removed as of version 7.0.

Prerequisites:

Linux x86_64 operating system
Dual socket machine with both sockets populated with Intel® Optane™ DC DIMMs configured in interleaved mode
DIMMs mounted as /mnt/pmem0 and /mnt/pmem1, and known as NUMA node0 and node1 respectively

Persistent memory devices running on Linux x86_64 is the only configuration currently supported.

Example configuration

The optional persistent-memory element in the native-memory configuration block enables the use of the persistent memory device and defines the directories where this memory is mounted along with its operational mode.

Declarative configuration:

XML
YAML

<hazelcast>
    ...
    <native-memory allocator-type="POOLED" enabled="true">
        <size unit="GIGABYTES" value="100" />
        <persistent-memory enabled="true" mode="MOUNTED">
            <directories>
                <directory numa-node="0">/mnt/pmem0</directory>
                <directory numa-node="1">/mnt/pmem1</directory>
            </directories>
        </persistent-memory>
    </native-memory>
    ...
</hazelcast>

hazelcast:
  native-memory:
    enabled: true
    allocator-type: POOLED
    size:
      unit: GIGABYTES
      value: 100
    persistent-memory:
      enabled: true
      mode: MOUNTED
      directories:
        - directory: /mnt/pmem0
          numa-node: 0
        - directory: /mnt/pmem1
          numa-node: 1

Programmatic configuration:

Config config = new Config();
NativeMemoryConfig memoryConfig = new NativeMemoryConfig()
                .setEnabled(true)
                .setCapacity(new Capacity(100, MemoryUnit.GIGABYTES))
                .setAllocatorType(POOLED);
PersistentMemoryConfig pmemConfig = memoryConfig.getPersistentMemoryConfig()
                .setEnabled(true)
                .setMode(MOUNTED)
                .addDirectoryConfig(new PersistentMemoryDirectoryConfig("/mnt/pmem0", 0))
                .addDirectoryConfig(new PersistentMemoryDirectoryConfig("/mnt/pmem1", 1));
config.setNativeMemoryConfig(memoryConfig);

The following elements and attributes are also used in the configuration:

enabled: Specifies whether use of memory on the persistent memory device is enabled or not. The default value is false, that is use of the device is disabled.
mode: Defines the operational mode of the persistent memory device. Two modes are supported:
- MOUNTED: If you choose this mode, the memory is mounted into the file system (aka FS DAX).
- SYSTEM_MEMORY: If you choose this mode, Hazelcast uses the already onlined persistent memory portion of the system memory (aka KMEM DAX).
directories: A list of the mounted directories for the persistent memory device which are available for the storage of all data structures backed by the High-Density Memory Store. When you specify the mounted directories, the following rules apply:
- Use of memory on the persistent memory device is enabled automatically, you do not need to explicitly set the enabled attribute to true
- Set the mode to MOUNTED.
  
  If you don’t specify a directory for the persistent memory device, standard RAM is used instead.

Allocation strategies

Since on multi-socket machines there could be multiple mount points for persistent memory devices, the memory allocations need to follow an allocation strategy. Starting with 4.1, Hazelcast supports two allocation strategies:

Round-robin allocation strategy
NUMA-aware allocation strategy

Hazelcast’s memory allocator chooses and statically caches one of them for every allocator thread for the entire lifetime of the Hazelcast instance.

Round-robin allocation strategy

Hazelcast iterates over the configured memory directories and makes sure that every allocation is done in a different directory from the last. This is a best-effort attempt to distribute the allocations evenly on the memory modules, which is also important for memory utilization and performance. This is the default allocation strategy.

NUMA-aware allocation strategy

The persistent memory modules are mounted in memory slots just like the regular memory modules, and share the same memory bus. Therefore, the same NUMA locality concerns apply to the memory from a persistent memory device as regular memory. It is cheaper to access memory modules attached to the socket on which the current thread runs than from a different socket. The modules are typically referenced as NUMA-local and NUMA-remote memories. To achieve the best possible performance, Hazelcast implements a NUMA-aware allocation strategy to ensure that all accesses to memory modules are local, if certain conditions hold.

To enable this allocation strategy for a certain thread, the thread must be bounded to a single NUMA node, which means the kernel’s scheduler makes sure that the thread can only be scheduled on the CPUs of a single NUMA node. Starting with Hazelcast 4.1, this can be done with thread group granularity. For a detailed explanation, see Best practices.

Enabling the NUMA-aware allocation strategy for the operation threads can make the biggest impact on performance. For example:

-Dhazelcast.operation.thread.affinity=[0-9,20-29]:20,[10-19,30-39]:20

This configuration restricts all 40 operation threads to run on a single NUMA node on a dual-socket 40 core system, where:

node0’s CPU set is [0-9,20-29]
node1’s CPU set is [10-19,30-39].

To discover the NUMA nodes and their CPU sets, use the numactl -H command.

The second requirement for the NUMA-aware strategy is defining the NUMA node for every directory configured for the persistent memory device.

If both parts of the configuration are completed, the threads in the thread groups restricted to run on a single NUMA node will use the NUMA-aware allocation strategy. The rest of threads will still use the round-robin strategy. To identify the memory attached to a NUMA node, use the ndctl list -v -m fsdax command. From the output of ndctl you can you can check which mount point represents which persistent memory device.

Allocation overflowing

Since both allocation strategies try to allocate from a single memory directory, lack of capacity can prevent the chosen directory from serving the allocation request. In this case, both strategies take the other directories and try to serve the allocation from those. This fallback option compromises the NUMA-aware strategy as there will be NUMA-remote accesses to memory on the persistent memory device.

Performance of memory modules

While the memory modules of a persistent memory device are mounted next to the regular memory modules, and share the same memory bus, they have different performance characteristics.

Memory modules on the persistent memory device can be accessed with higher latency than the regular memory modules.
Performance of reads and writes on regular memory modules are the same. On a persistent memory device, the memory modules have an asymmetric performance profile, which means that the writes are slower than the reads.

Despite the above facts, whether the higher latency of the memory from persistent memory devices impacts the performance of Hazelcast depends on multiple factors.

Performance impacts

Hazelcast is a distributed platform so the higher latency of memory on a persistent memory device can easily be hidden by the latency variance of the network. For certain use cases there may be no observable difference in the throughput whether Hazelcast stores its data on memory from a persistent memory device or on regular memory. An example of this is caching, where accessing the entries remotely through Hazelcast clients results in a very similar throughput.

Other use cases that don’t involve networking, such as iterating over all entries with entry processors can be impacted by the higher latency of the memory modules on a persistent memory device, especially if the entry processors update a significant portion of the entries. For this type use case, the higher the entry size, the higher the impact on the performance. That means with smaller entry sizes the performance of Hazelcast with memory modules on a persistent memory device can be comparable to the performance with regular memory.