Production Checklist

The production checklist provides a set of best practices and recommendations to ensure a smooth transition to a production environment which runs a Hazelcast cluster. You should plan for and consider the following areas.

Network Recommendations

All Hazelcast members forming a cluster should be on a minimum 1Gbps Local Area Network (LAN).

Hardware Recommendations

We suggest at least 8 CPUs or equivalent per member, as well as running a single Hazelcast member per host.

Operating System Recommendations

Hazelcast works in many operating environments and some environments have unique considerations. These are highlighted below.

As a general suggestion, we recommend turning off the swapping at operating system level.

Solaris Sparc

Hazelcast is certified for Solaris Sparc. If you are using the former Hazelcast IMDG with versions older than 3.6, there is a known issue with High-Density Memory Store due to the Sparc architecture not supporting unaligned memory access.

VMWare ESX

Hazelcast is certified on VMWare VSphere 5.5/ESXi 6.0. Generally speaking, Hazelcast can use all of the resources on a full machine. Splitting a single physical machine into multiple virtual machines and thereby dividing resources are not required.

Consider the following for VMWare ESX:

  • Avoid over-committing memory. Always use dedicated physical memory for guests running Hazelcast.

  • Do not use memory ballooning.

  • Be careful over-committing the CPU, too. Watch for the CPU steal time.

  • Do not move guests while Hazelcast is running; disable vMotion. If you want to use vMotion (live migration), first stop the Hazelcast cluster and restart it after the migration.

  • Always enable verbose garbage collection (GC) logs; when "Real" time is higher than "User" time, then it may indicate some virtualization issues. JVM is off-CPU during GC (and probably waiting for I/O).

  • Note VMWare guests network types.

  • Use pass-through hard disks/partitions; do not to use image files.

  • Configure partition groups to use a separate underlying physical machine for partition backups.

  • If you want to use automatic snapshots, first stop the Hazelcast cluster and restart it after the snapshot.

  • Network performance issues, including timeouts, might occur with LRO (Large Receive Offload) enabled on Linux virtual machines and ESXi/ESX hosts. We have specifically had this reported in VMware environments, but it could potentially impact other environments as well. We strongly recommend disabling LRO when running in virtualized environments, see https://kb.vmware.com/s/article/1027511.

Windows

According to a reported rare case, I/O threads can consume a lot of CPU cycles unexpectedly, even in an idle state. This can lead to CPU usage going up to 100%. This is reported not only for Hazelcast but for other GitHub projects as well. Workaround for such cases is to supply the system property -Dhazelcast.io.selectorMode=selectwithfix on JVM startup. See the related GitHub issue for more details.

JVM Recommendations

In most cases, we don’t recommend more than 16 GB per JVM heap to avoid long garbage collection (GC) pauses and latencies. If High-Density memory is available, no more than 8 GB heap is recommended. Otherwise, horizontal scaling should be considered rather than vertical.

General recommendations;

  • GC logs should be enabled

  • Minimum and maximum heap size should be equal

For Java 9+;

  • G1GC is the default recommended GC policy

  • No tuning is recommended unless needed

For Java 8;

  • Recommended GC policies are CMS and ParNewGC:

    • -XX:CMSInitiatingOccupancyFraction=65

    • -XX:+UseParNewGC

    • -XX:+UseConcMarkSweepGC

  • For large heaps G1GC is recommended as above

Data Size Calculation Recommendations

The total data size should be calculated based on primary and backup data and you should increase the partition count accordingly to make sure a partition size does not exceed 50MB.

Partition Size/Count Calculation Recommendations

An optimal partition count and size establish a balance between the number partitions on each member and the data amount on each partition. You can consider the following when deciding on a partition count.

  • The partition count should be a prime number. This helps minimizing the collision of keys across partitions, ensuring more consistent lookup times.

  • A too low partition count is constraining; it should be large enough for a balanced data or task distribution so that each member does not manage too few partitions.

  • A partition size not exceeding 50MB typically ensures a good performance. For larger clusters, you can use the 50MB-100MB range; keep in mind that cluster performance may be affected adversely since larger heaps with more overhead space are required in this case.

If you are a Hazelcast Enterprise customer using the High-Density Data Store with large data sizes, we recommend a large increase in partition count, starting with 5009 or higher.

The partition count cannot be changed after a cluster is created, so if you have a larger cluster, be sure to test and set an optimum partition count prior to deployment. If you need to change the partition count after a cluster is running, you will need to schedule a maintenance window to update the partition count and restart the cluster.

Large Cluster Configuration Recommendations

You may need to increase the partition count relative to the number of Hazelcast members in the cluster and the data amount on each partition.