Production Checklist

The production checklist provides a set of best practices and recommendations to ensure a smooth transition to a production environment that runs a Hazelcast cluster. Plan for and consider the following areas.

Network Recommendations

All Hazelcast members forming a cluster should be connected through a Local Area Network (LAN) of at least 1Gbps.

Hardware Recommendations

We suggest at least 8 CPU cores or equivalent per member, as well as running a single Hazelcast member per host.

Operating System Recommendations

Hazelcast works in many operating environments and some environments have unique considerations. These are highlighted below.

As a general suggestion, we recommend turning off swapping at the operating system level.

Solaris

Hazelcast is certified for Solaris SPARC.

However, the following modules are not supported for the Solaris operating system:

hazelcast-jet-grpc
hazelcast-jet-protobuf
hazelcast-jet-python

VMware ESX

Hazelcast is certified on VMware vSphere 5.5/ESXi 6.0. Generally speaking, Hazelcast can use all of the resources on a full machine. Splitting a single physical machine into multiple virtual machines, and thereby dividing its resources, is not required.

Consider the following for VMware ESX:

Avoid sharing one Network Interface Card (NIC) between multiple virtual machine environments. A Hazelcast cluster is a distributed system and can be very network-intensive. Trying to share one physical NIC between multiple VMs may cause network-related performance problems.
Avoid over-committing memory. Always use dedicated physical memory for guests running Hazelcast.
Do not use memory ballooning.
Be careful overcommitting CPU cores. Monitor CPU steal time metrics.
Do not move guests while Hazelcast is running; for ESX, this means disabling vMotion. If you want to use vMotion (live migration), first stop the Hazelcast cluster, then restart it after the migration completes.
Always enable verbose garbage collection (GC) logs in the Java Virtual Machine. When the "Real" time is higher than the "User" time, this may indicate virtualization issues: the JVM was not using the CPU to execute code for part of the garbage collection, and was probably waiting on input/output (I/O) operations.
Pay attention to the network adapter types configured for VMware guests.
Use pass-through hard disks/partitions; do not use image files.
Configure partition groups to use a separate underlying physical machine for partition backups.
If you want to use automatic snapshots, first stop the Hazelcast cluster, then restart it after the snapshot completes.
Network performance issues, including timeouts, might occur with LRO (Large Receive Offload) enabled on Linux virtual machines and ESXi/ESX hosts. We have specifically had this reported in VMware environments, but it could potentially impact other environments as well. Although this issue is observed only under certain conditions, we strongly recommend either using the e1000 device driver, or disabling LRO when running in virtualized environments. For more information, see https://kb.vmware.com/s/article/1027511.
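The "Real" versus "User" comparison from the GC log item above can be automated when scanning logs. The following is a minimal sketch and not part of Hazelcast: the class GcTimesCheck and its method are hypothetical names, and the regular expression assumes Java 8 style lines such as "[Times: user=0.10 sys=0.02, real=0.50 secs]" (it also tolerates the "User=0.02s Sys=0.00s Real=0.01s" form). Adjust the pattern to whatever your JVM actually prints.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not a Hazelcast API.
final class GcTimesCheck {
    // A decimal number such as 0.10 or 3.
    private static final String NUM = "([0-9]+(?:\\.[0-9]+)?)";

    // Assumed log shape: user=<n> sys=<n>, real=<n> (optional "s" suffix, any case).
    private static final Pattern TIMES = Pattern.compile(
            "user=" + NUM + "s?\\s+sys=" + NUM + "s?,?\\s+real=" + NUM,
            Pattern.CASE_INSENSITIVE);

    /** Returns true when the line reports a wall-clock ("real") time above the "user" CPU time. */
    static boolean realExceedsUser(String logLine) {
        Matcher m = TIMES.matcher(logLine);
        if (!m.find()) {
            return false; // no timing information on this line
        }
        double user = Double.parseDouble(m.group(1));
        double real = Double.parseDouble(m.group(3));
        return real > user;
    }

    public static void main(String[] args) {
        String line = "[Times: user=0.10 sys=0.02, real=0.50 secs]";
        System.out.println((realExceedsUser(line) ? "check virtualization: " : "ok: ") + line);
    }
}
```

A single flagged line is not conclusive; repeated occurrences are the signal worth investigating alongside CPU steal time.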

Windows

In rare reported cases, I/O threads can unexpectedly consume a lot of CPU cycles, even when idle, which can drive CPU usage up to 100%. This has been reported not only for Hazelcast but also for other projects on GitHub. The workaround for such cases is to supply the system property -Dhazelcast.io.selectorMode=selectwithfix on JVM startup. See the related GitHub issue for more details.

JVM Recommendations

To avoid long garbage collection (GC) pauses and latencies from the Java Virtual Machine (JVM), we recommend a maximum JVM heap of 16GB or less. If High-Density Memory is enabled, we recommend a maximum JVM heap of no more than 8GB. If you need more memory than these limits allow, scale horizontally by adding members rather than vertically by growing the heap.

General recommendations:

GC logs should be enabled
Minimum and maximum heap size should be equal
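These recommendations, together with the heap limits above, can be checked at startup. The following is a minimal sketch under stated assumptions: HeapSettingsCheck is a hypothetical class (not a Hazelcast API), it only inspects the standard -Xms, -Xmx, and common GC logging flags passed on the command line, and whether High-Density Memory is enabled is supplied by the caller.

```java
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical startup check for illustration; not a Hazelcast API.
final class HeapSettingsCheck {
    static final long GIB = 1024L * 1024 * 1024;

    /** Parses a JVM size value such as "512m", "16g" or "1048576" into bytes. */
    static long parseSize(String value) {
        String v = value.trim().toLowerCase();
        char unit = v.charAt(v.length() - 1);
        long factor;
        switch (unit) {
            case 'k': factor = 1024L; break;
            case 'm': factor = 1024L * 1024; break;
            case 'g': factor = GIB; break;
            default: return Long.parseLong(v); // plain byte count
        }
        return Long.parseLong(v.substring(0, v.length() - 1)) * factor;
    }

    /** Returns one message per recommendation the given JVM arguments do not meet. */
    static List<String> findIssues(List<String> jvmArgs, boolean highDensity) {
        long xms = -1;
        long xmx = -1;
        boolean gcLogging = false;
        for (String arg : jvmArgs) {
            if (arg.startsWith("-Xms")) {
                xms = parseSize(arg.substring(4));
            } else if (arg.startsWith("-Xmx")) {
                xmx = parseSize(arg.substring(4));
            } else if (arg.startsWith("-Xlog:gc") || arg.startsWith("-Xloggc:")
                    || arg.equals("-verbose:gc") || arg.equals("-XX:+PrintGCDetails")) {
                gcLogging = true;
            }
        }
        List<String> issues = new ArrayList<>();
        if (!gcLogging) {
            issues.add("GC logging does not appear to be enabled");
        }
        if (xms < 0 || xmx < 0) {
            issues.add("-Xms and -Xmx should both be set explicitly");
        } else if (xms != xmx) {
            issues.add("minimum and maximum heap size differ");
        }
        long limit = (highDensity ? 8 : 16) * GIB;
        if (xmx > limit) {
            issues.add("maximum heap exceeds " + (limit / GIB) + "GB");
        }
        return issues;
    }

    public static void main(String[] args) {
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        // Pass true instead of false when High-Density Memory is enabled.
        for (String issue : findIssues(jvmArgs, false)) {
            System.out.println("WARNING: " + issue);
        }
    }
}
```

The check reads the command-line flags rather than the heap sizes reported at runtime, because the reported maximum can be slightly below -Xmx with some collectors.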

For Java 9+:

G1GC is the default and recommended GC policy
No tuning is recommended unless needed

For Java 8:

Recommended GC policies are CMS and ParNewGC:
- -XX:CMSInitiatingOccupancyFraction=65
- -XX:+UseParNewGC
- -XX:+UseConcMarkSweepGC
For large heaps, G1GC is recommended, as above

Data Size Calculation Recommendations

Total data size should be calculated based on the combination of primary data and backup data. For example, if you have configured your cluster with a backup count of 2, then the total memory consumed is actually 3x the primary data size (primary + backup + backup). Partition sizes of 50MB or less are recommended.
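The arithmetic above can be written down as a small helper. This is a minimal sketch: DataSizeEstimate and its methods are hypothetical names, and the per-member figure assumes data is spread evenly across members, which a real cluster only approximates.

```java
// Hypothetical sizing helper for illustration; not a Hazelcast API.
final class DataSizeEstimate {

    /** Total size = primary copy plus one full copy per configured backup. */
    static long totalBytes(long primaryBytes, int backupCount) {
        if (primaryBytes < 0 || backupCount < 0) {
            throw new IllegalArgumentException("sizes and counts must be non-negative");
        }
        return Math.multiplyExact(primaryBytes, 1L + backupCount);
    }

    /** Rough per-member share, rounded up, assuming an even distribution. */
    static long bytesPerMember(long primaryBytes, int backupCount, int memberCount) {
        if (memberCount <= 0) {
            throw new IllegalArgumentException("memberCount must be positive");
        }
        long total = totalBytes(primaryBytes, backupCount);
        return (total + memberCount - 1) / memberCount;
    }

    public static void main(String[] args) {
        long gib = 1024L * 1024 * 1024;
        long primary = 40 * gib; // example value only
        long total = totalBytes(primary, 2);
        System.out.println("total GiB: " + total / gib);
        System.out.println("per member GiB (6 members): " + bytesPerMember(primary, 2, 6) / gib);
    }
}
```

With a backup count of 2, 40GB of primary data becomes 120GB in total, which is the 3x factor described above.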

Partition Size/Count Calculation Recommendations

The number of internal partitions a Hazelcast member uses can be configured, but must be uniform across all members in the cluster. An optimal partition count and size establish a balance between the number of partitions on each member and the amount of data in each partition. Consider the following when deciding on a partition count.

The partition count should be a prime number. This helps to minimize the collision of keys across partitions, ensuring more consistent lookup times.
A partition count which is too low constrains the cluster. The count should be large enough for a balanced data or task distribution so that each member does not manage too few partitions.
A partition size of 50MB or less typically ensures good performance. Larger clusters may be able to use up to 100MB partition sizes, but will likely also require larger JVM heap sizes to accommodate the increase in data flow.
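The considerations above can be combined into a starting estimate: divide the data size by the target partition size, then round up to the next prime. The following is a minimal sketch, not an official sizing formula. PartitionCountEstimate is a hypothetical class, the floor of 271 is an assumption of this sketch (it is Hazelcast's default partition count), and whether the data size should include backups depends on how you measure partition size. The result would be applied through the hazelcast.partition.count property and should still be validated by testing.

```java
// Hypothetical sizing helper for illustration; not a Hazelcast API.
final class PartitionCountEstimate {
    // Assumption of this sketch: never suggest fewer partitions than the default of 271.
    static final int MIN_PARTITIONS = 271;

    /** Trial-division primality test; sufficient for partition-count sized numbers. */
    static boolean isPrime(long n) {
        if (n < 2) return false;
        if (n % 2 == 0) return n == 2;
        for (long i = 3; i * i <= n; i += 2) {
            if (n % i == 0) return false;
        }
        return true;
    }

    /** Smallest prime count that keeps each partition at or below maxPartitionBytes. */
    static int suggest(long dataBytes, long maxPartitionBytes) {
        if (dataBytes < 0 || maxPartitionBytes <= 0) {
            throw new IllegalArgumentException("invalid sizes");
        }
        // Ceiling division: partitions needed to stay under the size target.
        long needed = (dataBytes + maxPartitionBytes - 1) / maxPartitionBytes;
        long candidate = Math.max(needed, MIN_PARTITIONS);
        while (!isPrime(candidate)) {
            candidate++;
        }
        return Math.toIntExact(candidate);
    }

    public static void main(String[] args) {
        long mib = 1024L * 1024;
        long gib = 1024L * mib;
        // Example values only: 100 GiB of data with a 50 MiB partition size target.
        System.out.println("suggested partition count: " + suggest(100 * gib, 50 * mib));
    }
}
```

Rounding up rather than down keeps every partition at or under the size target, at the cost of slightly more partitions per member.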

If you are a Hazelcast Enterprise customer using the High-Density Data Store with large data sizes, we recommend a large increase in partition count, starting with 5009 or higher.

The partition count cannot be easily changed after a cluster is created, so if you have a large cluster, be sure to test and set an optimum partition count prior to deployment. If you need to change the partition count after a cluster is already running, you will need to schedule a maintenance window to bring the cluster down entirely. If your cluster uses the Persistence or CP Persistence features, the persisted files will need to be removed after the cluster is shut down, as they contain references to the previous partition count. Once all member configurations are updated and any persisted data structure files are removed, the cluster can be safely restarted.