Diagnostics

Hazelcast offers an extended set of diagnostics plugins for both Hazelcast members and clients. A dedicated log file is used to write the diagnostics content, and a rolling file approach is used to prevent taking up too much disk space.

Enabling Diagnostics Logging

To enable diagnostics logging, you should specify the following properties on the member side:

-Dhazelcast.diagnostics.enabled=true
-Dhazelcast.diagnostics.metric.level=info
-Dhazelcast.diagnostics.invocation.sample.period.seconds=30
-Dhazelcast.diagnostics.pending.invocations.period.seconds=30
-Dhazelcast.diagnostics.slowoperations.period.seconds=30
-Dhazelcast.diagnostics.storeLatency.period.seconds=60

On the client side, you should specify the following properties:

-Dhazelcast.diagnostics.enabled=true
-Dhazelcast.diagnostics.metric.level=info

Diagnostics Log File

You can use the following property to specify the location of the diagnostics log file:

-Dhazelcast.diagnostics.directory=/your/log/directory

The name of the log file has the following format:

diagnostics-<host IP>#<port>-<unique ID>.log

The name of the log file can be prefixed with a custom string as shown below:

-Dhazelcast.diagnostics.filename.prefix=foobar

The content format of the diagnostics log file is depicted below:

<Date> BuildInfo[
	<log content for BuildInfo diagnostics plugin>]
<Date> SystemProperties[
	<log content for SystemProperties diagnostics plugin>]
<Date> ConfigProperties[
	<log content for ConfigProperties diagnostics plugin>]
<Date> Metrics[
	<log content for Metrics diagnostics plugin>]
<Date> SlowOperations[
	<log content for SlowOperations diagnostics plugin>]
<Date> HazelcastInstance[
	<log content for HazelcastInstance diagnostics plugin>]
...
...
...

A rolling file approach is used to prevent creating too much data. By default 10 files of 50MB each are allowed to exist. The size of the file can be changed using the following property:

-Dhazelcast.diagnostics.max.rolled.file.size.mb=100

You can also set the number of files using the following property:

-Dhazelcast.diagnostics.max.rolled.file.count=5

The aforementioned rolling file approach only works within the lifetime of a member. When you restart a member, the size and count of the diagnostics log files are reset, and this leads to more files being created and additional disk space consumption. We recommend you to use a file archiving strategy of your own (cron jobs, scripts, etc.) paying attention to keep the recent log files for some time to examine the causes for a crash, for example, when and if it happens.

Diagnostics Plugins

As it is stated in the introduction of this section and shown in the log file content above, diagnostics utility consists of plugins such as BuildInfo, SystemProperties and HazelcastInstance.

BuildInfo

It shows the detailed Hazelcast build information including the Hazelcast release number, Git revision number and whether you have Hazelcast IMDG Enterprise or not.

SystemProperties

It shows all the properties and their values in your system used by and configured for your Hazelcast installation. These are the properties starting with java (excluding java.awt), hazelcast, sun and os. It also includes the arguments that are used to startup the JVM.

ConfigProperties

It shows the Hazelcast properties and their values explicitly set by you either on the command line (with -D) or by using declarative/programmatic configuration.

Metrics

It shows a comprehensive log of what is happening in your Hazelcast system.

You can configure the frequency of dumping information to the log file using the following property:

hazelcast.diagnostics.metrics.period.seconds: Set a value in seconds. Its default value is 60 seconds.

SlowOperations

It shows the slow operations and invocations, See the SlowOperationDetector section for more information.

Invocations

It shows all kinds of statistics about current and past invocations including current pending invocations, history of invocations and slow history, i.e., all samples where the invocation took more than the defined threshold. Slow history does not only include the invocations where the operations took a lot of time, but it also includes any other invocations that have been obstructed.

Using the following properties, you can configure the frequency of scanning all pending invocations and the threshold that makes an invocation to be considered as slow:

hazelcast.diagnostics.invocation.sample.period.seconds: Set a value in seconds. Its default value is 60 seconds.
hazelcast.diagnostics.invocation.slow.threshold.seconds: Set a value in seconds. Its default value is 5 seconds.

InvocationProfiler

It shows invocation latencies for each operation. See an example output below:

06-05-2021 17:15:29 1557152129944 Invocations[
                          Pending[]
                          History[]
                          SlowHistory[]
                          Profiler[
                                  com.hazelcast.map.impl.query.QueryOperation[
                                          count=400
                                          totalTime(us)=56,000
                                          avg(us)=140
                                          max(us)=3,000
                                          latency-distribution[
                                                  0..99us=346
                                                  800..1599us=53
                                                  1600..3199us=1]]
                                  com.hazelcast.map.impl.operation.GetOperation[
                                          count=100
                                          totalTime(us)=19,000
                                          avg(us)=190
                                          max(us)=1,000
                                          latency-distribution[
                                                  0..99us=81
                                                  800..1599us=19]]

You can control the frequency of scanning all invocations using the following system property:

hazelcast.diagnostics.invocation-profiler.period.seconds: Set a value in seconds. Its default value is 5 seconds. You can set it to 0 to disable the plugin.

You can increase this period if you would like to decrease the logging noise.

OperationProfiler

It measures the time an operation runs on an operation thread; if the operation is a blocking one or being offloaded, only the time on the operation thread is measured. See an example output below:

06-05-2021 14:53:48 1595332428248 OperationsProfiler[
                          com.hazelcast.map.impl.operation.GetOperation[
                                  count=502,501
                                  totalTime(us)=1,690,645
                                  avg(us)=3
                                  max(us)=462
                                  latency-distribution[
                                          1..2us=875
                                          2..4us=359,876
                                          4..8us=131,775
                                          8..16us=8,720
                                          16..32us=887
                                          32..64us=178
                                          64..128us=122
                                          128..256us=62
                                          256..512us=6]]

You can control the frequency of scanning all operations using the following system property:

hazelcast.diagnostics.operation-profiler.period.seconds: Set a value in seconds. Its default value is 5 seconds. You can set it to 0 to disable the plugin.

HazelcastInstance

It shows the basic state of your Hazelcast cluster including the count and addresses of current members and the address of oldest cluster member. It is useful to get a fast impression of the cluster without needing to analyze a lot of data.

You can configure the frequency at which the cluster information is dumped to the log file using the following property:

hazelcast.diagnostics.memberinfo.period.second: Set a value in seconds. Its default value is 60 seconds.

EventQueue

It checks the event queues in the data structures and samples the event types if the queue size is above a certain threshold. It is useful to figure out why the event queue is running full.

hazelcast.diagnostics.event.queue.period.seconds: Duration, in seconds, that this plugin runs, gathers information and writes to the diagnostics log file. When set to 0 (its default value), it is disabled.
hazelcast.diagnostics.event.queue.threshold: Minimum number of events in the queue before it is being sampled. Its default value is 1000.
hazelcast.diagnostics.event.queue.samples: Number of samples to take from the event queue. Increasing the number of samples gives more accuracy of the content, but it has a negative performance effect. Its default value is 100.

An example output for a Hazelcast map is as follows:

17-04-2019 17:36:37 EventQueues[
    worker=1[
        eventCount=441
        sampleCount=100
        samples[
            IMap 'myMap' ADDED sampleCount=51 51.000%
            IMap 'myMap' REMOVED sampleCount=49 49.000%]]

SystemLog

It shows the activities in your cluster including when a connection/member is added or removed and if there is a change in the lifecycle of the cluster. It also includes the reasons for connection closings.

You can enable or disable the system log diagnostics plugin, and configure whether it shows information about partition migrations using the following properties:

hazelcast.diagnostics.systemlog.enabled: Its default value is true.
hazelcast.diagnostics.systemlog.partitions: Its default value is false. Please note that if you enable this, you may get a lot of log entries if you have many partitions.

StoreLatency

It shows statistics including the count of methods for each store (load, loadAll, loadAllKeys, etc.), average and maximum latencies for each store method calls and latency distributions for each store. The following is an example output snippet as part of the diagnostics log file for Hazelcast MapStore:

17-9-2019 13:12:34 MapStoreLatency[
    map[
        loadAllKeys[
            count=1
            totalTime(us)=8
            avg(us)=8
            max(us)=8
            latency-distribution[
                0..99us=1]]
        load[
            count=100
            totalTime(us)=4,632,190
            avg(us)=46,321
            max(us)=99,178
            latency-distribution[
                0..99us=1
                1600..3199us=3
                3200..6399us=3
                6400..12799us=7
                12800..25599us=13
                25600..51199us=32
                51200..102399us=41]]]]

According to your store usage, a similar output can be seen for Hazelcast JCache, Queue and Ringbuffer with persistent datastores.

You can control the StoreLatency plugin using the following properties:

hazelcast.diagnostics.storeLatency.period.seconds: The frequency this plugin is writing the collected information to the disk. By default it is disabled. A sensible production value would be 60 seconds.
hazelcast.diagnostics.storeLatency.reset.period.seconds: The period of resetting the statistics. If, for example, it is set as 300 (5 minutes), all the statistics are cleared for every 5 minutes. By default it is 0, meaning that statistics are not reset.

OperationHeartbeats

It shows the deviation between member/member operation heartbeats. Each member, regardless if there is an operation running on behalf of that member, sends an operation heartbeat to every other member. It contains a listing of all callIds of the running operations from a given member. This plugin also works fine between members/lite-members.

Because this operation heartbeat is sent periodically; by default 1/4 of the operation call timeout of 60 seconds, we would expect an operation heartbeat to be received every 15 seconds. Operation heartbeats are high priority packets (so they overtake regular packets) and are processed by an isolated thread in the invocation monitor. If there is any deviation in the frequency of receiving these packets, it may be due to the problems such as network latencies.

The following shows an example of the output where an operation heartbeat has not been received for 37 seconds:

20-7-2019 11:12:55 OperationHeartbeats[
    member[10.212.1.119]:5701[
        deviation(%)=146.6666717529297
        noHeartbeat(ms)=37,000
        lastHeartbeat(ms)=1,500,538,375,603
        lastHeartbeat(date-time)=20-7-2017 11:12:55
        now(ms)=1,500,538,338,603
        now(date-time)=20-7-2017 11:12:18]]]

The OperationHeartbeats plugin is enabled by default since it has very little overhead and only prints to the diagnostics file if the maximum deviation percentage (explained below) is exceeded.

You can control the OperationHeartbeats plugin using the following properties:

hazelcast.diagnostics.operation-heartbeat.seconds: The frequency this plugin is writing the collected information to the disk. It is configured to be 10 seconds by default. 0 disables the plugin.
hazelcast.diagnostics.operation-heartbeat.max-deviation-percentage: The maximum allowed deviation percentage. Its default value is 33. For example, with a default 60 call timeout and operation heartbeat interval being 15 seconds, the maximum deviation with a deviation-percentage of 33, is 5 seconds. So there is no problem if a packet is arrived after 19 seconds, but if it arrives after 21 seconds, then the plugin renders.

MemberHeartbeats

This plugin looks a lot like the OperationHeartbeats plugin, but instead of relying on operation heartbeats to determine the deviation, it relies on member/member cluster heartbeats. Every member sends a heartbeat to other members periodically (by default every 5 seconds).

Just like the OperationHeartbeats, the MemberHeartbeats plugin can be used to detect if there are networking problems long before they actually lead to problems such as split-brain syndromes.

The following shows an example of the output where no member/member heartbeat has been received for 9 seconds:

20-7-2019 19:32:22 MemberHeartbeats[
    member[10.212.1.119]:5701[
        deviation(%)=80.0
        noHeartbeat(ms)=9,000
        lastHeartbeat(ms)=1,500,568,333,645
        lastHeartbeat(date-time)=20-7-2017 19:32:13
        now(ms)=1,500,568,342,645
        now(date-time)=20-7-2017 19:32:22]]

The MemberHeartbeats plugin is enabled by default since it has very little overhead and only prints to the diagnostics file if the maximum deviation percentage (explained below) is exceeded.

You can control the MemberHeartbeats plugin using the following properties:

hazelcast.diagnostics.member-heartbeat.seconds: The frequency this plugin is writing the collected information to the disk. It is configured to be 10 seconds by default. 0 disables the plugin.
hazelcast.diagnostics.member-heartbeat.max-deviation-percentage: The maximum allowed deviation percentage. Its default value is 100. For example, if the interval of member/member heartbeats is 5 seconds, a 100% deviation is fine with heartbeats arriving up to 5 seconds after they are expected. So a heartbeat arriving after 9 seconds is not rendered, but a heartbeat received after 11 seconds is rendered.

OperationThreadSamples

This plugin samples the operation threads and checks the running operations/tasks. Hazelcast has the slow operation detector which is useful for very slow operations. But it may not be efficient for high volumes of not too slow operations. Using the OperationThreadSamples plugin it is more clear to see which operations are actually running.

You can control the OperationThreadSamples plugin using the following properties:

hazelcast.diagnostics.operationthreadsamples.period.seconds: The frequency this plugin is writing the collected information to the disk. An efficient value for production would be 30, 60 or more seconds. 0, which is the default value, disables the plugin.
hazelcast.diagnostics.operationthreadsamples.sampler.period.millis: The period in milliseconds between taking samples. The lower the value, the higher the overhead but also the higher the precision. Its default value is 100 ms.
hazelcast.diagnostics.operationthreadsamples.includeName: Specifies whether the data structures' name pointed to by the operation (if available) should be included in the name of the samples. Its default value is false.

The following shows an example of the output when the property hazelcast.diagnostics.operationthreadsamples.includeName is false:

28-08-2019 07:40:07 1535442007330 OperationThreadSamples[
    Partition[
        com.hazelcast.map.impl.operation.MapSizeOperation=304623 85.6927%
        com.hazelcast.map.impl.operation.PutOperation=33061 9.300304%
        com.hazelcast.map.impl.operation.GetOperation=17799 5.0069904%]
    Generic[
        com.hazelcast.client.impl.ClientEngineImpl$PriorityPartitionSpecificRunnable=2308 35.738617%
        com.hazelcast.nio.Packet=1767 27.361412%
        com.hazelcast.internal.cluster.impl.operations.JoinRequestOp=821 12.712914%
        com.hazelcast.spi.impl.operationservice.impl.operations.PartitionIteratingOperation=278 4.3047385%
        com.hazelcast.internal.cluster.impl.operations.HeartbeatOp=93 1.4400743%
        com.hazelcast.internal.cluster.impl.operations.OnJoinOp=89 1.3781357%
        com.hazelcast.internal.cluster.impl.operations.WhoisMasterOp=75 1.1613503%
        com.hazelcast.client.impl.operations.ClientReAuthOperation=33 0.51099414%]]

As can be seen above, the MapSizeOperations run on the operation threads most of the time.

WanDiagnostics

The WAN diagnostics plugin provides information about the WAN replication.

It is disabled by default and can be configured using the following property:

hazelcast.diagnostics.wan.period.seconds: The frequency this plugin is writing the collected information to the disk. 0 disables the plugin.

The following shows an example of the output:

10-11-2019 14:11:32 1510319492497 WanBatchSenderLatency[
    targetClusterName[
        [127.0.0.1]:5801[
            count=1
            totalTime(us)=2,010,567
            avg(us)=2,010,567
            max(us)=2,010,567
            latency-distribution[
                1638400..3276799us=1]]
         [127.0.0.1]:5802[
             count=1
             totalTime(us)=1,021,867
             avg(us)=1,021,867
             max(us)=1,021,867
             latency-distribution[
                 819200..1638399us=1]]]]