Hazelcast offers an extended set of diagnostics plugins for both Hazelcast members and clients. A dedicated log file is used to write the diagnostics content, and a rolling file approach is used to prevent taking up too much disk space.
To enable diagnostics logging, you should specify the following properties on the member side:
-Dhazelcast.diagnostics.enabled=true -Dhazelcast.diagnostics.metric.level=info -Dhazelcast.diagnostics.invocation.sample.period.seconds=30 -Dhazelcast.diagnostics.pending.invocations.period.seconds=30 -Dhazelcast.diagnostics.slowoperations.period.seconds=30 -Dhazelcast.diagnostics.storeLatency.period.seconds=60
On the client side, you should specify the following properties:
You can use the following property to specify the location of the diagnostics log file:
The name of the log file has the following format:
diagnostics-<host IP>#<port>-<unique ID>.log
The name of the log file can be prefixed with a custom string as shown below:
The content format of the diagnostics log file is depicted below:
<Date> BuildInfo[ <log content for BuildInfo diagnostics plugin>] <Date> SystemProperties[ <log content for SystemProperties diagnostics plugin>] <Date> ConfigProperties[ <log content for ConfigProperties diagnostics plugin>] <Date> Metrics[ <log content for Metrics diagnostics plugin>] <Date> SlowOperations[ <log content for SlowOperations diagnostics plugin>] <Date> HazelcastInstance[ <log content for HazelcastInstance diagnostics plugin>] ... ... ...
A rolling file approach is used to prevent creating too much data. By default 10 files of 50MB each are allowed to exist. The size of the file can be changed using the following property:
You can also set the number of files using the following property:
|The aforementioned rolling file approach only works within the lifetime of a member. When you restart a member, the size and count of the diagnostics log files are reset, and this leads to more files being created and additional disk space consumption. We recommend you to use a file archiving strategy of your own (cron jobs, scripts, etc.) paying attention to keep the recent log files for some time to examine the causes for a crash, for example, when and if it happens.|
As it is stated in the introduction of this section and shown in the log file content above, diagnostics utility consists of plugins such as BuildInfo, SystemProperties and HazelcastInstance.
It shows the detailed Hazelcast build information including the Hazelcast release number,
Git revision number and whether you have Hazelcast IMDG Enterprise or not.
It shows all the properties and their values in your system used by and configured for
your Hazelcast installation. These are the properties starting with
os. It also includes the arguments that are used to startup the JVM.
It shows the Hazelcast properties and their values explicitly set by you either
on the command line (with
-D) or by using declarative/programmatic configuration.
It shows a comprehensive log of what is happening in your Hazelcast system.
You can configure the frequency of dumping information to the log file using the following property:
hazelcast.diagnostics.metrics.period.seconds: Set a value in seconds. Its default value is
It shows the slow operations and invocations, See the SlowOperationDetector section for more information.
It shows all kinds of statistics about current and past invocations including current pending invocations, history of invocations and slow history, i.e., all samples where the invocation took more than the defined threshold. Slow history does not only include the invocations where the operations took a lot of time, but it also includes any other invocations that have been obstructed.
Using the following properties, you can configure the frequency of scanning all pending invocations and the threshold that makes an invocation to be considered as slow:
hazelcast.diagnostics.invocation.sample.period.seconds: Set a value in seconds. Its default value is
hazelcast.diagnostics.invocation.slow.threshold.seconds: Set a value in seconds. Its default value is
It shows invocation latencies for each operation. See an example output below:
06-05-2021 17:15:29 1557152129944 Invocations[ Pending History SlowHistory Profiler[ com.hazelcast.map.impl.query.QueryOperation[ count=400 totalTime(us)=56,000 avg(us)=140 max(us)=3,000 latency-distribution[ 0..99us=346 800..1599us=53 1600..3199us=1]] com.hazelcast.map.impl.operation.GetOperation[ count=100 totalTime(us)=19,000 avg(us)=190 max(us)=1,000 latency-distribution[ 0..99us=81 800..1599us=19]]
You can control the frequency of scanning all invocations using the following system property:
hazelcast.diagnostics.invocation-profiler.period.seconds: Set a value in seconds. Its default value is
5seconds. You can set it to
0to disable the plugin.
You can increase this period if you would like to decrease the logging noise.
It measures the time an operation runs on an operation thread; if the operation is a blocking one or being offloaded, only the time on the operation thread is measured. See an example output below:
06-05-2021 14:53:48 1595332428248 OperationsProfiler[ com.hazelcast.map.impl.operation.GetOperation[ count=502,501 totalTime(us)=1,690,645 avg(us)=3 max(us)=462 latency-distribution[ 1..2us=875 2..4us=359,876 4..8us=131,775 8..16us=8,720 16..32us=887 32..64us=178 64..128us=122 128..256us=62 256..512us=6]]
You can control the frequency of scanning all operations using the following system property:
hazelcast.diagnostics.operation-profiler.period.seconds: Set a value in seconds. Its default value is
5seconds. You can set it to
0to disable the plugin.
It shows the basic state of your Hazelcast cluster including the count and addresses of current members and the address of oldest cluster member. It is useful to get a fast impression of the cluster without needing to analyze a lot of data.
You can configure the frequency at which the cluster information is dumped to the log file using the following property:
hazelcast.diagnostics.memberinfo.period.second: Set a value in seconds. Its default value is
It checks the event queues in the data structures and samples the event types if the queue size is above a certain threshold. It is useful to figure out why the event queue is running full.
hazelcast.diagnostics.event.queue.period.seconds: Duration, in seconds, that this plugin runs, gathers information and writes to the diagnostics log file. When set to
0(its default value), it is disabled.
hazelcast.diagnostics.event.queue.threshold: Minimum number of events in the queue before it is being sampled. Its default value is
hazelcast.diagnostics.event.queue.samples: Number of samples to take from the event queue. Increasing the number of samples gives more accuracy of the content, but it has a negative performance effect. Its default value is
An example output for a Hazelcast map is as follows:
17-04-2019 17:36:37 EventQueues[ worker=1[ eventCount=441 sampleCount=100 samples[ IMap 'myMap' ADDED sampleCount=51 51.000% IMap 'myMap' REMOVED sampleCount=49 49.000%]]
It shows the activities in your cluster including when a connection/member is added or removed and if there is a change in the lifecycle of the cluster. It also includes the reasons for connection closings.
You can enable or disable the system log diagnostics plugin, and configure whether it shows information about partition migrations using the following properties:
hazelcast.diagnostics.systemlog.enabled: Its default value is
hazelcast.diagnostics.systemlog.partitions: Its default value is
false. Please note that if you enable this, you may get a lot of log entries if you have many partitions.
It shows statistics including the count of methods for each store (
loadAllKeys, etc.), average and maximum latencies for each store method calls and
latency distributions for each store. The following is an example output snippet as part of
the diagnostics log file for Hazelcast MapStore:
17-9-2019 13:12:34 MapStoreLatency[ map[ loadAllKeys[ count=1 totalTime(us)=8 avg(us)=8 max(us)=8 latency-distribution[ 0..99us=1]] load[ count=100 totalTime(us)=4,632,190 avg(us)=46,321 max(us)=99,178 latency-distribution[ 0..99us=1 1600..3199us=3 3200..6399us=3 6400..12799us=7 12800..25599us=13 25600..51199us=32 51200..102399us=41]]]]
According to your store usage, a similar output can be seen for Hazelcast JCache, Queue and Ringbuffer with persistent datastores.
You can control the StoreLatency plugin using the following properties:
hazelcast.diagnostics.storeLatency.period.seconds: The frequency this plugin is writing the collected information to the disk. By default it is disabled. A sensible production value would be
hazelcast.diagnostics.storeLatency.reset.period.seconds: The period of resetting the statistics. If, for example, it is set as 300 (5 minutes), all the statistics are cleared for every 5 minutes. By default it is 0, meaning that statistics are not reset.
It shows the deviation between member/member operation heartbeats.
Each member, regardless if there is an operation running on behalf of that member,
sends an operation heartbeat to every other member. It contains a listing of all
callIds of the running
operations from a given member.
This plugin also works fine between members/lite-members.
Because this operation heartbeat is sent periodically; by default 1/4 of the operation call timeout of 60 seconds, we would expect an operation heartbeat to be received every 15 seconds. Operation heartbeats are high priority packets (so they overtake regular packets) and are processed by an isolated thread in the invocation monitor. If there is any deviation in the frequency of receiving these packets, it may be due to the problems such as network latencies.
The following shows an example of the output where an operation heartbeat has not been received for 37 seconds:
20-7-2019 11:12:55 OperationHeartbeats[ member[10.212.1.119]:5701[ deviation(%)=146.6666717529297 noHeartbeat(ms)=37,000 lastHeartbeat(ms)=1,500,538,375,603 lastHeartbeat(date-time)=20-7-2017 11:12:55 now(ms)=1,500,538,338,603 now(date-time)=20-7-2017 11:12:18]]]
The OperationHeartbeats plugin is enabled by default since it has very little overhead and only prints to the diagnostics file if the maximum deviation percentage (explained below) is exceeded.
You can control the OperationHeartbeats plugin using the following properties:
hazelcast.diagnostics.operation-heartbeat.seconds: The frequency this plugin is writing the collected information to the disk. It is configured to be 10 seconds by default. 0 disables the plugin.
hazelcast.diagnostics.operation-heartbeat.max-deviation-percentage: The maximum allowed deviation percentage. Its default value is 33. For example, with a default 60 call timeout and operation heartbeat interval being 15 seconds, the maximum deviation with a deviation-percentage of 33, is 5 seconds. So there is no problem if a packet is arrived after 19 seconds, but if it arrives after 21 seconds, then the plugin renders.
This plugin looks a lot like the OperationHeartbeats plugin, but instead of relying on operation heartbeats to determine the deviation, it relies on member/member cluster heartbeats. Every member sends a heartbeat to other members periodically (by default every 5 seconds).
Just like the OperationHeartbeats, the MemberHeartbeats plugin can be used to detect if there are networking problems long before they actually lead to problems such as split-brain syndromes.
The following shows an example of the output where no member/member heartbeat has been received for 9 seconds:
20-7-2019 19:32:22 MemberHeartbeats[ member[10.212.1.119]:5701[ deviation(%)=80.0 noHeartbeat(ms)=9,000 lastHeartbeat(ms)=1,500,568,333,645 lastHeartbeat(date-time)=20-7-2017 19:32:13 now(ms)=1,500,568,342,645 now(date-time)=20-7-2017 19:32:22]]
The MemberHeartbeats plugin is enabled by default since it has very little overhead and only prints to the diagnostics file if the maximum deviation percentage (explained below) is exceeded.
You can control the MemberHeartbeats plugin using the following properties:
hazelcast.diagnostics.member-heartbeat.seconds: The frequency this plugin is writing the collected information to the disk. It is configured to be 10 seconds by default. 0 disables the plugin.
hazelcast.diagnostics.member-heartbeat.max-deviation-percentage: The maximum allowed deviation percentage. Its default value is 100. For example, if the interval of member/member heartbeats is 5 seconds, a 100% deviation is fine with heartbeats arriving up to 5 seconds after they are expected. So a heartbeat arriving after 9 seconds is not rendered, but a heartbeat received after 11 seconds is rendered.
This plugin samples the operation threads and checks the running operations/tasks. Hazelcast has the slow operation detector which is useful for very slow operations. But it may not be efficient for high volumes of not too slow operations. Using the OperationThreadSamples plugin it is more clear to see which operations are actually running.
You can control the OperationThreadSamples plugin using the following properties:
hazelcast.diagnostics.operationthreadsamples.period.seconds: The frequency this plugin is writing the collected information to the disk. An efficient value for production would be 30, 60 or more seconds. 0, which is the default value, disables the plugin.
hazelcast.diagnostics.operationthreadsamples.sampler.period.millis: The period in milliseconds between taking samples. The lower the value, the higher the overhead but also the higher the precision. Its default value is 100 ms.
hazelcast.diagnostics.operationthreadsamples.includeName: Specifies whether the data structures' name pointed to by the operation (if available) should be included in the name of the samples. Its default value is false.
The following shows an example of the output when the property
hazelcast.diagnostics.operationthreadsamples.includeName is false:
28-08-2019 07:40:07 1535442007330 OperationThreadSamples[ Partition[ com.hazelcast.map.impl.operation.MapSizeOperation=304623 85.6927% com.hazelcast.map.impl.operation.PutOperation=33061 9.300304% com.hazelcast.map.impl.operation.GetOperation=17799 5.0069904%] Generic[ com.hazelcast.client.impl.ClientEngineImpl$PriorityPartitionSpecificRunnable=2308 35.738617% com.hazelcast.nio.Packet=1767 27.361412% com.hazelcast.internal.cluster.impl.operations.JoinRequestOp=821 12.712914% com.hazelcast.spi.impl.operationservice.impl.operations.PartitionIteratingOperation=278 4.3047385% com.hazelcast.internal.cluster.impl.operations.HeartbeatOp=93 1.4400743% com.hazelcast.internal.cluster.impl.operations.OnJoinOp=89 1.3781357% com.hazelcast.internal.cluster.impl.operations.WhoisMasterOp=75 1.1613503% com.hazelcast.client.impl.operations.ClientReAuthOperation=33 0.51099414%]]
As can be seen above, the
MapSizeOperations run on the operation threads most of the time.
The WAN diagnostics plugin provides information about the WAN replication.
It is disabled by default and can be configured using the following property:
hazelcast.diagnostics.wan.period.seconds: The frequency this plugin is writing the collected information to the disk. 0 disables the plugin.
The following shows an example of the output:
10-11-2019 14:11:32 1510319492497 WanBatchSenderLatency[ targetClusterName[ [127.0.0.1]:5801[ count=1 totalTime(us)=2,010,567 avg(us)=2,010,567 max(us)=2,010,567 latency-distribution[ 1638400..3276799us=1]] [127.0.0.1]:5802[ count=1 totalTime(us)=1,021,867 avg(us)=1,021,867 max(us)=1,021,867 latency-distribution[ 819200..1638399us=1]]]]