Diagnostics
Hazelcast offers an extended set of diagnostics plugins for both Hazelcast members and clients. A dedicated log file is used to write the diagnostics content, and a rolling file approach is used to prevent taking up too much disk space.
Enabling Diagnostics Logging
To enable diagnostics logging, you should specify the following properties on the member side:
-Dhazelcast.diagnostics.enabled=true
-Dhazelcast.diagnostics.metric.level=info
-Dhazelcast.diagnostics.invocation.sample.period.seconds=30
-Dhazelcast.diagnostics.pending.invocations.period.seconds=30
-Dhazelcast.diagnostics.slowoperations.period.seconds=30
-Dhazelcast.diagnostics.storeLatency.period.seconds=60
On the client side, you should specify the following properties:
-Dhazelcast.diagnostics.enabled=true
-Dhazelcast.diagnostics.metric.level=info
Diagnostics Log File
You can use the following property to specify the location of the diagnostics log file:
-Dhazelcast.diagnostics.directory=/your/log/directory
The name of the log file has the following format:
diagnostics-<host IP>#<port>-<unique ID>.log
The name of the log file can be prefixed with a custom string as shown below:
-Dhazelcast.diagnostics.filename.prefix=foobar
The content format of the diagnostics log file is depicted below:
<Date> BuildInfo[
<log content for BuildInfo diagnostics plugin>]
<Date> SystemProperties[
<log content for SystemProperties diagnostics plugin>]
<Date> ConfigProperties[
<log content for ConfigProperties diagnostics plugin>]
<Date> Metrics[
<log content for Metrics diagnostics plugin>]
<Date> SlowOperations[
<log content for SlowOperations diagnostics plugin>]
<Date> HazelcastInstance[
<log content for HazelcastInstance diagnostics plugin>]
...
...
...
A rolling file approach is used to prevent creating too much data. By default 10 files of 50MB each are allowed to exist. The size of the file can be changed using the following property:
-Dhazelcast.diagnostics.max.rolled.file.size.mb=100
You can also set the number of files using the following property:
-Dhazelcast.diagnostics.max.rolled.file.count=5
The aforementioned rolling file approach only works within the lifetime of a member. When you restart a member, the size and count of the diagnostics log files are reset, and this leads to more files being created and additional disk space consumption. We recommend you to use a file archiving strategy of your own (cron jobs, scripts, etc.) paying attention to keep the recent log files for some time to examine the causes for a crash, for example, when and if it happens. |
Diagnostics Plugins
As it is stated in the introduction of this section and shown in the log file content above, diagnostics utility consists of plugins such as BuildInfo, SystemProperties and HazelcastInstance.
BuildInfo
It shows the detailed Hazelcast build information including the Hazelcast release number,
Git
revision number and whether you have Hazelcast IMDG Enterprise or not.
SystemProperties
It shows all the properties and their values in your system used by and configured for
your Hazelcast installation. These are the properties starting with java
(excluding java.awt
),
hazelcast
, sun
and os
. It also includes the arguments that are used to startup the JVM.
ConfigProperties
It shows the Hazelcast properties and their values explicitly set by you either
on the command line (with -D
) or by using declarative/programmatic configuration.
Metrics
It shows a comprehensive log of what is happening in your Hazelcast system.
You can configure the frequency of dumping information to the log file using the following property:
-
hazelcast.diagnostics.metrics.period.seconds
: Set a value in seconds. Its default value is60
seconds.
SlowOperations
It shows the slow operations and invocations, See the SlowOperationDetector section for more information.
Invocations
It shows all kinds of statistics about current and past invocations including current pending invocations, history of invocations and slow history, i.e., all samples where the invocation took more than the defined threshold. Slow history does not only include the invocations where the operations took a lot of time, but it also includes any other invocations that have been obstructed.
Using the following properties, you can configure the frequency of scanning all pending invocations and the threshold that makes an invocation to be considered as slow:
-
hazelcast.diagnostics.invocation.sample.period.seconds
: Set a value in seconds. Its default value is60
seconds. -
hazelcast.diagnostics.invocation.slow.threshold.seconds
: Set a value in seconds. Its default value is5
seconds.
HazelcastInstance
It shows the basic state of your Hazelcast cluster including the count and addresses of current members and the address of oldest cluster member. It is useful to get a fast impression of the cluster without needing to analyze a lot of data.
You can configure the frequency at which the cluster information is dumped to the log file using the following property:
-
hazelcast.diagnostics.memberinfo.period.second
: Set a value in seconds. Its default value is60
seconds.
EventQueue
It checks the event queues in the data structures and samples the event types if the queue size is above a certain threshold. It is useful to figure out why the event queue is running full.
-
hazelcast.diagnostics.event.queue.period.seconds
: Duration, in seconds, that this plugin runs, gathers information and writes to the diagnostics log file. When set to0
(its default value), it is disabled. -
hazelcast.diagnostics.event.queue.threshold
: Minimum number of events in the queue before it is being sampled. Its default value is1000
. -
hazelcast.diagnostics.event.queue.samples
: Number of samples to take from the event queue. Increasing the number of samples gives more accuracy of the content, but it has a negative performance effect. Its default value is100
.
An example output for a Hazelcast map is as follows:
17-04-2019 17:36:37 EventQueues[
worker=1[
eventCount=441
sampleCount=100
samples[
IMap 'myMap' ADDED sampleCount=51 51.000%
IMap 'myMap' REMOVED sampleCount=49 49.000%]]
SystemLog
It shows the activities in your cluster including when a connection/member is added or removed and if there is a change in the lifecycle of the cluster. It also includes the reasons for connection closings.
You can enable or disable the system log diagnostics plugin, and configure whether it shows information about partition migrations using the following properties:
-
hazelcast.diagnostics.systemlog.enabled
: Its default value istrue
. -
hazelcast.diagnostics.systemlog.partitions
: Its default value isfalse
. Please note that if you enable this, you may get a lot of log entries if you have many partitions.
StoreLatency
It shows statistics including the count of methods for each store (load
, loadAll
,
loadAllKeys
, etc.), average and maximum latencies for each store method calls and
latency distributions for each store. The following is an example output snippet as part of
the diagnostics log file for Hazelcast MapStore:
17-9-2019 13:12:34 MapStoreLatency[
map[
loadAllKeys[
count=1
totalTime(us)=8
avg(us)=8
max(us)=8
latency-distribution[
0..99us=1]]
load[
count=100
totalTime(us)=4,632,190
avg(us)=46,321
max(us)=99,178
latency-distribution[
0..99us=1
1600..3199us=3
3200..6399us=3
6400..12799us=7
12800..25599us=13
25600..51199us=32
51200..102399us=41]]]]
According to your store usage, a similar output can be seen for Hazelcast JCache, Queue and Ringbuffer with persistent datastores.
You can control the StoreLatency plugin using the following properties:
-
hazelcast.diagnostics.storeLatency.period.seconds
: The frequency this plugin is writing the collected information to the disk. By default it is disabled. A sensible production value would be60
seconds. -
hazelcast.diagnostics.storeLatency.reset.period.seconds
: The period of resetting the statistics. If, for example, it is set as 300 (5 minutes), all the statistics are cleared for every 5 minutes. By default it is 0, meaning that statistics are not reset.
OperationHeartbeats
It shows the deviation between member/member operation heartbeats.
Each member, regardless if there is an operation running on behalf of that member,
sends an operation heartbeat to every other member. It contains a listing of all callId
s of the running
operations from a given member.
This plugin also works fine between members/lite-members.
Because this operation heartbeat is sent periodically; by default 1/4 of the operation call timeout of 60 seconds, we would expect an operation heartbeat to be received every 15 seconds. Operation heartbeats are high priority packets (so they overtake regular packets) and are processed by an isolated thread in the invocation monitor. If there is any deviation in the frequency of receiving these packets, it may be due to the problems such as network latencies.
The following shows an example of the output where an operation heartbeat has not been received for 37 seconds:
20-7-2019 11:12:55 OperationHeartbeats[
member[10.212.1.119]:5701[
deviation(%)=146.6666717529297
noHeartbeat(ms)=37,000
lastHeartbeat(ms)=1,500,538,375,603
lastHeartbeat(date-time)=20-7-2017 11:12:55
now(ms)=1,500,538,338,603
now(date-time)=20-7-2017 11:12:18]]]
The OperationHeartbeats plugin is enabled by default since it has very little overhead and only prints to the diagnostics file if the maximum deviation percentage (explained below) is exceeded.
You can control the OperationHeartbeats plugin using the following properties:
-
hazelcast.diagnostics.operation-heartbeat.seconds
: The frequency this plugin is writing the collected information to the disk. It is configured to be 10 seconds by default. 0 disables the plugin. -
hazelcast.diagnostics.operation-heartbeat.max-deviation-percentage
: The maximum allowed deviation percentage. Its default value is 33. For example, with a default 60 call timeout and operation heartbeat interval being 15 seconds, the maximum deviation with a deviation-percentage of 33, is 5 seconds. So there is no problem if a packet is arrived after 19 seconds, but if it arrives after 21 seconds, then the plugin renders.
MemberHeartbeats
This plugin looks a lot like the OperationHeartbeats plugin, but instead of relying on operation heartbeats to determine the deviation, it relies on member/member cluster heartbeats. Every member sends a heartbeat to other members periodically (by default every 5 seconds).
Just like the OperationHeartbeats, the MemberHeartbeats plugin can be used to detect if there are networking problems long before they actually lead to problems such as split-brain syndromes.
The following shows an example of the output where no member/member heartbeat has been received for 9 seconds:
20-7-2019 19:32:22 MemberHeartbeats[
member[10.212.1.119]:5701[
deviation(%)=80.0
noHeartbeat(ms)=9,000
lastHeartbeat(ms)=1,500,568,333,645
lastHeartbeat(date-time)=20-7-2017 19:32:13
now(ms)=1,500,568,342,645
now(date-time)=20-7-2017 19:32:22]]
The MemberHeartbeats plugin is enabled by default since it has very little overhead and only prints to the diagnostics file if the maximum deviation percentage (explained below) is exceeded.
You can control the MemberHeartbeats plugin using the following properties:
-
hazelcast.diagnostics.member-heartbeat.seconds
: The frequency this plugin is writing the collected information to the disk. It is configured to be 10 seconds by default. 0 disables the plugin. -
hazelcast.diagnostics.member-heartbeat.max-deviation-percentage
: The maximum allowed deviation percentage. Its default value is 100. For example, if the interval of member/member heartbeats is 5 seconds, a 100% deviation is fine with heartbeats arriving up to 5 seconds after they are expected. So a heartbeat arriving after 9 seconds is not rendered, but a heartbeat received after 11 seconds is rendered.
OperationThreadSamples
This plugin samples the operation threads and checks the running operations/tasks. Hazelcast has the slow operation detector which is useful for very slow operations. But it may not be efficient for high volumes of not too slow operations. Using the OperationThreadSamples plugin it is more clear to see which operations are actually running.
You can control the OperationThreadSamples plugin using the following properties:
-
hazelcast.diagnostics.operationthreadsamples.period.seconds
: The frequency this plugin is writing the collected information to the disk. An efficient value for production would be 30, 60 or more seconds. 0, which is the default value, disables the plugin. -
hazelcast.diagnostics.operationthreadsamples.sampler.period.millis
: The period in milliseconds between taking samples. The lower the value, the higher the overhead but also the higher the precision. Its default value is 100 ms. -
hazelcast.diagnostics.operationthreadsamples.includeName
: Specifies whether the data structures' name pointed to by the operation (if available) should be included in the name of the samples. Its default value is false.
The following shows an example of the output when the property
hazelcast.diagnostics.operationthreadsamples.includeName
is false:
28-08-2019 07:40:07 1535442007330 OperationThreadSamples[
Partition[
com.hazelcast.map.impl.operation.MapSizeOperation=304623 85.6927%
com.hazelcast.map.impl.operation.PutOperation=33061 9.300304%
com.hazelcast.map.impl.operation.GetOperation=17799 5.0069904%]
Generic[
com.hazelcast.client.impl.ClientEngineImpl$PriorityPartitionSpecificRunnable=2308 35.738617%
com.hazelcast.nio.Packet=1767 27.361412%
com.hazelcast.internal.cluster.impl.operations.JoinRequestOp=821 12.712914%
com.hazelcast.spi.impl.operationservice.impl.operations.PartitionIteratingOperation=278 4.3047385%
com.hazelcast.internal.cluster.impl.operations.HeartbeatOp=93 1.4400743%
com.hazelcast.internal.cluster.impl.operations.OnJoinOp=89 1.3781357%
com.hazelcast.internal.cluster.impl.operations.WhoisMasterOp=75 1.1613503%
com.hazelcast.client.impl.operations.ClientReAuthOperation=33 0.51099414%]]
As can be seen above, the MapSizeOperations
run on the operation threads most of the time.
WanDiagnostics
The WAN diagnostics plugin provides information about the WAN replication.
It is disabled by default and can be configured using the following property:
-
hazelcast.diagnostics.wan.period.seconds
: The frequency this plugin is writing the collected information to the disk. 0 disables the plugin.
The following shows an example of the output:
10-11-2019 14:11:32 1510319492497 WanBatchSenderLatency[
targetClusterName[
[127.0.0.1]:5801[
count=1
totalTime(us)=2,010,567
avg(us)=2,010,567
max(us)=2,010,567
latency-distribution[
1638400..3276799us=1]]
[127.0.0.1]:5802[
count=1
totalTime(us)=1,021,867
avg(us)=1,021,867
max(us)=1,021,867
latency-distribution[
819200..1638399us=1]]]]