CP Member Shutdown
Please read this part carefully to notice the behavioral difference in the CP member shutdown process when CP Subsystem Persistence is enabled and disabled. |
There is a significant behavioral difference during the CP member shutdown when CP
Subsystem Persistence is enabled and disabled. When disabled (the default mode
in which CP Subsystem works only in memory), a shutting down CP member is
replaced with other available CP members in all of its CP groups in order not
to decrease or more importantly not to lose majorities of CP groups. It is
because CP members keep their local state only in memory when CP Subsystem
Persistence is disabled, hence a shut-down CP member cannot join back with its
CP identity and state, hence it is better to remove it from CP Subsystem to not
to harm availability of CP groups. If there is no other available CP member to
replace a shutting down CP member in a CP group, that CP group’s size is
reduced by 1 and its majority value is recalculated. On the other hand, when CP
Subsystem Persistence is enabled, a shut-down CP member can come back by
restoring its CP state. Therefore, it is not automatically removed from CP
Subsystem when CP Subsystem Persistence is enabled. It is up to you to
remove shut-down CP members via
CPSubsystemManagementService.removeCPMember(String)
if they will not come
back.
In summary, CP member shutdown behavior is as follows:
-
When CP Subsystem Persistence is disabled (the default mode), shutting down CP members are removed from CP Subsystem and the CP group majority values are recalculated.
-
When CP Subsystem Persistence is enabled, shutting down CP members are still kept in CP Subsystem so they will be a part of the CP group majority calculations.
Moreover, there is a subtle point about concurrent shutdown of CP members when
CP Subsystem Persistence is disabled. If there are N
CP members in CP
Subsystem, HazelcastInstance.shutdown()
can be called on N-2
CP members
concurrently. Once these N-2
CP members complete their shutdown,
the remaining 2
CP members must be shut down serially. Even though
the shutdown API can be called concurrently on multiple members, the METADATA
CP group handles shutdown requests serially. Therefore, it would be simpler to
shut down CP members one by one, by calling HazelcastInstance.shutdown()
on
the next CP member once the current CP member completes its shutdown. This rule
does not apply when CP Subsystem Persistence is enabled so you can shut down
your CP members concurrently if you enabled CP Subsystem Persistence. It is
enough for users to recall this rule while shutting down CP members when CP
Subsystem Persistence is disabled. If interested, you can read the rest of this
paragraph to learn the reasoning behind this rule. Each shutdown request
internally requires a Raft commit to the METADATA CP group when CP Subsystem
Persistence is disabled. A CP member proceeds to shutdown after it receives
a response of this commit. To be able to perform a Raft commit, the METADATA
CP group must have its majority up and running. When only 2 CP members are
left after graceful shutdowns, the majority of the METADATA CP group becomes
2. If the last 2 CP members shut down concurrently, one of them is likely to
perform its Raft commit faster than the other one and leave the cluster before
the other CP member completes its Raft commit. In this case, the last CP member
waits for a response of its commit attempt on the METADATA CP group, and
times out eventually. This situation causes an unnecessary delay on the shutdown
process of the last CP member. On the other hand, when the last 2 CP members
shut down serially, the N-1
th member receives the response of its commit
after its shutdown request is committed also on the last CP member. Then,
the last CP member checks its local data to notice that it is the last CP
member alive, and proceeds its shutdown without attempting a Raft commit on
the METADATA CP group.