Auto-Resuming in Unreliable Network
In version 4.4, auto-resuming of suspended nodes was introduced.
Timeline of the scenario:
- NodeB is suspended after connection loss;
- 0s - NodeA successfully reestablishes the connection to NodeB;
- 120s - NodeA changes the NodeB status to forced_resume;
- NodeB attempts to resume itself if the maximum auto-resume count is not reached;
- If the connection is lost again, the cycle repeats; if the maximum auto-resume count is exceeded, the node remains suspended until the counter is reset, to prevent suspend-resume cycles;
- 240m - the auto-resume counter is reset.
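
The timeline above can be modelled roughly as a small state machine. The sketch below is only an illustration, not the server's implementation: the class name, the method name, and the "resumed" status label are hypothetical. It also assumes that the reset window opens at the first auto-resume, which the timeline does not state explicitly. The default values mirror the 120 s, 3 attempts, and 240 min used in the scenario.

```python
class AutoResumeModel:
    """Toy model of the suspend/auto-resume cycle (hypothetical, for illustration)."""

    def __init__(self, interval_before_s=120, max_count=3, reset_after_s=240 * 60):
        self.interval_before_s = interval_before_s  # reachable time required before resume
        self.max_count = max_count                  # allowed auto-resume attempts
        self.reset_after_s = reset_after_s          # time until the counter is cleared
        self.status = "suspended"                   # NodeB starts suspended after connection loss
        self.resume_count = 0
        self.counter_started_at = None              # assumption: set at the first auto-resume

    def tick(self, now_s, reachable_since_s):
        """Advance the model to now_s.

        reachable_since_s is the time the connection was last re-established,
        or None while the node is unreachable.
        """
        # Clear the counter once the reset interval has elapsed.
        if (self.counter_started_at is not None
                and now_s - self.counter_started_at >= self.reset_after_s):
            self.resume_count = 0
            self.counter_started_at = None

        # A lost connection suspends the node again.
        if reachable_since_s is None:
            self.status = "suspended"
            return self.status

        # Resume only after the node has been reachable long enough
        # and only while attempts remain.
        reachable_long_enough = now_s - reachable_since_s >= self.interval_before_s
        if (self.status == "suspended" and reachable_long_enough
                and self.resume_count < self.max_count):
            self.resume_count += 1
            if self.counter_started_at is None:
                self.counter_started_at = now_s
            self.status = "resumed"
        return self.status
```

Calling `tick` repeatedly with alternating reachable and unreachable periods shows the intended effect: after the third auto-resume the model stays suspended, however long the node is reachable, until the reset interval has passed.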
The following configuration properties set the time intervals mentioned above:
cluster.node.check.intervalBeforeAutoresume
-
Time for which a node must remain accessible before it is forcibly resumed, in milliseconds.
Default: 120000
cluster.node.check.maxAutoresumeCount
-
How many times a node may try to auto-resume itself.
Default: 3
cluster.node.check.intervalResetAutoresumeCount
-
Time after which the auto-resume counter is reset, in minutes.
Default: 240
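
As an illustration, the three properties could be set to their documented defaults as shown below. This assumes a Java-style key=value properties file. Which configuration file holds these properties depends on the installation and is not specified here. Note the differing units: the first value is in milliseconds, the last in minutes.

```properties
# Node must stay reachable for 120 s (value in milliseconds) before a forced resume
cluster.node.check.intervalBeforeAutoresume=120000
# At most 3 auto-resume attempts before the node stays suspended
cluster.node.check.maxAutoresumeCount=3
# Auto-resume counter is reset after 240 minutes
cluster.node.check.intervalResetAutoresumeCount=240
```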