Auto-Resuming in an Unreliable Network

In version 4.4, auto-resuming of suspended nodes was introduced.

Timeline of the scenario:
  • NodeB is suspended after a connection loss.

  • 0 s: NodeA successfully reestablishes the connection to NodeB.

  • 120 s: NodeA changes the status of NodeB to forced_resume.

  • NodeB attempts to resume itself if the maximum auto-resume count has not been reached.

  • If the connection is lost again, the cycle repeats. Once the maximum auto-resume count is exceeded, the node remains suspended until the counter is reset, which prevents endless suspend-resume cycles.

  • 240 min: The auto-resume counter is reset.
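
The cycle above can be modeled as a small state machine. The following Python sketch is only an illustration of the documented behavior, not the server's implementation: the class SuspendedNode, its method names, and the "ready" status are hypothetical, and the sketch assumes that a node may resume as long as its attempt count is below the configured maximum. The moment from which the reset interval is measured is not stated here, so the reset is modeled as an explicit event.

```python
from dataclasses import dataclass

# Defaults as documented for the cluster.node.check.* properties.
INTERVAL_BEFORE_AUTORESUME_MS = 120_000  # intervalBeforeAutoresume
MAX_AUTORESUME_COUNT = 3                 # maxAutoresumeCount


@dataclass
class SuspendedNode:
    """Hypothetical model of NodeB; names are illustrative only."""
    status: str = "suspended"
    autoresume_count: int = 0

    def mark_forced_resume(self, reachable_ms: int) -> None:
        """Peer side (NodeA): flag the node once it has been reachable long enough."""
        if self.status == "suspended" and reachable_ms >= INTERVAL_BEFORE_AUTORESUME_MS:
            self.status = "forced_resume"

    def try_auto_resume(self) -> bool:
        """Node side (NodeB): resume unless the attempt limit is used up."""
        if self.status != "forced_resume":
            return False
        if self.autoresume_count >= MAX_AUTORESUME_COUNT:
            self.status = "suspended"  # stays suspended until the counter is reset
            return False
        self.autoresume_count += 1
        self.status = "ready"  # assumed name for the normal operating status
        return True

    def connection_lost(self) -> None:
        """The node is suspended again after another connection loss."""
        self.status = "suspended"

    def reset_counter(self) -> None:
        """Models the periodic reset (intervalResetAutoresumeCount)."""
        self.autoresume_count = 0
```

With the default limit of 3, three suspend-resume cycles succeed in this model, a fourth attempt leaves the node suspended, and the node can resume again only after reset_counter() is called.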

The following configuration properties set the time intervals and the attempt limit mentioned above:
cluster.node.check.intervalBeforeAutoresume

How long a node must be accessible before it is forcibly resumed, in milliseconds.

Default: 120000

cluster.node.check.maxAutoresumeCount

How many times a node may try to auto-resume itself.

Default: 3

cluster.node.check.intervalResetAutoresumeCount

Time before the auto-resume counter is reset, in minutes.

Default: 240
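
Note that the two intervals use different units (milliseconds and minutes). The following Python sketch shows one way to read the three properties and normalize them; it assumes the properties are available as plain key=value lines, and the helper names parse_properties and autoresume_settings are hypothetical. Only the property names and defaults come from this section.

```python
from datetime import timedelta

# Documented defaults; note the mixed units.
DEFAULTS = {
    "cluster.node.check.intervalBeforeAutoresume": "120000",   # milliseconds
    "cluster.node.check.maxAutoresumeCount": "3",              # attempts
    "cluster.node.check.intervalResetAutoresumeCount": "240",  # minutes
}


def parse_properties(text):
    """Parse simple key=value lines, falling back to the documented defaults."""
    props = dict(DEFAULTS)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def autoresume_settings(props):
    """Convert the raw values into unit-safe Python objects."""
    return {
        "accessible_for": timedelta(
            milliseconds=int(props["cluster.node.check.intervalBeforeAutoresume"])),
        "max_attempts": int(props["cluster.node.check.maxAutoresumeCount"]),
        "reset_after": timedelta(
            minutes=int(props["cluster.node.check.intervalResetAutoresumeCount"])),
    }


# Example: override only the attempt limit and keep the default intervals.
settings = autoresume_settings(
    parse_properties("cluster.node.check.maxAutoresumeCount=5"))
```

Converting both intervals to timedelta values makes it harder to confuse the 120000 ms accessibility interval with the 240 min reset interval.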