NodeB is Killed or It Cannot Connect to the Database
Access to the database is vital for running jobs, running the scheduler, and cooperating with other nodes. Touching the database is also used to detect dead processes. When the JVM process of NodeB is killed, it stops touching the database, and the other nodes can detect this.
Timeline describing the scenario:
0s-30s - the last touch on DB;
NodeB or its connection to the database is down;
90s - NodeA sees the last touch.
0-40s - a check-task running on NodeA detects an obsolete touch from NodeB;
the status of NodeB is changed to stopped, jobs running on NodeB are solved, which means that their status is changed to UNKNOWN, and the event is dispatched among the Cluster nodes. The job result is considered an error.
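The detection step in the timeline above can be sketched as follows. This is only an illustration, not the server's actual implementation: the class `TouchCheck`, the method `findObsoleteNodes`, and the in-memory map of last-touch timestamps are hypothetical names, and the real check task reads the touches from the database.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a check task; all names are illustrative only.
class TouchCheck {
    // Returns the IDs of nodes whose last touch is older than the interval
    // during which other nodes still accept it.
    static List<String> findObsoleteNodes(Map<String, Long> lastTouchMillis,
                                          long nowMillis,
                                          long forcedStopIntervalMillis) {
        List<String> obsolete = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastTouchMillis.entrySet()) {
            long age = nowMillis - entry.getValue();
            if (age > forcedStopIntervalMillis) {
                // In the scenario above, such a node would be marked as stopped
                // and its running jobs would be solved.
                obsolete.add(entry.getKey());
            }
        }
        return obsolete;
    }
}
```

A caller on NodeA would run this periodically and, for each returned node, change its status and solve its running jobs as described above.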
The following configuration properties set the time intervals mentioned above:
cluster.node.touch.interval
The periodicity of a database touch, in milliseconds.
Default: 20000
cluster.node.touch.forced_stop.interval
The interval during which the other nodes accept the last touch as valid, in milliseconds.
Default: 60000
cluster.node.check.checkMinInterval
The periodicity of Cluster node checks, in milliseconds.
Default: 40000
cluster.node.touch.forced_stop.solve_running_jobs.enabled
A boolean value that switches on or off the solving of running jobs mentioned above.
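To see how the three intervals combine, the following sketch estimates how long it may take from the failure of a node until another node detects it. It assumes the semantics described above: a node touches the database once per touch interval, the touch is accepted for the forced-stop interval, and checks run at a fixed period equal to the check interval. The class `DetectionWindow` and its method are hypothetical; only the property names and default values come from the list above.

```java
import java.util.Properties;

// Hypothetical helper; estimates the detection window from the three intervals.
class DetectionWindow {
    // Returns {minimum, maximum} delay in milliseconds between failure and detection.
    static long[] estimate(Properties props) {
        long touch = Long.parseLong(
                props.getProperty("cluster.node.touch.interval", "20000"));
        long accept = Long.parseLong(
                props.getProperty("cluster.node.touch.forced_stop.interval", "60000"));
        long check = Long.parseLong(
                props.getProperty("cluster.node.check.checkMinInterval", "40000"));

        // Best case: the node fails just before its next touch was due,
        // and a check runs as soon as the last touch is no longer accepted.
        long min = Math.max(0, accept - touch);
        // Worst case: the node fails right after touching,
        // and the last touch expires just after a check has run.
        long max = accept + check;
        return new long[] {min, max};
    }
}
```

Under these assumptions, the default values give a window of roughly 40 to 100 seconds between the failure and its detection. Lowering the intervals shortens the window at the cost of more frequent database access.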