NodeB is Killed or It Cannot Connect to the Database
Access to the database is vital for running jobs, running the scheduler, and cooperating with other nodes. Touching the database is also used to detect dead processes. When the JVM process of NodeB is killed, it stops touching the database, and the other nodes can detect this.
Time-line describing the scenario:
0s-30s - the last touch on DB;
NodeB or its connection to the database is down;
90s - NodeA sees the last touch.
0-40s - a check-task running on NodeA detects an obsolete touch from NodeB;
the status of NodeB is changed to stopped, and jobs running on NodeB are solved, which means that their status is changed to UNKNOWN and the event is dispatched among the Cluster nodes. The job result is considered as error.
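The obsolete-touch detection described in the time-line can be illustrated with a minimal sketch. It is not the server's actual implementation: the class name NodeTouchCheck and the method isTouchObsolete are hypothetical, and the only value taken from this page is the 60000 ms default of cluster.node.touch.forced_stop.interval.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical sketch of an obsolete-touch check; not the actual server code. */
public class NodeTouchCheck {

    /**
     * Returns true when the last database touch of a node is older than the
     * interval in which the other nodes still accept it.
     */
    public static boolean isTouchObsolete(Instant lastTouch, Instant now, Duration acceptInterval) {
        return Duration.between(lastTouch, now).compareTo(acceptInterval) > 0;
    }

    public static void main(String[] args) {
        // Default of cluster.node.touch.forced_stop.interval, in milliseconds.
        Duration acceptInterval = Duration.ofMillis(60_000);
        // Arbitrary example timestamp standing in for NodeB's last touch.
        Instant lastTouch = Instant.parse("2024-01-01T00:00:00Z");

        // 30 s after the last touch: still accepted.
        System.out.println(isTouchObsolete(lastTouch, lastTouch.plusSeconds(30), acceptInterval)); // false
        // 90 s after the last touch: obsolete, so NodeB would be considered stopped.
        System.out.println(isTouchObsolete(lastTouch, lastTouch.plusSeconds(90), acceptInterval)); // true
    }
}
```

In this sketch, a check-task on NodeA would call such a method periodically for every other node and mark a node as stopped once the method returns true.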
The following configuration properties set the time intervals mentioned above:
cluster.node.touch.interval - The periodicity of a database touch, in milliseconds.
Default: 20000
cluster.node.touch.forced_stop.interval - The interval in which the other nodes accept the last touch, in milliseconds.
Default: 60000
cluster.node.check.checkMinInterval - The periodicity of Cluster node checks, in milliseconds.
Default: 40000
cluster.node.touch.forced_stop.solve_running_jobs.enabled - A boolean value that enables or disables the solving of running jobs mentioned above.
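As an illustration, the properties above could be written in a properties-style configuration as follows. The three interval values are the defaults listed on this page. The value shown for the boolean property is only an assumption for the example, since its default is not stated here, and where the properties are actually set depends on the installation.

```properties
# Periodicity of a database touch, in milliseconds (default from this page).
cluster.node.touch.interval=20000

# Interval in which the other nodes accept the last touch, in milliseconds (default from this page).
cluster.node.touch.forced_stop.interval=60000

# Periodicity of Cluster node checks, in milliseconds (default from this page).
cluster.node.check.checkMinInterval=40000

# Assumed value for illustration only; the default is not given on this page.
cluster.node.touch.forced_stop.solve_running_jobs.enabled=true
```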