
    NodeB is Killed or It Cannot Connect to the Database

    Access to the database is vital for running jobs, running the scheduler, and cooperating with other nodes. Touching the database is also used to detect dead processes. When the JVM process of NodeB is killed, it stops touching the database, and the other nodes can detect this.

    Timeline of the scenario:
    • 0s-30s last touch on DB

    • NodeB or its connection to the database is down

    • 90s NodeA sees the last touch

    • 0-40s check-task running on NodeA detects obsolete touch from NodeB

    • the status of NodeB is changed to stopped, and jobs running on NodeB are "solved", which means that their status is changed to UNKNOWN and the event is dispatched among the Cluster nodes. The job result is considered an error.
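The detection step in the timeline can be illustrated with a minimal sketch. It assumes that the check task simply compares the age of each node's last touch with the accepted interval. The function names and the in-memory dictionary of timestamps are hypothetical and are not part of the product.

```python
from __future__ import annotations


def is_touch_obsolete(last_touch_ms: int, now_ms: int, forced_stop_interval_ms: int) -> bool:
    """Return True when the last touch is older than the accepted interval."""
    return now_ms - last_touch_ms > forced_stop_interval_ms


def find_dead_nodes(last_touches: dict[str, int], now_ms: int, forced_stop_interval_ms: int) -> list[str]:
    """Return the sorted IDs of nodes whose last touch is obsolete.

    `last_touches` maps a node ID to its last touch timestamp in milliseconds
    (a hypothetical stand-in for the values stored in the database).
    """
    return sorted(
        node
        for node, last_touch_ms in last_touches.items()
        if is_touch_obsolete(last_touch_ms, now_ms, forced_stop_interval_ms)
    )


# NodeA touched the database 5 s ago, NodeB 80 s ago; 60 s is accepted.
touches = {"NodeA": 95_000, "NodeB": 20_000}
dead = find_dead_nodes(touches, now_ms=100_000, forced_stop_interval_ms=60_000)
print(dead)
```

In this sketch only NodeB is reported, because its last touch is older than the accepted 60 seconds.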

    The following configuration properties set the time intervals mentioned above:
    cluster.node.touch.interval

    Periodicity of a database touch, in milliseconds.

    Default: 20000

    cluster.node.touch.forced_stop.interval

    Interval during which the other nodes still accept the last touch as valid, in milliseconds.

    Default: 60000

    cluster.node.check.checkMinInterval

    Periodicity of Cluster node checks, in milliseconds.

    Default: 40000

    cluster.node.touch.forced_stop.solve_running_jobs.enabled

    A boolean value that enables or disables the solving of running jobs described above.
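To see how the three intervals combine, the following sketch estimates how long a node failure may go undetected. It uses a simplified model that is an assumption, not the documented algorithm: the node fails at some point after its last touch, the touch becomes obsolete once the accepted interval has passed, and the check task notices it at its next run. The function name is hypothetical, and real timing may differ from this estimate.

```python
# Default values listed above, in milliseconds.
TOUCH_INTERVAL_MS = 20_000        # cluster.node.touch.interval
FORCED_STOP_INTERVAL_MS = 60_000  # cluster.node.touch.forced_stop.interval
CHECK_MIN_INTERVAL_MS = 40_000    # cluster.node.check.checkMinInterval


def detection_window_ms(touch_ms: int, forced_stop_ms: int, check_ms: int) -> tuple:
    """Return (earliest, latest) delay between a node failure and its detection.

    Assumed model: the last touch happened at most `touch_ms` before the
    failure, and the check task runs at most `check_ms` after the touch
    has become obsolete.
    """
    earliest = max(forced_stop_ms - touch_ms, 0)  # failed just before the next touch
    latest = forced_stop_ms + check_ms            # failed right after a touch, check just missed
    return earliest, latest


window = detection_window_ms(TOUCH_INTERVAL_MS, FORCED_STOP_INTERVAL_MS, CHECK_MIN_INTERVAL_MS)
print(window)
```

Under these assumptions, lowering the touch and check intervals shortens the window at the cost of more frequent database access.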