NodeA Cannot Establish TCP Connection (Port 7800 by Default) to NodeB
A TCP connection is used for asynchronous messaging. When NodeB can't send or receive asynchronous messages, the other nodes aren't notified about started or finished jobs, so a parent jobflow running on NodeA keeps waiting for the event from NodeB. A heart-beat is vital for meaningful load balancing; the same check-task mentioned above also checks the heart-beat from all Cluster nodes.
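To confirm whether the TCP connection itself is the problem, a quick probe run from NodeA can help. The following is a minimal sketch, not part of the product: the function name `can_connect` and the 3-second timeout are assumptions made for illustration, and port 7800 is only the default mentioned in the heading above.

```python
import socket


def can_connect(host: str, port: int = 7800, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper for illustration; 7800 is the default port
    named in this section, the timeout value is an assumption.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

Calling `can_connect("<NodeB host>")` from NodeA, with the real NodeB address substituted, returns `False` when the port is unreachable. A successful probe only shows that the port accepts connections; it does not prove that asynchronous messaging works end to end.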
Time-line describing the scenario:
0s network connection between NodeA and NodeB is down
60s NodeA uses the last available NodeB heart-beat
0-40s check-task running on NodeA detects missing heart-beat from NodeB
status of NodeA or NodeB (the one with shorter uptime) is changed to suspended
The following configuration properties set the time intervals mentioned above:
cluster.node.check.checkMinInterval
Periodicity of Cluster node checks, in milliseconds.
Default: 40000
cluster.node.sendinfo.interval
Periodicity of heart-beat messages, in milliseconds.
Default: 2000
cluster.node.sendinfo.min_interval
A heart-beat may occasionally be sent more often than specified by cluster.node.sendinfo.interval. This property specifies the minimum interval, in milliseconds.
Default: 500
cluster.node.remove.interval
The maximum interval for missing a heart-beat, in milliseconds.
Default: 50000
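The way these intervals interact can be modeled with a short sketch. This is an illustrative model only, not the server's implementation: the helper names `effective_settings` and `heartbeat_missing` are hypothetical, and the rule that a heart-beat counts as missing once its age strictly exceeds cluster.node.remove.interval is an assumption. The default values are copied from the list above.

```python
# Defaults copied from the property list above (all values in milliseconds).
DEFAULTS = {
    "cluster.node.check.checkMinInterval": 40000,
    "cluster.node.sendinfo.interval": 2000,
    "cluster.node.sendinfo.min_interval": 500,
    "cluster.node.remove.interval": 50000,
}


def effective_settings(overrides=None):
    """Merge user overrides (hypothetical input) over the documented defaults."""
    merged = dict(DEFAULTS)
    for key, value in (overrides or {}).items():
        if key not in DEFAULTS:
            raise KeyError(f"unknown property: {key}")
        merged[key] = int(value)  # property files store values as text
    return merged


def heartbeat_missing(last_heartbeat_ms, now_ms, settings):
    """Assumed rule: the heart-beat is missing once its age exceeds
    cluster.node.remove.interval (strict comparison is an assumption)."""
    age_ms = now_ms - last_heartbeat_ms
    return age_ms > settings["cluster.node.remove.interval"]
```

For example, with the defaults, a heart-beat last seen at time 0 is not yet treated as missing at 50000 ms in this model, but is at 50001 ms. Because the check-task runs periodically, the actual detection happens at the first check after that threshold.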