NodeA Cannot Establish TCP Connection (Port 7800 by Default) to NodeB
The TCP connection is used for asynchronous messaging. When NodeB can't send or receive asynchronous messages, the other nodes aren't notified about started or finished jobs, so a parent jobflow running on NodeA keeps waiting for the event from NodeB. A heart-beat is vital for meaningful load-balancing; the same check-task mentioned above also checks the heart-beat from all Cluster nodes.
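When diagnosing this situation, it can help to confirm from NodeA whether a plain TCP connection to NodeB's messaging port can be opened at all. The following is a minimal sketch using only the Python standard library. It is not part of the product; the function name can_connect and the 5-second timeout are assumptions made for illustration, and 7800 is the default port named in the heading.

```python
import socket


def can_connect(host: str, port: int = 7800, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    Hypothetical helper for manual troubleshooting; it only tests reachability
    of the port and says nothing about the messaging layer itself.
    """
    try:
        # create_connection resolves the host name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

Run on NodeA, a call such as can_connect("nodeb-host") with NodeB's actual host name in place of the placeholder returns False when the port is blocked or NodeB is unreachable. A True result shows only that the port accepts connections, not that asynchronous messaging works.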
Time-line describing the scenario:
0s - the network connection between NodeA and NodeB is down;
60s - NodeA uses the last available NodeB heart-beat;
0-40s - a check-task running on NodeA detects the missing heart-beat from NodeB;
the status of NodeA or NodeB (the one with shorter uptime) is changed to suspended.
The following configuration properties set the time intervals mentioned above:
cluster.node.check.checkMinInterval
The periodicity of Cluster node checks, in milliseconds.
Default: 40000
cluster.node.sendinfo.interval
The periodicity of heart-beat messages, in milliseconds.
Default: 2000
cluster.node.sendinfo.min_interval
A heart-beat may occasionally be sent more often than specified by cluster.node.sendinfo.interval. This property specifies the minimum interval in milliseconds.
Default: 500
cluster.node.remove.interval
The maximum interval for missing a heart-beat, in milliseconds.
Default: 50000
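To make the relationship between these intervals concrete, the sketch below collects the documented defaults and applies them in a simple staleness check. It is an illustration, not the product's implementation: the function names are hypothetical, and it assumes that a heart-beat counts as missing once it is older than cluster.node.remove.interval.

```python
# Default values from the property list above, in milliseconds.
DEFAULTS = {
    "cluster.node.check.checkMinInterval": 40000,
    "cluster.node.sendinfo.interval": 2000,
    "cluster.node.sendinfo.min_interval": 500,
    "cluster.node.remove.interval": 50000,
}


def heartbeat_missing(last_heartbeat_ms: int, now_ms: int,
                      remove_interval_ms: int = DEFAULTS["cluster.node.remove.interval"]) -> bool:
    """Assumed rule: the heart-beat is missing once its age exceeds the remove interval."""
    return now_ms - last_heartbeat_ms > remove_interval_ms


def regular_heartbeats_per_remove_interval(props: dict = DEFAULTS) -> int:
    """How many regularly scheduled heart-beats fit into one remove interval."""
    return props["cluster.node.remove.interval"] // props["cluster.node.sendinfo.interval"]
```

With the defaults, 25 regularly scheduled heart-beats fit into one remove interval (50000 / 2000). Under the stated assumption, a single delayed or lost heart-beat would therefore not by itself mark a node as missing.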