Long-Term Network Malfunction May Cause Jobs to Hang
A jobflow or master execution that runs child jobs on other Cluster nodes must be notified about status changes of its child jobs. When the asynchronous messaging doesn't work, events from the child jobs aren't delivered and the parent jobs keep running. When the network starts working again, the child job events may be re-transmitted, so hung parent jobs may still finish. However, the network malfunction may last so long that the events can no longer be re-transmitted.
Consider the following timeline when choosing a proper configuration:
- Job A running on NodeA executes job B running on NodeB.
- The network between NodeA and NodeB goes down for some reason.
- Job B finishes and sends the finished event; however, the event can't be delivered to NodeA, so it is stored in the sent events buffer.
- Since the network is down, heart-beats can't be delivered either, and HTTP connections may not be established, so the Cluster reacts as described in the sections above. Even though the nodes may be suspended, parent job A keeps waiting for the event from job B.
- Now, there are three possibilities:
  - The network finally starts working, and since all undelivered events are in the sent events buffer, they are re-transmitted and eventually delivered. Parent job A is notified and proceeds. It may still fail later, since some Cluster nodes may be suspended.
  - The network finally starts working, but the number of events sent during the malfunction exceeded the size limit of the sent events buffer, so some messages are lost and won't be re-transmitted. The buffer size limit should therefore be raised in environments with an unreliable network; see the sizing sketch after this list. The default limit is 10,000 events, which should be sufficient for thousands of simple job executions; the exact capacity depends on the number of job phases, since each job execution produces at least 3 events (job started, phase finished, job finished). Note that other events are also fired occasionally (configuration changes, suspending, resuming, cache invalidation), and the messaging layer itself stores its own messages in the buffer, but their number is negligible (tens of messages per hour). Heart-beats are not stored in the buffer. There is also an inbound events buffer used as temporary storage for events, so that events can be delivered in the correct order when some of them can't be delivered at the moment. When a Cluster node is inaccessible, its inbound buffer is released after a timeout, which is set to 1 hour by default.
  - Node B is restarted, so all undelivered events in its buffer are lost.
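For a rough capacity check of the sent events buffer, the event counts above can be turned into a short back-of-the-envelope calculation. This is a minimal sketch; the outage length and job mix below are illustrative assumptions, not measured figures:

    events per job execution ≈ 2 + number of phases
        (one job started, one phase finished per phase, one job finished)
    events during an outage  ≈ job executions started × events per job execution

    Example (assumed): 4,000 single-phase jobs start during one outage
        4,000 × 3 = 12,000 events > 10,000 (default buffer limit),
        so some events would be lost and never re-transmitted

    Heap cost of a larger buffer ≈ buffer limit × 2 kB per stored message
        e.g. 50,000 events × 2 kB ≈ 100 MB of heap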
The following configuration properties set the limits and time intervals mentioned above:
cluster.jgroups.protocol.NAKACK.gc_lag
    Limits the size of the sent events buffer. Note that each stored message takes 2 kB of heap memory.
    Default: 10000
cluster.jgroups.protocol.NAKACK.xmit_table_obsolete_member_timeout
    The timeout after which the inbound events buffer of an inaccessible Cluster node is released; 1 hour by default.
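As a concrete illustration, both properties can be overridden in the Server configuration properties file. This is a minimal sketch for an environment with an unreliable network: the property names are the ones documented above, while the chosen values are assumptions to be tuned per deployment, and the timeout value is assumed to be in milliseconds:

    # Raise the sent events buffer limit from the default 10,000 events;
    # each stored message takes about 2 kB of heap (50,000 events ≈ 100 MB).
    cluster.jgroups.protocol.NAKACK.gc_lag=50000

    # Keep the inbound events buffer of an inaccessible node for 2 hours
    # instead of the default 1 hour (value assumed to be in milliseconds).
    cluster.jgroups.protocol.NAKACK.xmit_table_obsolete_member_timeout=7200000

A larger sent events buffer trades heap memory for tolerance of longer outages; the inbound buffer timeout should exceed the longest network malfunction the Cluster is expected to survive without restarting nodes.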