Regarding Node disconnected due to network

    Question

  • Dear all,

    In our current HPC testing, we need to check the following scenario:

    "A compute node is executing a job, a long-running process P1 that writes to the database. The node gets disconnected from the computational cluster for a few seconds and then comes back. During the disconnection, the job is rescheduled to a different node and starts a separate process P2. Can we end up in a situation where both P1 and P2 try to write to the same database? At that point P1 is really a stale process, and it would be incorrect for it to write to the database."

    How can we avoid this situation? 

    Note: In our testing, the job contains a single task that invokes an executable. This executable is our long-running engine.

    Thanks,

    Puneet


    Puneet Sharma


    Wednesday, March 15, 2017 10:31 PM

All replies

  • From the HPC Pack side, a few things will help in your situation:
    1. If disconnections happen often and are usually recoverable, try increasing the "heartbeat setting" in the scheduler so that the scheduler keeps treating the node as alive.
    2. Fail fast: by default, our scheduler retries a task 3 times after a node failure or system failure. You can set this retry count to 0 so that the task is not retried.
    3. Make your task aware of whether it is running as a retry by checking the CCP_TASK_PREV_EXITCODE value, though this might not work for your case.
    4. Usually, our scheduler cancels the task during a node resync right after the node connection is recovered. Currently there is no task timeout on a disconnected compute node, which might be worth adding on our side (kill the process if the connection to the scheduler cannot be recovered within a certain time period).
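
For item 3, a rough sketch of how the engine could check for a retry run at startup. This assumes, without confirmation from the HPC Pack documentation, that CCP_TASK_PREV_EXITCODE is unset on the first run and holds an integer on a retry; the helper names `previous_exit_code` and `is_retry_run` are made up for illustration.

```python
import os
from typing import Mapping, Optional


def previous_exit_code(env: Mapping[str, str] = os.environ) -> Optional[int]:
    """Return the previous run's exit code, or None if it is absent or unparsable.

    ASSUMPTION: the scheduler sets CCP_TASK_PREV_EXITCODE to an integer
    only when the task is being re-run. Verify this on your cluster.
    """
    raw = env.get("CCP_TASK_PREV_EXITCODE")
    if raw is None or raw.strip() == "":
        return None
    try:
        return int(raw)
    except ValueError:
        # Unexpected format: treat it as "no usable information".
        return None


def is_retry_run(env: Mapping[str, str] = os.environ) -> bool:
    """True when a previous exit code is visible, i.e. this looks like a retry."""
    return previous_exit_code(env) is not None
```

A retry run could then take extra care before touching the database, for example by first invalidating whatever the earlier run may still hold.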
    More importantly, you need to handle this case in your own logic, for example:
    1. P1 and P2 should have a way to acquire ownership of the DB writer role through a shared resource.
    2. Your task should be able to time out.
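
One possible way to implement the writer ownership in point 1 is a fencing token kept in the database itself. This is only a minimal sketch, not an HPC Pack feature: it uses SQLite for illustration, and the table names (`writer_lease`, `results`) and function names are hypothetical. Each process bumps a per-job token when it starts, and every write is applied only if the writer still holds the newest token, so a stale P1 is rejected once P2 has started.

```python
import sqlite3


def init_schema(conn: sqlite3.Connection) -> None:
    """Create the (hypothetical) lease and result tables."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS writer_lease ("
        "job_id TEXT PRIMARY KEY, token INTEGER NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results ("
        "job_id TEXT, token INTEGER, payload TEXT)"
    )
    conn.commit()


def acquire_writer(conn: sqlite3.Connection, job_id: str) -> int:
    """Become the current writer for job_id and return the new token.

    Whoever calls this last owns the writer role; older tokens become stale.
    """
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR IGNORE INTO writer_lease (job_id, token) VALUES (?, 0)",
            (job_id,),
        )
        conn.execute(
            "UPDATE writer_lease SET token = token + 1 WHERE job_id = ?",
            (job_id,),
        )
        row = conn.execute(
            "SELECT token FROM writer_lease WHERE job_id = ?", (job_id,)
        ).fetchone()
    return row[0]


def guarded_write(conn: sqlite3.Connection, job_id: str, token: int, payload: str) -> bool:
    """Insert payload only if token is still the newest one for job_id.

    Returns True if the row was written, False if the caller is stale.
    """
    with conn:
        cur = conn.execute(
            "INSERT INTO results (job_id, token, payload) "
            "SELECT ?, ?, ? WHERE EXISTS ("
            "SELECT 1 FROM writer_lease WHERE job_id = ? AND token = ?)",
            (job_id, token, payload, job_id, token),
        )
    return cur.rowcount == 1


if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    init_schema(db)
    p1 = acquire_writer(db, "job-42")           # P1 starts on the first node
    p2 = acquire_writer(db, "job-42")           # job is rescheduled, P2 starts
    print(guarded_write(db, "job-42", p1, "stale write from P1"))
    print(guarded_write(db, "job-42", p2, "write from P2"))
```

The check and the insert happen in a single statement, so the stale process cannot slip a write in between reading the token and writing the row. For point 2, the engine could additionally exit on its own after a `guarded_write` call returns False.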


    Qiufang Shi

    Thursday, March 16, 2017 2:43 AM