How do I handle nodes dropping out?

  • Question

  • I am set up to use 8 nodes for a job. Is there a way to set the job up so that if a server goes down, it continues on 7 nodes? I have max nodes set to 8 and min nodes set to, say, 6. That should mean that if 2 servers drop out the job still runs, correct? That's not what is happening. Instead, the job fails.
    Wednesday, September 19, 2012 9:12 PM

All replies

  • I doubt this functionality is provided by the cluster manager. Once one node is down, the job fails. One solution would be to split the job into a few smaller jobs, each running on 2-3 nodes. Of course, you would need to implement a controller that dispatches tasks to each job.
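
    A minimal sketch of such a controller, assuming each small job (a "group" of 2-3 nodes) can be driven through a caller-supplied function. The names `run_with_failover`, `run_task`, and the use of `ConnectionError` to signal a dead group are hypothetical and not part of any HPC API; a real controller would submit work through the scheduler instead.

```python
from collections import deque


def run_with_failover(tasks, groups, run_task):
    """Dispatch tasks round-robin over node groups.

    tasks    -- hashable task identifiers (hypothetical)
    groups   -- identifiers of the small jobs / node groups (hypothetical)
    run_task -- callable(group, task); assumed to raise ConnectionError
                when the group's nodes are down
    Returns (results, surviving_groups).
    """
    pending = deque(tasks)
    alive = list(groups)
    results = {}
    turn = 0
    while pending:
        if not alive:
            raise RuntimeError("all node groups have failed")
        group = alive[turn % len(alive)]
        task = pending.popleft()
        try:
            results[task] = run_task(group, task)
            turn += 1
        except ConnectionError:
            # The group is down: stop using it and retry the task elsewhere.
            alive.remove(group)
            pending.appendleft(task)
    return results, alive
```

    With this layout, losing one group only costs the task that was in flight on it; the remaining groups pick up the rest of the queue.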

    How do you send your job to the cluster? clusrun, job submit, etc.?

    Daniel Drypczewski

    Thursday, September 20, 2012 8:03 AM
  • I am sending the jobs using JobSubmit.  

    I was thinking there should be some way of making this work, since there are Min. and Max. settings for the resources. As long as the Min. number of servers is available, I thought it would still run... just on fewer machines.

    Any other ideas would be appreciated.

    Thursday, September 20, 2012 12:45 PM
  • I understand min/max as "allocate at least the minimum number of nodes (6) but not more than the maximum (8) for my job and run it"; otherwise, report a resource error. I think this problem is related to the underlying protocol. When a node fails and goes down, the cluster manager gets an error (or maybe a connection timeout occurs) and basically shuts down the connections to the remaining nodes.
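
    The reading of min/max given above can be sketched as a check made once, when the job starts. This models only the interpretation in this reply, not the scheduler's documented behavior; the function name `allocate` and the error raised are hypothetical.

```python
def allocate(available, min_nodes, max_nodes):
    """Pick a node count at job start, under the interpretation above.

    The check happens once, at allocation time; nothing here reacts
    to a node that fails after the job is already running.
    """
    if available < min_nodes:
        raise RuntimeError(
            f"need at least {min_nodes} nodes, only {available} available"
        )
    # Take as many nodes as are free, capped at the maximum.
    return min(available, max_nodes)
```

    Under this reading, a job with min 6 and max 8 starts on 7 nodes if only 7 are free, but the range says nothing about surviving a failure later on.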

    I can see a more general problem here.

    Say you allocate 100 cores for one job and another 100 cores for another job. After some time, the first job has processed most of its tasks and doesn't need all 100 cores anymore: some cores do no work but cannot be freed until all tasks are processed. The second job could reuse some of the first job's cores and finish faster. If 80% of the tasks take a few minutes to compute and 20% have to run for 1 hour, all the cores are locked for 1 hour.
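
    To put a rough number on the example above, here is a small calculation. It assumes one task per core and takes "a few minutes" to mean 5 minutes; both are assumptions made for illustration, and `locked_idle_fraction` is a hypothetical helper.

```python
def locked_idle_fraction(cores, short_share, short_min, long_min):
    """Fraction of reserved core time spent idle while the job holds all cores.

    Assumes one task per core and that no core is released until the
    longest task finishes.
    """
    short_cores = round(cores * short_share)
    long_cores = cores - short_cores
    busy = short_cores * short_min + long_cores * long_min
    reserved = cores * max(short_min, long_min)
    return 1 - busy / reserved


# 100 cores, 80% of tasks take 5 min (assumed), 20% take 60 min
print(f"{locked_idle_fraction(100, 0.8, 5, 60):.1%} of reserved core time is idle")
```

    Under these assumptions, 80 cores sit idle for 55 of the 60 minutes, so roughly three quarters of the reserved core time does no work.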

    Maybe the HPC developers could explain a bit more about node management when one node fails.

    Daniel Drypczewski

    Friday, September 21, 2012 3:42 AM