Running a service when no other jobs are available

  • Question

  • I have 4 nodes in my cluster that I want to devote to distributed code builds when there is nothing else to run.

    I expected to start a job made up of lowest-priority service tasks, which would then simply run on every available node. Unfortunately, I have to keep an eye on it because there are a few problems:

    •  When I start the job, only the nodes that are online at that time get a task; nodes that come online later don't get one. I select the nodes by node group.
    •  When a node loses its network connection, after a certain time it seems to be abandoned and is never reconnected.

    It looks to me as though all the tasks are calculated once, when I submit the job, based on the state of the cluster at that moment, whereas I want the job to keep adding or refreshing tasks. Is there any way to do what I want?

    Friday, October 5, 2012 9:32 AM

All replies

  • You could write a script that periodically checks the status of your cluster (which nodes are online or offline, which nodes you can connect to, etc.). Something like:

    job list -> list currently running jobs (if any)

    node list -> which nodes are currently online/offline

    node listcores -> which cores are free/idle

    node online computername -> change state of this node (computername) to online

    If no jobs are running and some nodes/cores are idle, then

    job submit parallel_build.exe or job submit mpiexec parallel_build.exe
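
    The steps above could be sketched as a small polling script. This is only a rough illustration, not a tested solution: it calls the same `job list`, `node listcores`, and `job submit` commands listed above, but the way it reads their output is an assumption (it just counts lines containing the words "Running" and "Idle"), so the matching would need to be adapted to what your cluster actually prints. The names `run_cli`, `count_matching_lines`, `poll_once`, and `watch` are made up for this example, as is the 60-second interval.

```python
import subprocess
import time
from typing import Callable, List

# A runner takes a command line (list of strings) and returns its text output.
Runner = Callable[[List[str]], str]


def run_cli(args: List[str]) -> str:
    """Run a command-line tool and return its standard output as text."""
    result = subprocess.run(args, capture_output=True, text=True, check=True)
    return result.stdout


def count_matching_lines(output: str, keyword: str) -> int:
    """Count output lines that contain keyword, ignoring case.

    ASSUMPTION: one line per job or core, with its state spelled out as a
    word on that line. Adjust this to the real output of your cluster.
    """
    needle = keyword.lower()
    return sum(1 for line in output.splitlines() if needle in line.lower())


def poll_once(run: Runner, build_cmd: List[str]) -> bool:
    """Submit the build job if nothing is running and some cores are idle.

    Returns True if a job was submitted, False otherwise.
    """
    running_jobs = count_matching_lines(run(["job", "list"]), "Running")
    idle_cores = count_matching_lines(run(["node", "listcores"]), "Idle")
    if running_jobs == 0 and idle_cores > 0:
        run(["job", "submit"] + build_cmd)
        return True
    return False


def watch(interval_seconds: int = 60) -> None:
    """Check the cluster forever, sleeping interval_seconds between checks."""
    while True:
        poll_once(run_cli, ["parallel_build.exe"])
        time.sleep(interval_seconds)
```

    Calling `watch()` on the head node (or from any machine where the `job` and `node` commands work) would start the loop. Because the runner is passed in as a parameter, `poll_once` can be tried with a fake runner before pointing it at a real cluster.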


    Daniel Drypczewski

    Wednesday, October 10, 2012 8:08 AM