OpenMP Job stops after switching to different compute node

  • Question

  • I have an OpenMP application that can run on 2, 4, or 8 cores, and I request resources by number of cores. My long simulation job stops or crashes for no apparent reason, and I then realized that the job keeps switching to a different compute node every few days. The job's activity log shows the node changing during the computation and reports a false status of Job Finished, even though the job did not run to completion. Is there a way to make a job stay on a single node? If I submit the multi-core job with the Socket or Node resource type, the resources of the HPC cluster will be underutilized. Is there a better workaround for this problem?

    Activity Log (Sample):

    2/2/2011 11:16:04am   Created by {User}; Submitted; Started

    2/2/2011 11:16:04am   Started on NODE-006 with 2 cores

    2/3/2011 4:33:34PM   Ended on NODE-006

    2/3/2011 4:33:34PM   Job Canceled

    2/3/2011 4:33:35PM  Started

    2/3/2011 4:33:35PM   Started on NODE-013 with 2 cores

    2/7/2011 12:05:31AM   Ended on NODE-013

    2/7/2011 12:05:31AM  Job Finished  -  The job basically stops right here 


    Saturday, February 12, 2011 11:21 AM

All replies

  • Hi,

    Your job is probably being canceled during execution because it is being preempted by higher-priority jobs and the PreemptionType cluster parameter is set to Immediate. To prevent your job from being canceled, requeued, and restarted on another node, you can try:

     - marking your job as not preemptable (this can be done via an appropriate job template or by submitting your job via the .NET API),

     - changing the PreemptionType cluster parameter: 'cluscfg setparams PreemptionType=Graceful',

     - submitting your job with a higher priority.

    Your job can also be canceled at some point because a node is taken online or offline, or because a node becomes unreachable for a short period of time (due to network issues).
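
    To check whether a job went through this cancel, requeue, and restart cycle, the activity log can be scanned for consecutive "Started on" entries. The following is only a minimal sketch: it assumes the log has been copied into a string in the same format as the sample in the question, and the names SAMPLE_LOG, nodes_in_order and restarts are hypothetical helpers, not part of HPC Pack.

    ```python
    import re

    # Lines copied from the sample activity log in the question.
    SAMPLE_LOG = """\
    2/2/2011 11:16:04am   Started on NODE-006 with 2 cores
    2/3/2011 4:33:34PM   Ended on NODE-006
    2/3/2011 4:33:34PM   Job Canceled
    2/3/2011 4:33:35PM   Started on NODE-013 with 2 cores
    2/7/2011 12:05:31AM   Ended on NODE-013
    2/7/2011 12:05:31AM  Job Finished
    """

    # Matches entries such as "Started on NODE-006 with 2 cores".
    STARTED = re.compile(r"Started on (\S+) with (\d+) cores")


    def nodes_in_order(log_text):
        """Return the node name from every 'Started on <node>' entry, in log order."""
        return [match.group(1) for match in STARTED.finditer(log_text)]


    def restarts(log_text):
        """Return (previous_node, next_node) pairs, one per restart of the job."""
        nodes = nodes_in_order(log_text)
        return list(zip(nodes, nodes[1:]))


    if __name__ == "__main__":
        for old, new in restarts(SAMPLE_LOG):
            print(f"job restarted: {old} -> {new}")
    ```

    An empty result from restarts() would suggest the job ran on a single allocation from start to finish.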

    Hope this helps. Let me know if you have any other questions.

    Thank you,
    Łukasz

    Tuesday, February 15, 2011 5:32 PM
  • Hi Lukasz,

    Do you think there is a bug where Microsoft HPC reallocates the cluster resources and may assign an OpenMP job to 2 separate nodes? That is what I think makes the jobs stop, because of the memory access in OpenMP jobs.

    Currently, I assign serial jobs to the Core resource, and 2- or 4-core parallel jobs to the Socket resource. With this setup my jobs run to completion without switching nodes. All my jobs run on 2 cores at the lowest priority, so I don't think there is a preemption problem. As for adaptive resource allocation, do you think it is what causes the node switch in the middle of a simulation?

    Thank you,

    Teo

    Tuesday, February 15, 2011 10:47 PM
  • Windows Server 2012 with HPC Pack 2012 has a new feature to tie a job to a single node only. This problem is fixed in the HPC Pack 2012 release.

    Wednesday, February 13, 2013 10:16 PM
  • Maybe you could try setting affinity for your OpenMP threads so that the physical core is unchanged during the thread/process life cycle. Take a look at the functions SetProcessAffinityMask() and SetThreadAffinityMask(). (SetWindowDisplayAffinity(), despite its name, controls how a window's content can be captured and has nothing to do with CPU affinity.)
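
    As a rough illustration of the affinity idea, the mask passed to SetProcessAffinityMask() is a bit vector in which bit i stands for logical processor i. The sketch below builds such a mask and, on Windows only, applies it to the current process through ctypes. The helper names affinity_mask and pin_current_process are hypothetical, and the core numbers are just an example. Note that affinity only restricts where threads run within one node; it is not expected to stop the scheduler from requeuing the job on another node.

    ```python
    import sys


    def affinity_mask(cores):
        """Build an affinity bitmask: bit i set means logical processor i is allowed."""
        mask = 0
        for core in cores:
            if core < 0:
                raise ValueError("core index must be non-negative")
            mask |= 1 << core
        return mask


    def pin_current_process(cores):
        """Restrict the current process to the given cores and return the mask.

        The Win32 call is made only on Windows; on other platforms the mask is
        computed and returned without changing anything.
        """
        mask = affinity_mask(cores)
        if sys.platform == "win32":
            import ctypes
            from ctypes import wintypes

            kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
            # Declare signatures so handles and masks keep full pointer width.
            kernel32.GetCurrentProcess.restype = wintypes.HANDLE
            kernel32.SetProcessAffinityMask.argtypes = [wintypes.HANDLE, ctypes.c_size_t]
            kernel32.SetProcessAffinityMask.restype = wintypes.BOOL
            if not kernel32.SetProcessAffinityMask(kernel32.GetCurrentProcess(), mask):
                raise ctypes.WinError(ctypes.get_last_error())
        return mask


    if __name__ == "__main__":
        # Example: allow only logical processors 0 and 1.
        print(bin(pin_current_process([0, 1])))
    ```

    In a native OpenMP program the same mask would be passed to SetProcessAffinityMask() for the whole process, or to SetThreadAffinityMask() from inside each thread.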

    Daniel Drypczewski

    Friday, February 15, 2013 2:56 AM