Networking problem between compute node and head node

  • Question

  • I have an HPC job of 500+ tasks running in parallel. Some of the tasks showed up in the job manager as 'Failed', but they were actually still running on the compute nodes and eventually finished with the expected output.

    Is there a network problem between the compute nodes and the head node, and if so, how can I debug it further?

    Thanks

    Sam


    • Edited by Sam CG Monday, February 24, 2014 5:59 PM
    Monday, February 24, 2014 5:57 PM

All replies

  • Hi Sam,

    Some questions:

    1. Which HPC version are you using?

    2. Can you check the failure reason of the tasks? You can check it either in the UI (View Job --> View Tasks) or from the command line with job view [job id] /detailed or task view jobid.taskid /detailed.

    3. Alternatively, open Event Viewer on the head node and go to "Applications and Services Logs" --> Microsoft --> HPC.

    4. If you need to check the compute node logs, open the same log as in step 3 on the compute nodes.
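
With 500+ tasks, running the commands from step 2 by hand gets tedious. Below is a minimal Python sketch that builds the same command lines for one job and a list of failed task IDs and, if you want, runs them. It assumes the HPC command-line tools (job, task) are on the PATH of the machine where it runs. The function names are made up for illustration, and the /detailed switch is used exactly as written in step 2; adjust it if your HPC version expects a different form.

```python
import subprocess


def build_view_commands(job_id, task_ids):
    """Build the 'job view' and 'task view' invocations from step 2.

    Returns a list of argument lists, one per command. Nothing is executed.
    """
    # One command for the job as a whole.
    commands = [["job", "view", str(job_id), "/detailed"]]
    # One command per task, addressed as jobid.taskid.
    for task_id in task_ids:
        commands.append(["task", "view", f"{job_id}.{task_id}", "/detailed"])
    return commands


def run_commands(commands):
    """Run each command and return (command line, combined output) pairs.

    Assumption: the HPC client tools are installed and on PATH.
    """
    results = []
    for cmd in commands:
        proc = subprocess.run(cmd, capture_output=True, text=True)
        results.append((" ".join(cmd), proc.stdout + proc.stderr))
    return results
```

For example, run_commands(build_view_commands(1234, [17, 42])) would collect the detailed view of a hypothetical job 1234 and its tasks 17 and 42, so the failure reasons can be compared side by side.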

    Thanks

    -Zhonggang

    Monday, March 17, 2014 2:37 AM