Job Submission Test Failed

  • Question

  • Hello,
     
    When I run Diagnostics on my cluster, the Job Submission Test always fails with this error message:

    Job Failed: Failed to start on node HPC002. Error: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond 192.168.0.2:1856

    Where should I look for the cause? This is the only test that fails. Thanks in advance.

    Monday, November 24, 2008 9:13 AM

Answers

  • Hi Bubu,

    It looks like the scheduler is unable to reach the node HPC002. Since the other diagnostic tests succeed, that rules out several possible causes. Some other things you can check:

    1) Make sure there isn't an issue with the firewall. Either check the results of the Firewall Configuration Report diagnostic to confirm that node HPC002 has the same firewall settings as the other compute nodes in your cluster, or temporarily turn off the firewall on HPC002 and see whether the Job Submission Test then succeeds.

    2) Can you run jobs that you submit by other means (that is, not through the diagnostic test), or do they fail with the same error? If every job that tries to run on HPC002 fails with this error, try restarting the job scheduler service: on the command line on the head node, run "sc stop HPCScheduler" and then "sc start HPCScheduler".
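
    To help tell a firewall or network problem apart from a scheduler problem, you could also test the TCP connection directly from the head node. The following is only a minimal sketch, assuming Python is available there; the address and port are copied from the error message above, and the helper name can_connect is made up for this illustration.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Assumption: address and port are taken from the error text in the question.
    node, port = "192.168.0.2", 1856
    status = "reachable" if can_connect(node, port) else "NOT reachable"
    print(f"{node}:{port} is {status}")
```

    If the port is not reachable with the firewall on HPC002 enabled but becomes reachable once it is disabled, that points to the firewall settings rather than the scheduler service.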

    Ann



    Tuesday, November 25, 2008 7:00 PM