help!! Scheduler is unresponsive to job submission

  • Question

  • Yesterday I didn't have any problems submitting jobs to the head node from a client PC.

    Today the client just hangs when I run a job submit from the command line. After some time it finally responds with:

    A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond

    Does anyone have any ideas on this?

    Thanks in advance.

    Thursday, October 8, 2015 4:47 PM

All replies

  • Can you run the job submit command on the head node?

    If the command can be run on the head node, please check the head node's firewall settings and confirm that an inbound rule "HPC Job Scheduler Service (TCP-In)" with local ports 5800, 5801, 5969, 5999 is enabled.

    If the command cannot be run on the head node either, try restarting the HPC Job Scheduler service.

    Friday, October 9, 2015 1:59 AM
  • Thank you for the feedback on this.

    I've done some more investigation, here are results:

    1. I can submit jobs from the scheduler (head node) without any issues.

    2. I can ping the head node from the client without issue.

    3. In the head node's firewall settings, the "HPC Job Scheduler Service (TCP-In)" inbound rule is marked as private and I'm unable to edit it. It only has one port enabled, port 5970.

    Should I be able to edit that rule?

    Should I manually create a new rule and add all the new ports? 

    Thanks again for the help with this.

    Luke
    • Edited by Lsagur Friday, October 30, 2015 1:23 PM
    Friday, October 30, 2015 1:22 PM
  • After changing the configuration on the head node to "do not manage firewall settings", I now see the rule in the firewall with all the ports you mentioned enabled.

    I still can't connect to the head node from the client PC though (although I can ping it).

    Are there some outbound rules required on the client PC? Currently I don't see anything in the firewall settings.

    Thanks,

    Luke

    Friday, October 30, 2015 1:56 PM
  • Hi,

    After setup, our system has usually configured the necessary firewall rules. Please double check:

    1. whether your client machine has joined the same domain as your headnode

    2. whether the version of your client matches your server version

    3. whether you are logged on to your client machine with a domain account

    4. whether the job manager GUI (HpcJobManager.exe) is able to connect to the headnode


    Qiufang Shi

    Monday, November 2, 2015 2:05 AM
  • Thank you Qiufang for all the suggestions.

    1. yes same domain.

    2. Client PC -> MS HPC Pack 2012 R2 Client Components (4.4.4864.0); Head Node -> MS HPC Pack 2012 Server Components (4.4.4864.0).

    3. yes domain account

    4. No -> There was a network problem or the server was disconnected. Please try connection again. Failed to connect to the following service(s) on the head node: scheduler service.

    I tried from another PC on the network and I'm getting the same result.

    What do you recommend as the next steps?

    Thanks,

    Luke

    Monday, November 2, 2015 12:37 PM
  • Looks like the configuration is okay. Please double check whether you can reach the scheduler port from the client machine; you can usually test this with telnet. The scheduler port is 5800, so for example: telnet hostname 5800
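
    If telnet is not installed on the client, the same reachability test can be sketched in Python. This is only an illustration, not part of HPC Pack: the helper names `port_is_reachable` and `check_ports` are made up for this example, and the port list is simply the one quoted earlier in this thread for the "HPC Job Scheduler Service (TCP-In)" rule.

```python
import socket

# Ports quoted earlier in this thread for the scheduler's inbound firewall rule.
SCHEDULER_PORTS = (5800, 5801, 5969, 5999)


def port_is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def check_ports(host, ports=SCHEDULER_PORTS, timeout=3.0):
    """Map each port to True/False depending on whether it accepted a connection."""
    return {port: port_is_reachable(host, port, timeout) for port in ports}
```

    Calling `check_ports("headnode")` from the client (with your own head node name or IPv4 address in place of the placeholder `"headnode"`) returns a dictionary of port to True/False. Note that a successful connection only shows that something is listening on the port, not that it is the scheduler service.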

    Please also check whether you've enabled IPv6 and whether your DNS resolves to IPv6 by default. Or you can simply try connecting to the scheduler directly by its IPv4 address from the command line, such as: job submit /scheduler:xxx.xxx.xxx.xxx hostname
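
    To see which address family the client's resolver returns first for the head node name, a small Python sketch like the following may help. The function name `resolve_in_order` is hypothetical; it only wraps the standard `socket.getaddrinfo` call and makes no HPC Pack calls.

```python
import socket

FAMILY_NAMES = {socket.AF_INET: "IPv4", socket.AF_INET6: "IPv6"}


def resolve_in_order(host):
    """Return (family, address) pairs in the order the resolver reports them."""
    results = []
    for family, _type, _proto, _canon, sockaddr in socket.getaddrinfo(
        host, None, type=socket.SOCK_STREAM
    ):
        entry = (FAMILY_NAMES.get(family, str(family)), sockaddr[0])
        if entry not in results:  # skip duplicate entries
            results.append(entry)
    return results
```

    If the first pair returned for your head node name is an IPv6 address, passing the IPv4 address to /scheduler: as described above is a quick way to rule name resolution in or out.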

    Also check whether you can submit jobs from the compute nodes; this may help isolate the problem (whether it is a cluster configuration issue or a client problem).

    Lastly, check whether a firewall on your client is preventing your application from connecting to the headnode.


    Qiufang Shi

    Tuesday, November 3, 2015 1:39 AM
  • Qiufang, we figured it out!

    VNC viewer was installed on the headnode and it uses port 5800, so we had to turn that service off to free up the port. After that change everything is functioning well.
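
    For anyone hitting a similar port conflict, one way to spot it is to look at which process is listening on port 5800 on the head node. The sketch below parses the text printed by Windows `netstat -ano`; the function name `listening_pids` and the sample output (including its PIDs and addresses) are invented for illustration, and it assumes an English-language Windows where the state column reads LISTENING.

```python
def listening_pids(netstat_output, port):
    """Return the set of PIDs that have a TCP listener on the given port."""
    pids = set()
    for line in netstat_output.splitlines():
        fields = line.split()
        # A TCP listener row has exactly: Proto, Local, Foreign, State, PID.
        if len(fields) != 5 or fields[0] != "TCP" or fields[3] != "LISTENING":
            continue
        # The port is whatever follows the last colon (works for [::]:5800 too).
        local_port = fields[1].rsplit(":", 1)[-1]
        if local_port == str(port):
            pids.add(int(fields[4]))
    return pids


# Made-up sample in the layout printed by `netstat -ano`; not real data.
SAMPLE = """
  Proto  Local Address          Foreign Address        State           PID
  TCP    0.0.0.0:5800           0.0.0.0:0              LISTENING       2316
  TCP    0.0.0.0:5801           0.0.0.0:0              LISTENING       4120
  TCP    [::]:5800              [::]:0                 LISTENING       2316
  TCP    10.0.0.5:49712         10.0.0.9:5800          ESTABLISHED     988
"""

print(listening_pids(SAMPLE, 5800))
```

    On a real head node the input could come from `subprocess.run(["netstat", "-ano"], capture_output=True, text=True).stdout`, and a PID found this way can be matched to a process name with `tasklist /FI "PID eq <pid>"`.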

    Thanks for your help.

    Luke


    Friday, November 6, 2015 5:09 PM
  • I changed the WCF service to just run as the Local System account instead of specifying an account in the service panel.
    Sunday, November 10, 2019 6:18 PM