Job submit error

  • Question

  • Hi everyone!

          I have installed CCS2003 and am trying to develop a parallel application using MSMPI. My cluster has two compute nodes: clusternode2101 and clusternode2102. The head node is named headcluster210 and is not a compute node.


         My application works fine if I run it on clusternode2101, but not if I run it on clusternode2102. I have also found the following problem on that node:

    - if I submit a simple job such as the following:

    job submit  /numprocessors:4 /askednodes:clusternode2101,clusternode2102 /stdout:\\headcluster210\PDC\out.txt /stderr:\\headcluster210\PDC\err.txt /scheduler:headcluster210 mpiexec -l hostname


    the job fails on clusternode2102, although it runs correctly when I submit it from the head node or from clusternode2101. On clusternode2102 I got:

    Job ID                     : 68
    Status                    : Failed
    Name                     : APDCLUSTER\Administrator:Jul  2 2008 10:22AM
    Submitted by         : APDCLUSTER\Administrator
    Number of processors : 4-4
    Allocated nodes          :
    Submit time        : 7/2/2008 10:22:01 AM
    Start time           : 7/2/2008 10:22:03 AM
    End time             : 7/2/2008 10:22:03 AM
    Error message        : Failed to activate job 68. An error occurred while communicating with compute node CLUSTERNODE2101. Logon failure: unknown user name or bad password.
    Number of tasks         : 1
        Notsubmitted         : 0
        Queued                  : 1
        Running                 : 0
        Finished                 : 0
        Failed                     : 0
        Cancelled               : 0

    I also get the same error if I submit this job from an XP64 machine that has the CCS client utilities installed. What is the problem?

    Thanks in advance.
    Tudor Cret
    Wednesday, July 2, 2008 7:27 AM

Answers

  • In order to work correctly, your job needs to run under a domain account (the Administrator account only exists locally on each machine).

    Try submitting with a Domain account and let us know if that works.

    Thanks,
    Josh
    Tuesday, July 8, 2008 6:12 AM
    Moderator
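
A quick way to see whether the current logon is a local or a domain account is to compare two standard Windows environment variables: for a local account, `USERDOMAIN` normally holds the machine name, so it matches `COMPUTERNAME`. The sketch below only illustrates that heuristic; the function names are hypothetical and not part of CCS, and the check says nothing about whether the account has rights on the compute nodes.

```python
import os


def is_local_account(user_domain, computer_name):
    """Heuristic: a local logon reports the machine name as its domain."""
    return user_domain.strip().lower() == computer_name.strip().lower()


def current_logon_is_local(env=None):
    """Return True/False for the current logon, or None if undeterminable.

    Reads USERDOMAIN and COMPUTERNAME, which Windows sets for each session.
    On other systems these are usually absent, so the result is None.
    """
    env = os.environ if env is None else env
    domain = env.get("USERDOMAIN")
    computer = env.get("COMPUTERNAME")
    if not domain or not computer:
        return None
    return is_local_account(domain, computer)


if __name__ == "__main__":
    result = current_logon_is_local()
    if result is None:
        print("Could not determine the logon type (variables not set).")
    elif result:
        print("Local account: submit the job with a domain account instead.")
    else:
        print("Domain account.")
```

Running this on the machine you submit from, before calling `job submit`, shows which kind of account the job would be submitted under.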

All replies

  • It's hard to tell what the problem is from your description. When you run the job submit command you provided, do you get the error message listed?

    Can you confirm that your compute nodes are domain-joined and able to contact the domain? To verify that, try initiating a Remote Desktop connection to each node and make sure you can log in with your domain credentials.

    -Josh
    Wednesday, July 2, 2008 3:43 PM
    Moderator
  • Yes, I can connect to the domain. I use the Administrator account for submitting jobs, and it has the same password on all computers in the cluster. After I submit the job, the Job Scheduler GUI shows that it has failed. If I then run job view <jobid> from the command line, I see the status of the job: in the case presented above, job 68 failed because communication with node1 could not be established, as the user name or password was rejected. This happens even though I use the same user name and password (the Administrator account) on both nodes, for everything.
    Tudor Cret
    Thursday, July 3, 2008 11:28 PM
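
The check described in this post (run `job view <jobid>`, then read the status and error message) can be sketched in a few lines of Python. This assumes the output keeps the `Key : Value` layout shown in the question; `parse_job_view` and `is_logon_failure` are hypothetical helper names, not part of the CCS tools.

```python
def parse_job_view(text):
    """Turn the 'Key : Value' lines of job view output into a dict.

    Only the first colon on a line separates key from value, so values
    that contain colons themselves (times, error messages) stay intact.
    """
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    return fields


def is_logon_failure(fields):
    """True if the job failed and the error message reports a logon failure."""
    message = fields.get("Error message", "").lower()
    return fields.get("Status") == "Failed" and "logon failure" in message
```

Feeding it the text captured from `job view 68` would flag the job above as a credentials problem rather than an application error.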
  • It seems that the account was the problem. I created a new account in the domain, gave it administrative rights in the domain, and now it works. Thanks for the help.
    Tudor Cret
    Thursday, July 10, 2008 7:10 AM