Running parallel applications on CCS 2003 issue

  • Question

  • Hi everyone!

    Currently I am developing a simple parallel application using MS-MPI and CCS 2003. I have a cluster with two computational nodes.

    I wrote my application and started debugging it with the MPI Cluster Debugger. Everything works fine if I start debugging on one computational node, named 'node1', but when I try to debug the SAME application with the SAME settings on the second node, named 'node2', I get the following error:


    "unable to read authorization result from node2. socket connection closed

    Aborting: Access denied by node 'node2'.
    A common cause: mpiexec attempting to use this node which was not allocated to job '16.0' by the Compute Cluster scheduler.
    Press any key to continue . . ."


    '16.0' is the ID of the scheduler job that mpiexec uses to start the parallel application. My job very clearly specifies 'node2' as a computational node. It seems to me that 'node2' does not see the cluster's scheduler correctly. Can you help me with an idea or a solution?

    Thanks.


    Tudor Cret
    Tuesday, July 1, 2008 11:52 AM

Answers

  • Seems like the cluster is set up with its own domain; is the domain controller running on the headnode?

    Aha.. are you submitting from node2 or to node2?
    If it's from node2, it might be that you have the wrong credentials cached on node2.

    If it's failing when you submit to node2, I suggest that you check which users have permission to log on to node2. I would TS (Terminal Services) into node2 and check:
    Is node2 in the APDCLUSTER domain?
    Is APDCLUSTER\Administrator in the node2\Administrators group?
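
The group-membership check above can be scripted. The following is a minimal sketch that scans the text printed by the standard Windows command `net localgroup Administrators` for a given account. The helper names `group_members` and `is_member` are hypothetical, and the sample output is illustrative rather than captured from this cluster; it assumes the usual layout in which member names sit between a dashed rule and the closing status line.

```python
# Illustrative output of `net localgroup Administrators` (assumed layout,
# not captured from the cluster discussed in this thread).
SAMPLE = r"""Alias name     Administrators
Comment        Administrators have complete and unrestricted access

Members

-------------------------------------------------------------------------------
Administrator
APDCLUSTER\Administrator
APDCLUSTER\Domain Admins
The command completed successfully.
"""


def group_members(net_output):
    """Return the names listed after the dashed rule, up to the status line."""
    members = []
    in_list = False
    for line in net_output.splitlines():
        stripped = line.strip()
        if stripped.startswith("---"):
            in_list = True  # member names start on the next line
            continue
        if in_list:
            if not stripped or stripped.lower().startswith("the command completed"):
                break
            members.append(stripped)
    return members


def is_member(net_output, account):
    """Case-insensitive membership test, since Windows account names ignore case."""
    wanted = account.lower()
    return any(member.lower() == wanted for member in group_members(net_output))


if __name__ == "__main__":
    print(is_member(SAMPLE, r"APDCLUSTER\Administrator"))
```

On a real node, the text would come from running the command in a Terminal Services session on node2 and passing its output to `is_member`.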

    hope this helps,
    .Erez

    • Marked as answer by Tudor Cret Thursday, July 10, 2008 7:12 AM
    Monday, July 7, 2008 7:01 PM

All replies

  • Hi Tudor,

    Please see the following 2 resources regarding the Parallel Debugger within Visual Studio and MPI Application Debugging:

    1.  Basic Usage of the Visual Studio MPI Debugger:   http://windowshpc.net/Resources/Documents/BasicUsageParallelDebugger.zip

        a.   This article was written relative to CCS 2003 and Visual Studio 2005.  

    2.  Dr. Joe Hummel's "Classic HPC Dev" tutorial:  http://www.pluralsight.com/community/blogs/drjoe/archive/2008/06/19/51178.aspx

        a.   This tutorial includes MPI Debugger guidance relative to HPC Server 2008 and Visual Studio 2008.  It should also apply to CCS 2003 and VS 2005 although I haven't personally verified that.


    Thanks for using Windows HPC...!
    http://channel9.msdn.com/shows/the_hpc_show
    Wednesday, July 2, 2008 2:01 PM
  • Hi Tudor,

    I think that your problem might be very simple.

    The job scheduler allocates the resources and MPI uses them; most likely you specified the node to use on the mpiexec command line (e.g., with the -hosts switch), which conflicts with the resources allocated by the scheduler. A tip: don't specify the nodes on the mpiexec command line. For example,

    job submit /numprocessors:16 /workdir:\\headnode\share\app /stdout:out.txt /stderr:out.txt mpiexec myapp.exe 

    will launch the MPI application "myapp.exe" on 16 cores with the working directory \\headnode\share\app. The stdout and stderr of the application will be written to out.txt (in the workdir).
    Those 16 cores could be located on 1 to 16 nodes, depending on your configuration.

    To run your program on specific nodes, add "/askednodes:node1,node2" to the above command line.
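
The submission pattern above can be sketched as a small Python helper that assembles the `job submit` arguments. The function name `build_job_submit` and its parameters are hypothetical, made up for illustration; only switches that appear in this thread are used (/numprocessors, /askednodes, /workdir, /stdout, /stderr, /scheduler). Following the tip above, it rejects an mpiexec command that carries -hosts.

```python
def build_job_submit(command, numprocessors, workdir=None, stdout=None,
                     stderr=None, askednodes=None, scheduler=None):
    """Assemble a 'job submit' command line as a list of arguments.

    `command` is the task to run, e.g. ["mpiexec", "myapp.exe"].
    Node selection is left to the scheduler via /askednodes, so an
    mpiexec -hosts switch is rejected as a likely conflict.
    """
    if any(arg.lower() == "-hosts" for arg in command):
        raise ValueError("use /askednodes instead of mpiexec -hosts")

    args = ["job", "submit", f"/numprocessors:{numprocessors}"]
    if askednodes:
        args.append("/askednodes:" + ",".join(askednodes))
    if workdir:
        args.append(f"/workdir:{workdir}")
    if stdout:
        args.append(f"/stdout:{stdout}")
    if stderr:
        args.append(f"/stderr:{stderr}")
    if scheduler:
        args.append(f"/scheduler:{scheduler}")
    args.extend(command)
    return args


if __name__ == "__main__":
    cmd = build_job_submit(
        ["mpiexec", "myapp.exe"],
        16,
        workdir=r"\\headnode\share\app",
        stdout="out.txt",
        stderr="out.txt",
        askednodes=["node1", "node2"],
    )
    print(" ".join(cmd))
```

Returning a list rather than a single string keeps each switch as one argument, which is convenient if the result is later handed to a process launcher.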

    hope this helps,
    .Erez
    Monday, July 7, 2008 1:57 AM
  • Hi. Thanks for helping me. I investigated further and found that if I submit any kind of job from my second node, 'node2', the job fails if it has to use the other computational node, 'node1'. For example, I submit the following job from 'node2':

     
    job submit  /numprocessors:4 /askednodes:clusternode2101,clusternode2102 /stdout:\\headcluster210\PDC\out.txt /stderr:\\headcluster210\PDC\err.txt /scheduler:headcluster210 mpiexec -l hostname


    The job fails, and when I run job view from the command line I get:


    Job ID                     : 68
    Status                    : Failed
    Name                     : APDCLUSTER\Administrator:Jul  2 2008 10:22AM
    Submitted by         : APDCLUSTER\Administrator
    Number of processors : 4-4
    Allocated nodes          :
    Submit time        : 7/2/2008 10:22:01 AM
    Start time           : 7/2/2008 10:22:03 AM
    End time             : 7/2/2008 10:22:03 AM
    Error message        : Failed to activate job 68. An error occurred while communicating with compute node CLUSTERNODE2101. Logon failure: unknown user name or bad password.
    Number of tasks         : 1
        Notsubmitted         : 0
        Queued                  : 1
        Running                 : 0
        Finished                 : 0
        Failed                     : 0
        Cancelled               : 0

    This happens even if I do not specify the /askednodes parameter.
    I use the same password for Administrator on all nodes in the cluster, I checked the connectivity between the nodes and it is OK, and the domain is up and running.
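
A minimal sketch of how the `job view` listing above could be read programmatically, assuming each field is printed as "name : value" on its own line as in the output shown. The helper name `parse_job_view` is hypothetical, and the sample text copies a few fields from the listing above.

```python
# A few fields copied from the job view output in this post.
SAMPLE = r"""Job ID                     : 68
Status                    : Failed
Submitted by         : APDCLUSTER\Administrator
Allocated nodes          :
Error message        : Failed to activate job 68. An error occurred while communicating with compute node CLUSTERNODE2101. Logon failure: unknown user name or bad password.
"""


def parse_job_view(text):
    """Split each 'name : value' line at the first colon into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a colon
            fields[key.strip()] = value.strip()
    return fields


if __name__ == "__main__":
    fields = parse_job_view(SAMPLE)
    if fields.get("Status") == "Failed":
        print(fields.get("Error message", ""))
```

Splitting at the first colon only keeps values that themselves contain colons intact, such as the "Logon failure: ..." text. Note that "Allocated nodes" is empty here, so the job failed on the logon step before any node was allocated.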

    Thanks.



    Tudor Cret
    Monday, July 7, 2008 5:24 PM
  • "Seems like the cluster is setup with its own domain; is the domain controller running on the headnode?."
     

    Yes. It is setup with its own domain, with the domain controller running on the headnode.

    "Aha.. are you submitting from node2 or to node2?"

    It is failing when I submit from node2. I cleared the cached credentials and tried again, but I still get the same error. Also, node2 is in the APDCLUSTER domain.

    Thanks
    Tudor Cret
    Monday, July 7, 2008 7:42 PM
  • It seems I had to create a domain account and give it administrative rights. After that, everything works fine.
    Tudor Cret
    Thursday, July 10, 2008 7:12 AM
  • Dear all,

    I have encountered the same problem with ANSYS parallel.

    I have no idea how to solve the problem. Do Microsoft engineers offer email support? It is difficult for me to log in to this website.
    My email is chengczy@sohu.com. Thanks.
    Thursday, July 10, 2008 12:31 PM