How to select cores from nodes to run a job on multiple nodes

  • Question

  • Hi all,

        I have 2 nodes in my cluster, with 4 cores on each node.

        I have an executable called sleep.exe. When I submitted a job with job submit /numnodes:2 mpiexec -cores 2 sleep.exe, it started 2 sleep.exe processes on each node.


        I also have a 4-core Ansys CFX job, and I want to run it on 2 cores of the first node and 2 cores of the second node.

        I tried job submit /numnodes:2 /workdir:<working directory path> /stdout:out.log /stderr:error.log mpiexec -cores 2 cfx5solve.exe -v -def <.def file> -start-method MSMPI -part 4. The job failed and wrote the following errors to error.log:


    "An error has occurred in cfx5solve:

    Error reported by IO module: readIntFmtData: (fgets failed) syserr:: No
    error

    An error has occurred in cfx5solve:

    Error reported by IO module: iif_set_lock: error reading lock file
    //litocmaster/work/benchmark.def.lck: No error

    An error has occurred in cfx5solve:

    Neither Start Command nor Option is defined for start method MSMPI; check
    that you have given the method name correctly.

    An error has occurred in cfx5solve:

    Neither Start Command nor Option is defined for start method MSMPI; check
    that you have given the method name correctly.

    Can't call method "name" on an undefined value at C:\Program Files\ANSYS Inc\v110\CFX\bin\/perllib/CFX5/Job/Settings.pm line 2464.
    An error has occurred in cfx5solve:

    Neither Start Command nor Option is defined for start method MSMPI; check
    that you have given the method name correctly.

    An error has occurred in cfx5solve:

    Neither Start Command nor Option is defined for start method MSMPI; check
    that you have given the method name correctly.

    Can't call method "name" on an undefined value at c:\Program Files\ANSYS Inc\v110\CFX\bin\/perllib/CFX5/Job/Settings.pm line 2464.
    An error has occurred in cfx5solve:

    Neither Start Command nor Option is defined for start method MSMPI; check
    that you have given the method name correctly.

    An error has occurred in cfx5solve:

    Neither Start Command nor Option is defined for start method MSMPI; check
    that you have given the method name correctly.

    Can't call method "name" on an undefined value at c:\Program Files\ANSYS Inc\v110\CFX\bin\/perllib/CFX5/Job/Settings.pm line 2464.
    An error has occurred in cfx5solve:

    Neither Start Command nor Option is defined for start method MSMPI; check
    that you have given the method name correctly.

    An error has occurred in cfx5solve:

    Neither Start Command nor Option is defined for start method MSMPI; check
    that you have given the method name correctly.

    Can't call method "name" on an undefined value at C:\Program Files\ANSYS Inc\v110\CFX\bin\/perllib/CFX5/Job/Settings.pm line 2464.
    "
     


    But when I submit the job without mpiexec, it runs fine on the available resources.

    Does mpiexec work with all applications? Please give me suggestions on this. Has anybody tested this kind of scenario with STAR-CCM+?

    Regards,
    P. Kalyan Rao
     
    • Moved by Josh Barnard Wednesday, June 17, 2009 7:03 PM Seems to be more about MPI startup (From:Windows HPC Server Job Submission and Scheduling)
    Wednesday, June 17, 2009 7:29 AM

Answers

  • Hi Kalyan, could you check the -start-method option? The output of cfx5solve -help says the following:

    -start-method <name>
        Use the named start method to start the solver.  This option
        allows you to use different parallel methods, as listed in the
        Solver Manager GUI or in the etc/start-methods.ccl file, instead
        of the defaults.  For parallel start methods, you must also provide
        the -part or -par-dist arguments.

     Also, the value of this option should be quoted. I've tried -start-method "MPICH2 Local Parallel for Windows" and it appears to work for me.
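
     To illustrate the quoting, here is a minimal Python sketch that assembles the cfx5solve command line as an argument list and lets the standard library apply Windows quoting rules. It relies on subprocess.list2cmdline, a helper that exists in the subprocess module but is not part of its documented public API. The file name benchmark.def is taken from the error log above, the partition count 4 from the original command, and the function name build_cfx_command is hypothetical.

```python
import subprocess


def build_cfx_command(def_file, start_method, partitions):
    """Return the cfx5solve invocation as a list of separate arguments.

    Keeping each argument as its own list element means a start-method
    name containing spaces stays a single argument.
    """
    return [
        "cfx5solve",
        "-v",
        "-def", def_file,
        "-start-method", start_method,
        "-part", str(partitions),
    ]


args = build_cfx_command(
    "benchmark.def",                       # name taken from the error log
    "MPICH2 Local Parallel for Windows",   # start method James tried
    4,
)

# list2cmdline quotes any argument that contains spaces, which is what
# a Windows command line needs for the multi-word start-method name.
command_line = subprocess.list2cmdline(args)
print(command_line)
```

     Passing the list directly to subprocess.run would avoid manual quoting altogether; the joined string is only useful when the command has to be pasted into a job submit task.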

    Thanks,
    James
    Friday, June 19, 2009 11:03 PM
  • Hi Kalyan,

    Correct, the scheduler has no options for process placement; the available options are /numnodes, /numcores and /numsockets (the last one allocates one process per socket, and you can use it with the mpiexec -affinity option).

    Another option for you is to manipulate the CCP_NODES environment variable before calling mpiexec. That is, submit a script that changes CCP_NODES and then calls mpiexec.
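
    A minimal Python sketch of that idea follows. It assumes CCP_NODES has the form "<node count> <host1> <cores1> <host2> <cores2> ...", which should be verified on the cluster before relying on it. The host names NODE1 and NODE2 and the function names are hypothetical, and whether cfx5solve accepts the modified value is untested.

```python
def parse_ccp_nodes(value):
    """Parse 'N host1 cores1 host2 cores2 ...' into [(host, cores), ...].

    The layout is an assumption about the scheduler's CCP_NODES format.
    """
    fields = value.split()
    count = int(fields[0])
    pairs = fields[1:]
    if len(pairs) != 2 * count:
        raise ValueError("CCP_NODES does not match its declared node count")
    return [(pairs[i], int(pairs[i + 1])) for i in range(0, len(pairs), 2)]


def limit_cores(value, cores_per_node):
    """Return a CCP_NODES string with at most cores_per_node per host."""
    limited = [
        (host, min(cores, cores_per_node))
        for host, cores in parse_ccp_nodes(value)
    ]
    parts = [str(len(limited))]
    for host, cores in limited:
        parts.extend([host, str(cores)])
    return " ".join(parts)


# Example with two hypothetical 4-core nodes, limited to 2 cores each.
example = limit_cores("2 NODE1 4 NODE2 4", 2)
print(example)
```

    A wrapper script submitted as the job's task would read CCP_NODES from os.environ, write the limited value back, and then start mpiexec (for example with subprocess.run) so that the child process inherits the changed variable.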

    I assume that cfx5solve sees an inconsistency between the MPI world size and CCP_NODES and bails out. I can't see any other reason why the app behaves differently with and without this switch.

    thanks,
    .Erez
    Wednesday, June 24, 2009 6:54 PM

All replies

  • Hi,

    Is this the complete error output? It seems that the application, cfx5solve.exe, is bailing out even before MPI_Init. Please contact ANSYS support. Some applications want to run with a specific core configuration and check it internally; that might be the case here.

    thanks,
    .Erez

    P.S. Does it run correctly when you remove the "-cores 2" switch and replace /numnodes:2 with /numcores:8? (2 nodes using all cores)
    Thursday, June 18, 2009 4:26 PM
  • Hi Lio,

       When I put mpiexec before the cfx5solve command, the job terminates. Without mpiexec, the job runs fine.

    Thanks,
    P.Kalyan Rao
    Friday, June 19, 2009 10:28 AM
  • Hi James,

       I am using the -start-method MSMPI option. How should I use it in this situation?

    thanks,
    P. kalyan rao
    Saturday, June 20, 2009 4:15 AM
  • Hi Kalyan,

    You can't use MSMPI as the name for -start-method. First find the file start-methods.ccl on your computer. It contains a list of START METHOD entries and their corresponding usage. Find the one for Windows. For example, if you want to run your job in parallel, you need to use: -start-method "MPICH2 Distributed Parallel for Windows"
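
    As a rough aid for finding the valid names, the following Python sketch pulls the start-method names out of the text of a start-methods.ccl file. It assumes each entry begins with a line of the form "START METHOD: <name>", which may differ between CFX versions; the sample text and the function name list_start_methods are made up for illustration.

```python
import re

# Assumed header form: "START METHOD: <name>", possibly indented.
HEADER = re.compile(r"^\s*START METHOD\s*:\s*(.+?)\s*$")


def list_start_methods(ccl_text):
    """Return start-method names found in CCL text, in file order."""
    names = []
    for line in ccl_text.splitlines():
        match = HEADER.match(line)
        if match:
            names.append(match.group(1))
    return names


# Made-up sample in the assumed layout; a real file has more detail.
sample = """\
START METHOD: MPICH2 Local Parallel for Windows
  Option = Parallel
END
START METHOD: MPICH2 Distributed Parallel for Windows
  Option = Parallel
END
"""

methods = list_start_methods(sample)
for name in methods:
    print(name)
```

    To use it on a real installation, read the file with open(path).read() and pass the text in; any name it prints can then be supplied, in quotes, to -start-method.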

    Thanks,
    James

    Saturday, June 20, 2009 7:21 AM
  • Hi James,

       Ansys CFX 11 with SP1 supports Windows HPC, so we can use MSMPI as the -start-method. My customers run CFX on a Windows HPC cluster with the MSMPI option only. The command we use in the task list is: cfx5solve -v -def <Input file name> -start-method MSMPI -part <number of Processors>
    With this command the job runs on whatever cores are available.

      But there is no option in the job submission wizard to select specific cores from a specific node. That is possible with mpiexec -cores, so I tried adding it before the cfx5solve command, and I got the errors I posted first.

    Thanks and Regards,
    P.Kalyan Rao
    Sunday, June 21, 2009 6:40 AM