Problem with parallel Fluent job on Windows Compute Cluster 2003

  • Question

  • Hi,

    I'm running Fluent 6.3.26 but have problems submitting jobs through the job scheduler. I use the following job template:

    <?xml version="1.0" encoding="utf-8" ?>
     <Job xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" SoftwareLicense="" MaximumNumberOfProcessors="16" MinimumNumberOfProcessors="16" Runtime="00:01:00" IsExclusive="true" Priority="Normal" Name="test" Project="" RunUntilCanceled="false">
     <Tasks xmlns="http://www.microsoft.com/ComputeCluster/">
     <Task MaximumNumberOfProcessors="16" MinimumNumberOfProcessors="16" Depend="" WorkDirectory="\\my_dir" Name="My Task" CommandLine="fluent 3ddp -r6.3.26 -t16 -mpi=ms -g -hidden -i input-DES.jou > output.jou" IsExclusive="true" IsRerunnable="true" Runtime="Infinite">
     <EnvironmentVariables>
     <Variable>
      <Name>FLUENTLM_LICENSE_FILE</Name>
      <Value>1234@our_server</Value>
      </Variable>
      </EnvironmentVariables>
      </Task>
      </Tasks>
      </Job>




    However, when I examine the Fluent output, it is clear that all 16 parallel processes run on a single node:

    ------------------------------------------------------------------------------
    ID     Comm.   Hostname        O.S.        PID     Mach ID HW ID   Name       
    ------------------------------------------------------------------------------
    host   net     node024      Windows-x64 4092    0       2196    Fluent Host
    n15    msmpi   node024      Windows-x64 300     0       15      Fluent Node
    n14    msmpi   node024      Windows-x64 5036    0       14      Fluent Node
    n13    msmpi   node024      Windows-x64 3768    0       13      Fluent Node
    n12    msmpi   node024      Windows-x64 6908    0       12      Fluent Node
    n11    msmpi   node024      Windows-x64 5376    0       11      Fluent Node
    n10    msmpi   node024      Windows-x64 6508    0       10      Fluent Node
    n9     msmpi   node024      Windows-x64 5040    0       9       Fluent Node
    n8     msmpi   node024      Windows-x64 4040    0       8       Fluent Node
    n7     msmpi   node024      Windows-x64 6140    0       7       Fluent Node
    n6     msmpi   node024      Windows-x64 5928    0       6       Fluent Node
    n5     msmpi   node024      Windows-x64 3884    0       5       Fluent Node
    n4     msmpi   node024      Windows-x64 5068    0       4       Fluent Node
    n3     msmpi   node024      Windows-x64 2096    0       3       Fluent Node
    n2     msmpi   node024      Windows-x64 7108    0       2       Fluent Node
    n1     msmpi   node024      Windows-x64 6232    0       1       Fluent Node
    n0*    msmpi   node024      Windows-x64 4876    0       0       Fluent Node

    ------------------------------------------------------------------------------

    After I talked with the system administrator, he advised me to use the following command to run parallel jobs:

    Start /B fluent 3ddp -r6.3.26 -t16 -ccp node001 -g -hidden -i input-DES.jou > output3-DES.jou

    The resulting Fluent output shows that multiple nodes are used:

    ------------------------------------------------------------------------------
    ID     Comm.   Hostname        O.S.        PID     Mach ID HW ID   Name       
    ------------------------------------------------------------------------------
    n15    msmpi   node027      Windows-x64 3800    4       15      Fluent Node
    n14    msmpi   node027      Windows-x64 4048    4       14      Fluent Node
    n13    msmpi   node027      Windows-x64 1628    4       13      Fluent Node
    n12    msmpi   node034      Windows-x64 1036    3       12      Fluent Node
    n11    msmpi   node034      Windows-x64 2752    3       11      Fluent Node
    n10    msmpi   node034      Windows-x64 2936    3       10      Fluent Node
    n9     msmpi   node034      Windows-x64 3208    3       9       Fluent Node
    n8     msmpi   node024      Windows-x64 2804    2       8       Fluent Node
    n7     msmpi   node024      Windows-x64 6216    2       7       Fluent Node
    n6     msmpi   node024      Windows-x64 5432    2       6       Fluent Node
    n5     msmpi   node024      Windows-x64 4644    2       5       Fluent Node
    host   net     node001      Windows-x64 8188    1       1084    Fluent Host
    n4     msmpi   node022      Windows-x64 2908    0       4       Fluent Node
    n3     msmpi   node022      Windows-x64 4132    0       3       Fluent Node
    n2     msmpi   node022      Windows-x64 5632    0       2       Fluent Node
    n1     msmpi   node022      Windows-x64 5072    0       1       Fluent Node
    n0*    msmpi   node022      Windows-x64 996     0       0       Fluent Node

    ------------------------------------------------------------------------------
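One quick way to compare the two process tables above is to count the compute-node processes per hostname. The following is only a minimal sketch in Python: the function name `processes_per_host` is made up for illustration, and it assumes the column layout shown in the tables (hostname in the third column, rows ending in "Fluent Node").

```python
from collections import Counter


def processes_per_host(table_text):
    """Count Fluent compute-node processes per hostname in a process table."""
    counts = Counter()
    for line in table_text.splitlines():
        fields = line.split()
        # Data rows look like:
        #   n15 msmpi node027 Windows-x64 3800 4 15 Fluent Node
        # The "Fluent Host" row, header and separator lines are skipped.
        if len(fields) >= 9 and fields[-2:] == ["Fluent", "Node"]:
            counts[fields[2]] += 1
    return counts


# A few rows copied from the second table in this post.
sample = """\
n12    msmpi   node034      Windows-x64 1036    3       12      Fluent Node
n8     msmpi   node024      Windows-x64 2804    2       8       Fluent Node
n7     msmpi   node024      Windows-x64 6216    2       7       Fluent Node
host   net     node001      Windows-x64 8188    1       1084    Fluent Host
n0*    msmpi   node022      Windows-x64 996     0       0       Fluent Node
"""

for host, count in processes_per_host(sample).items():
    print(host, count)
```

If every process is reported under the same hostname, as in the first table, the job did not spread across the nodes the scheduler allocated.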

    This is already a lot better than the first option, but I'm not fully satisfied, since this method always uses the head node (node001) as the Fluent host. If multiple parallel jobs are started, this will slow down the head node, which in turn will slow down all the jobs.

    Does anyone know how to submit a Fluent job that runs on multiple nodes and does not use the head node as the Fluent host?

    Thanks in advance,

    Koen

    Tuesday, June 24, 2008 3:50 PM

Answers

  • Hi,

    I couldn't get Fluent to work with the -ccp flag, so I tried to start Fluent the same way as on the Linux cluster. I created the following batch file, which creates a fluenthost.txt file containing the names of the nodes. This is probably not the most beautiful code, but it works.

    REM CCP_NODES is read here as "<count> <node1> <cpus1> <node2> <cpus2> ...".
    REM The fixed offsets below assume a single-digit node count (at most 8 nodes),
    REM node names of exactly 10 characters and single-digit CPU counts.
    SET NofN=%CCP_NODES:~0,1%

    REM Start from an empty host file so that a rerun does not append to old entries.
    IF EXIST \test\fluenthost.txt DEL \test\fluenthost.txt

    FOR /L %%G IN (1,1,%NofN%) DO SET NODE%%G=[]

    REM Offset of each node name: skip "<count> ", then 13 characters per entry.
    FOR /L %%G IN (1,1,%NofN%) DO SET /A avoid%%G=2+(%%G-1)*13

    SET length=10

    CALL SET node1=%%CCP_NODES:~%avoid1%,%length%%%
    CALL SET node2=%%CCP_NODES:~%avoid2%,%length%%%
    CALL SET node3=%%CCP_NODES:~%avoid3%,%length%%%
    CALL SET node4=%%CCP_NODES:~%avoid4%,%length%%%
    CALL SET node5=%%CCP_NODES:~%avoid5%,%length%%%
    CALL SET node6=%%CCP_NODES:~%avoid6%,%length%%%
    CALL SET node7=%%CCP_NODES:~%avoid7%,%length%%%
    CALL SET node8=%%CCP_NODES:~%avoid8%,%length%%%

    FOR /L %%G IN (1,1,%NofN%) DO SET CPU%%G=[]

    REM Offset of each CPU count: the single digit that follows each node name.
    FOR /L %%G IN (1,1,%NofN%) DO SET /A avoidCPU%%G=%%G*13

    SET CPUlength=1

    CALL SET CPU1=%%CCP_NODES:~%avoidCPU1%,%CPUlength%%%
    CALL SET CPU2=%%CCP_NODES:~%avoidCPU2%,%CPUlength%%%
    CALL SET CPU3=%%CCP_NODES:~%avoidCPU3%,%CPUlength%%%
    CALL SET CPU4=%%CCP_NODES:~%avoidCPU4%,%CPUlength%%%
    CALL SET CPU5=%%CCP_NODES:~%avoidCPU5%,%CPUlength%%%
    CALL SET CPU6=%%CCP_NODES:~%avoidCPU6%,%CPUlength%%%
    CALL SET CPU7=%%CCP_NODES:~%avoidCPU7%,%CPUlength%%%
    CALL SET CPU8=%%CCP_NODES:~%avoidCPU8%,%CPUlength%%%

    REM Write each node name once per allocated CPU.
    FOR /L %%G IN (1,1,%CPU1%) DO echo %node1% >> \test\fluenthost.txt
    FOR /L %%G IN (1,1,%CPU2%) DO echo %node2% >> \test\fluenthost.txt
    FOR /L %%G IN (1,1,%CPU3%) DO echo %node3% >> \test\fluenthost.txt
    FOR /L %%G IN (1,1,%CPU4%) DO echo %node4% >> \test\fluenthost.txt
    FOR /L %%G IN (1,1,%CPU5%) DO echo %node5% >> \test\fluenthost.txt
    FOR /L %%G IN (1,1,%CPU6%) DO echo %node6% >> \test\fluenthost.txt
    FOR /L %%G IN (1,1,%CPU7%) DO echo %node7% >> \test\fluenthost.txt
    FOR /L %%G IN (1,1,%CPU8%) DO echo %node8% >> \test\fluenthost.txt

     
    When I submit a job to the scheduler it has two tasks. The first task is to run the batch file which creates a fluenthost.txt file. The second task is to start Fluent using the following command: 

    fluent 3ddp -r6.3.26 -t16 -cnf=fluenthost.txt -mpi=ms -g -hidden -i input-DES.jou > output-DES.jou

    This runs Fluent in parallel on the nodes listed in the fluenthost.txt file and avoids running a Fluent process on the head node. It may not be the best way to do what I want, but it works.
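The batch file depends on fixed character offsets, so it only works for node names of one particular length. As a rough illustration of the same idea without that restriction, here is a minimal Python sketch. It assumes that CCP_NODES has the form "<count> <node1> <cpus1> <node2> <cpus2> ...", which is inferred from the offsets in the batch file rather than confirmed in this thread, and the function name `host_lines` is made up.

```python
def host_lines(ccp_nodes):
    """Expand a CCP_NODES value into one hostname per allocated processor."""
    fields = ccp_nodes.split()
    count = int(fields[0])          # first field: number of allocated nodes
    pairs = fields[1:]              # then alternating node name / CPU count
    if len(pairs) != 2 * count:
        raise ValueError("CCP_NODES does not match its own node count")
    lines = []
    for i in range(0, len(pairs), 2):
        name, cpus = pairs[i], int(pairs[i + 1])
        lines.extend([name] * cpus)  # repeat the name once per CPU
    return lines


# Example value only; on the cluster the scheduler would set CCP_NODES.
print("\n".join(host_lines("2 node022 2 node024 1")))
```

In a real task, the joined lines would be written to fluenthost.txt from `os.environ["CCP_NODES"]` before Fluent starts. That requires Python on the compute nodes, which is an assumption about the cluster.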

    Koen
    Monday, July 7, 2008 4:18 PM

All replies

  • Koen --

    Does it work if you try adding -ccp <headnode> to the command line in your job template?

    --

    John S Costello -- Software Dev. Engineer /  Test -- Windows HPC Server 2008

    "The urge to fly from modern systems, instead of moving through them to even greater, fairer things is, I think, an indication of deep weariness and confusion." -- Dwayne Monroe

    Tuesday, June 24, 2008 8:41 PM
    Answerer
  • At the moment extra nodes are being added so I will give it a try once the cluster is up and running again.

    Thanks,

    Koen
    Wednesday, June 25, 2008 8:59 AM
  • Koen --

    While you're at it, I suggest setting IsExclusive="false" in your template.  Here's why:  when you submit the job, it will start a fluent host process on some compute node.  That host process will then attempt to submit another job to the scheduler specified in -ccp <headnode>; this job will perform the actual computation.  But if you have IsExclusive="true" on the host process job, it will consume one entire node's worth of resources while not actually doing much useful work.

    However, if you have IsExclusive="false" then other non-exclusive jobs can be scheduled to run on that node.  For instance, if you have multiple fluent runs going at once, if the host processes are scheduled non-exclusive then they can all run on one node (up to one per processor core).  This won't affect your performance (much), because the host processes are not very resource-intensive jobs.
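For anyone who would rather script the change than edit the template by hand, the sketch below flips IsExclusive on the job and on each task. It is only an illustration: the function name `make_non_exclusive` is made up, the template is a trimmed copy of the one in the question with the -ccp flag added, and whether both the Job and the Task attribute need changing is an assumption.

```python
import xml.etree.ElementTree as ET

# Namespace declared on the <Tasks> element of the job template.
CCS_NS = "http://www.microsoft.com/ComputeCluster/"


def make_non_exclusive(job_xml):
    """Return the parsed job with IsExclusive="false" on the job and its tasks."""
    root = ET.fromstring(job_xml)
    root.set("IsExclusive", "false")
    # <Task> elements inherit the default namespace from <Tasks>.
    for task in root.iter("{%s}Task" % CCS_NS):
        task.set("IsExclusive", "false")
    return root


# Trimmed copy of the template from the question, for illustration only.
template = """<Job IsExclusive="true" Name="test">
  <Tasks xmlns="http://www.microsoft.com/ComputeCluster/">
    <Task IsExclusive="true" Name="My Task"
          CommandLine="fluent 3ddp -r6.3.26 -t16 -ccp node001 -g -hidden -i input-DES.jou" />
  </Tasks>
</Job>"""

job = make_non_exclusive(template)
print(job.get("IsExclusive"))
```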

    --

    John S Costello -- Software Dev. Engineer /  Test -- Windows HPC Server 2008

    "The urge to fly from modern systems, instead of moving through them to even greater, fairer things is, I think, an indication of deep weariness and confusion." -- Dwayne Monroe

    Wednesday, June 25, 2008 8:09 PM
    Answerer
  • Hi,

    Should I add -ccp <headnode> or -ccp %CCP_NODES% to the command line? I ask because I want to run the job on the normal nodes and not use the head node as the Fluent host.

    I have another question about submitting Fluent jobs. On our university's Linux cluster there is a script that reads the environment variable holding the list of assigned nodes and writes these to a file. Fluent is then started with a command that references that file (see below). Is such a script needed on Windows Compute Cluster Server 2003, or does the scheduler take care of this?

    Thanks,

    Koen

    MACHINE="fluent_hosts"
    cat $PBS_NODEFILE | awk '{ printf "%s\n",$1 }' > $MACHINE

    fluent63 3ddp -t$NUMPROCS -pgmpi -cnf=$MACHINE -g -i  input.jou >> output.jou

    Thursday, June 26, 2008 10:50 AM
  •  Koen --

    You should add -ccp <headnode> to the command line.  The sequence will then be:

    1. Your job runs.  It starts the fluent host process on some compute node (chosen by the CCP scheduler).
    2. The fluent host process submits a job to the scheduler specified in the -ccp <headnode> command line flag. This job starts the worker processes on compute nodes chosen by the CCP scheduler.

    Strictly speaking, your fluent host process doesn't need to run on the cluster at all.  It only needs to run on some machine which has direct network access to the compute nodes of the cluster (it cannot run through a NAT).  In practice, the only machines which can be guaranteed to have such access are the cluster compute nodes and the cluster headnode.

    To answer your second question:  it is possible to specify which machines you want to run your job on with a configuration file, but it is not necessary; if you don't, the scheduler will handle it automatically.
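To make the two options concrete, the sketch below assembles the Fluent command line either with -ccp <headnode> (the scheduler picks the nodes) or with an explicit -cnf host file. It uses only flags that already appear in this thread; the function name `fluent_command` and its parameters are made up for illustration.

```python
def fluent_command(journal, output, nprocs=16, headnode=None, hostfile=None):
    """Build a Fluent command line from the flags used in this thread."""
    if headnode is not None and hostfile is not None:
        raise ValueError("choose either a head node (-ccp) or a host file (-cnf)")
    args = ["fluent", "3ddp", "-r6.3.26", "-t%d" % nprocs]
    if headnode is not None:
        # The Fluent host submits the worker job to this scheduler.
        args += ["-ccp", headnode]
    elif hostfile is not None:
        # The machines are listed explicitly in a configuration file.
        args += ["-cnf=%s" % hostfile, "-mpi=ms"]
    args += ["-g", "-hidden", "-i", journal, ">", output]
    return " ".join(args)


print(fluent_command("input-DES.jou", "output3-DES.jou", headnode="node001"))
print(fluent_command("input-DES.jou", "output-DES.jou", hostfile="fluenthost.txt"))
```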

    --

    John S Costello -- Software Dev. Engineer /  Test -- Windows HPC Server 2008

    "The urge to fly from modern systems, instead of moving through them to even greater, fairer things is, I think, an indication of deep weariness and confusion." -- Dwayne Monroe

    Thursday, June 26, 2008 3:56 PM
    Answerer
  • You might also consider setting your HN to not be a Compute Node . . . you can do this by selecting your HN in the Node Management console, taking it Offline, and then selecting "Change Role."


    -Josh
    Thursday, June 26, 2008 5:42 PM
    Moderator
  • Hi,

    I added the -ccp headnode to the command line and a Fluent host process gets started on a compute node. This process then wants to submit a job to the scheduler which requires my password. Since this is all happening in the background I can't enter it and the process continues to ask for my password until I cancel the job. Should I add the password to the job template? (which is not very safe in my opinion) or is there another way to do this?

    Thanks,

    Koen
    Friday, June 27, 2008 10:48 AM
  • Glad you figured something out; thanks for posting your approach here!
    -Josh
    Tuesday, July 8, 2008 6:09 AM
    Moderator
  • Hi Costello, I want to run Fluent 6.3 on Windows HPC 2008, but I am not confident about it.
    First, I want to run without using the job scheduler (if possible). The problem is that the user guide does not offer sufficient help for Fluent on Windows HPC 2008 or 2003.
    Second, I want to run from the command prompt.

    Questions

    Do I really need the XML script? I am not familiar with the XML format.

    Can't I use the following command?
    fluent 3d -t2 -pmsmpi -cnf=host

    Do I need to install RSHD on all nodes in the winwow64 folder, as we used to do in older versions of Windows?

    Kindly reply to me at shamoonjamshed@yahoo.com as well.
    Saturday, January 23, 2010 6:26 PM