MPI Error

    Question

  • Hi.

    I'm setting up a Windows HPC Server 2008 R2 SP1 cluster (HPC Pack 2008 R2 SP2). It contains one Head node and five workstations. Its purpose is to run Abaqus jobs on the cluster. Smaller jobs execute successfully, but when running larger jobs (abaqus -job "jobname" -cpus 8 -input CCshort_layup_FO_load.inp) I get the following error messages:

    [0] process exited without calling finalize

    [1] fatal error
    Fatal error in PMPI_Send: Other MPI error, error stack:
    PMPI_Send(150)..........: MPI_Send(buf=0x0000000031C10040, count=833568, MPI_BYTE, dest=0, tag=0, comm=0x84000000) failed
    MPIDI_CH3I_Progress(235):
    RecvFailed(105).........:
    ReadFailed(1274)........: An existing connection was forcibly closed by the remote host.  (errno 10054)

    [2-3] terminated
    [0] on client1
    C:\Users\ADMINI~1\AppData\Local\Temp\Administrator_ava_2692\jobname.bat ended prematurely and may have crashed. exit code 0xc0000374

    All nodes run Intel Core 2 CPUs @ 2.13 GHz with 2 GB of RAM. The workstations run Windows 7 x64 SP1. Do you have any idea what is causing this behavior? Please help me out.


    • Edited by Zorky Friday, 11 November, 2011 1:44 PM
    Tuesday, 8 November, 2011 1:20 PM

Answers

  • Hi everybody!

    I'm sorry that I haven't replied earlier.

    The problem that caused my error messages was a corrupt file, namely the job file "CCshort_layup_FO_load.inp". Once I got a new job file that worked, I never received the error messages again.

    Thank you all for trying to help me!

    Best regards, Zorky


    Zorky HPC
    • Marked as answer by Zorky Tuesday, 31 January, 2012 2:37 PM
    Tuesday, 31 January, 2012 2:37 PM

All replies

  • There are several different problems that can lead to this error:

    1) Firewall issues: Is the firewall enabled on the network that this MPI job runs on? If the firewall is enabled, is Abaqus added to the list of allowed applications? You can use hpcfwutil to register Abaqus (assuming the executable is abq692.exe):

    clusrun hpcfwutil register abaqus C:\abaqus\692\abq692.exe

    2) Network issues: If the firewall isn't the problem, have you tried running the MPI diagnostics on the cluster to see whether the network is functioning properly? Select the nodes you want to run diagnostics on, right-click, choose "Run Diagnostics", and then run all three MPI diagnostics tests. Check the diagnostic results to make sure they all look good.

    3) If the two suggestions above still do not work, can you give more details on the cluster setup? In particular, the network configuration of the cluster (does MPI run on a private network or the enterprise network, is NetworkDirect enabled, etc.)?

     

    Wednesday, 30 November, 2011 7:09 PM
  • Well, I'm having a similar problem after installing HPC Pack SP3; before that I was using HPC Pack R2. By the way, the firewall service is disabled on all the compute nodes, but I tried enabling it again and used the first option you suggested, and nothing changed. MPI Diagnostics is not working either; it gives errors on all nodes. Here is the error from the Radioss run; I hope you can help:


    ROOT: shells_regular_00 RESTART: 0001
    NUMBER OF HMPP PROCESSES 48
    10/12/2011

    job aborted:
    [ranks] message

    [0] fatal error
    Fatal error in PMPI_Gather: Invalid buffer pointer, error stack:
    PMPI_Gather(583): MPI_Gather(sbuf=0x000000000B70EC00, scount=10, MPI_DOUBLE_PRECISION, rbuf=0x000000000B70EC00, rcount=10, MPI_DOUBLE_PRECISION, root=0, MPI_COMM_WORLD) failed
    PMPI_Gather(507): Buffers must not be aliased

    [1-15] terminated

    ---- error analysis -----

    [0] on HEXANALIZSRV01
    mpi has detected a fatal error and aborted \\hexanalizsrv01\hw11\e_11.0_win64_msmpi.exe

    ---- error analysis -----

    Saturday, 10 December, 2011 4:25 PM
  • Can you give me the error from the failing MPI Diagnostics?

    The error from Radioss suggests that the application is using the same buffer as both the send buffer and the receive buffer, which is prohibited by the MPI standard (although some MPI implementations allow it). The application has to use different buffers for sending and receiving, or use MPI_IN_PLACE:

    http://www.mpi-forum.org/docs/mpi-20-html/node145.htm
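
    For illustration, here is a minimal C sketch (not the Radioss source, just a hypothetical gather of 10 values per rank, matching the counts in the error above) showing the aliased call that triggers this error and the MPI_IN_PLACE form that the standard allows at the root:

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int count = 10;  /* 10 values per rank, as in the failing MPI_Gather call */
        double *buf = malloc((size_t)size * count * sizeof(double));
        for (int i = 0; i < count; ++i)
            buf[i] = rank + i * 0.1;  /* each rank fills its own contribution */

        /* WRONG: passing the same pointer as send and receive buffer on the root
           is what produces "Buffers must not be aliased":
           MPI_Gather(buf, count, MPI_DOUBLE, buf, count, MPI_DOUBLE, 0, MPI_COMM_WORLD); */

        /* Correct: the root passes MPI_IN_PLACE as the send buffer; its own
           contribution is taken from its slot of the receive buffer. */
        if (rank == 0)
            MPI_Gather(MPI_IN_PLACE, count, MPI_DOUBLE, buf, count, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        else
            MPI_Gather(buf, count, MPI_DOUBLE, NULL, 0, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("gathered %d x %d doubles\n", size, count);

        free(buf);
        MPI_Finalize();
        return 0;
    }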

     

    --Anh

    Wednesday, 4 January, 2012 5:40 PM
  • I was wondering, what is the output of the MPI Diagnostics for latency and throughput?

    Thanks,

    James

    Saturday, 28 January, 2012 1:03 AM
  • Hi,

    Could you give more details about the solution? I have the same problem.

    Thank you

    Wednesday, 21 May, 2014 10:30 AM