MPI on IPoIB errors

  • Question

  • I've been trying to run the HPL code following the instructions I found on the "old" site.

    Below is what happens when I try mpiexec -n 4 xhpl.exe on a single node; I've tried this on multiple nodes and get essentially the same results. Has anyone encountered this issue, and if so, how can it be resolved?

    The MPI network is InfiniBand (IB). Thanks.

     

    ============================================================================
    HPLinpack 1.0a  --  High-Performance Linpack benchmark  --   January 20, 2004
    Written by A. Petitet and R. Clint Whaley,  Innovative Computing Labs.,  UTK
    ============================================================================

    An explanation of the input/output parameters follows:
    T/V    : Wall time / encoded variant.
    N      : The order of the coefficient matrix A.
    NB     : The partitioning blocking factor.
    P      : The number of process rows.
    Q      : The number of process columns.
    Time   : Time in seconds to solve the linear system.
    Gflops : Rate of execution for solving the linear system.

    The following parameter values will be used:

    N      :      29       30       34       35
    NB     :       1        2        3        4
    PMAP   : Row-major process mapping
    P      :       2        1        4
    Q      :       2        4        1
    PFACT  :    Left    Crout    Right
    NBMIN  :       2        4
    NDIV   :       2
    RFACT  :    Left    Crout    Right
    BCAST  :   1ring
    DEPTH  :       0
    SWAP   : Mix (threshold = 64)
    L1     : transposed form
    U      : transposed form
    EQUIL  : yes
    ALIGN  : 8 double precision words

    ----------------------------------------------------------------------------

    - The matrix A is randomly generated for each test.
    - The following scaled residual checks will be computed:
       1) ||Ax-b||_oo / ( eps * ||A||_1  * N        )
       2) ||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  )
       3) ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo )
    - The relative machine precision (eps) is taken to be         1.110223e-016
    - Computational tests pass if scaled residuals are less than           16.0


    job aborted:
    rank: node: exit code: message
    0: cats002: 1: process exited without calling finalize
    1: cats002: terminated
    2: cats002: 1: process exited without calling finalize
    3: cats002: terminated

    ---- error analysis -----

    0: xhpl.exe ended prematurely and may have crashed on cats002
    2: xhpl.exe ended prematurely and may have crashed on cats002

    ---- error analysis -----
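
    (For reference, the parameter echo above looks like the stock HPL.dat that ships with the benchmark; an input file reproducing exactly these values would look like the following. This is a reconstruction from the echo, not necessarily the exact file in use.)

        HPLinpack benchmark input file
        Innovative Computing Laboratory, University of Tennessee
        HPL.out      output file name (if any)
        6            device out (6=stdout, 7=stderr, file)
        4            # of problems sizes (N)
        29 30 34 35  Ns
        4            # of NBs
        1 2 3 4      NBs
        0            PMAP process mapping (0=Row-major, 1=Column-major)
        3            # of process grids (P x Q)
        2 1 4        Ps
        2 4 1        Qs
        16.0         threshold
        3            # of panel fact
        0 1 2        PFACTs (0=Left, 1=Crout, 2=Right)
        2            # of recursive stopping criterium
        2 4          NBMINs (>= 1)
        1            # of panels in recursion
        2            NDIVs
        3            # of recursive panel fact.
        0 1 2        RFACTs (0=Left, 1=Crout, 2=Right)
        1            # of broadcast
        0            BCASTs (0=1ring, 1=1ringM, 2=2ring, 3=2ringM, 4=Lng, 5=LngM)
        1            # of lookahead depth
        0            DEPTHs (>= 0)
        2            SWAP (0=bin-exch, 1=long, 2=mix)
        64           swapping threshold
        0            L1 in (0=transposed, 1=no-transposed) form
        0            U  in (0=transposed, 1=no-transposed) form
        1            Equilibration (0=no, 1=yes)
        8            memory alignment in double (> 0)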

    Friday, November 16, 2007 2:22 PM

Answers

  • I ran into the same problems when I first tried Linpack.
    My suggestion is that you first try to run it from a console on the developer machine,
    i.e. typing mpiexec -n 4 xhpl.exe.
    If this works, remotely log into a compute node and try it there.
    Finally, you can place the executable locally on that node, e.g. c:\temp\hpl\xhpl.exe, and run a single-node job using that local copy.
    That way you avoid hitting any of the UNC path issues.
    To test the scheduler settings, place the executable in c:\temp\hpl on two nodes and try to run across them. A rough command sequence for these steps is sketched below.
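
    As a rough sketch, assuming the Compute Cluster Pack command-line tools (mpiexec and the job command; flag names can differ between versions, so check job submit /? on your cluster), the sequence looks something like this:

        rem 1) From a console on the developer/head node:
        mpiexec -n 4 xhpl.exe

        rem 2) Remote Desktop into a compute node (cats002 in your output) and run the same command there.

        rem 3) Copy xhpl.exe and its HPL.dat into a local folder on that node, then submit a
        rem    single-node job that points at the local copy rather than a UNC path
        rem    (restrict the job to that node with the scheduler's node-selection option):
        job submit /numprocessors:4 mpiexec c:\temp\hpl\xhpl.exe

        rem 4) Once that works, copy the same files to c:\temp\hpl on a second node and submit
        rem    the job across both nodes to verify the scheduler settings.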

    Hope that helps to determine the problem.


    Johannes

    Wednesday, November 21, 2007 10:42 AM