Running GigE benchmarks on Infiniband CCS 2003 cluster?

  • Question

  • Hi,
    I'm trying to compare GigE and Infiniband performance on my CCS 2003 (v1) cluster using a homegrown FORTRAN code.

    I ran the Infiniband cases up to 16 procs (4 nodes) by setting MPICH_NETMASK appropriately (192.168.160.0/255.255.255.0).

    If I now try to run the same scalability test cases for the GigE comparison, they run fine on 8 procs (2 nodes) but fail on 16 procs (4 nodes). I cleared the environment variable (set it to blank), but this does not work. I also tried explicitly selecting the GigE network by setting MPICH_NETMASK=192.168.60.0/255.255.255.0 (which is the GigE subnet). This does not work either.

    It appears that it is still trying to use the Infiniband network when I do not want it to. Does the cluster keep the MPICH_NETMASK setting from a job (i.e. from when I set it to Infiniband) and apply it to all future jobs, or does it revert to a default setting?

    Stderr output is below...

    Any ideas?

    Many thanks,
     Kenji
    www.mihpc.net


    Stderr looks like this:

    job aborted:
    rank: node: exit code: message
    0: SB-NODE027: terminated
    1: SB-NODE027: terminated
    2: SB-NODE027: terminated
    3: SB-NODE027: terminated
    4: SB-NODE025: fatal error: Fatal error in MPI_Ssend: Other MPI error, error stack:
    MPI_Ssend(166)......................: MPI_Ssend(buf=0x000000000198FC90, count=1, MPI_INTEGER, dest=0, tag=4, MPI_COMM_WORLD) failed
    MPID_Ssend(148).....................: failure occurred while attempting to send an eager message
    MPIDI_CH3_iSendv_internal(242)......:
    MPIDI_CH3I_Sock_connect(381)........: [ch3:sock] rank 4 unable to connect to rank 0 using business card <port=3365 description="152.78.60.170 192.168.60.175 sb-node027 " shm_host=sb-node027 shm_queue=10BD71FC-2CBD-4ac0-97A4-9C3646F5EDFA >
    MPIDU_Sock_post_connect_filter(1258): unable to connect to 152.78.60.170 192.168.60.175 sb-node027  on port 3365, no endpoint matches the netmask 192.168.160.0/255.255.255.0
    5: SB-NODE025: fatal error: Fatal error in MPI_Ssend [same error stack as rank 4, tag=5]
    6: SB-NODE025: fatal error: Fatal error in MPI_Ssend [same error stack as rank 4, tag=6]
    7: SB-NODE025: fatal error: Fatal error in MPI_Ssend [same error stack as rank 4, tag=7]
    8: SB-NODE033: terminated
    9: SB-NODE033: terminated
    10: SB-NODE033: terminated
    11: SB-NODE033: terminated
    12: SB-NODE031: fatal error: Fatal error in MPI_Ssend: Other MPI error, error stack:
    MPI_Ssend(166)......................: MPI_Ssend(buf=0x000000000198FC90, count=1, MPI_INTEGER, dest=0, tag=12, MPI_COMM_WORLD) failed
    MPID_Ssend(148).....................: failure occurred while attempting to send an eager message
    MPIDI_CH3_iSendv_internal(242)......:
    MPIDI_CH3I_Sock_connect(381)........: [ch3:sock] rank 12 unable to connect to rank 0 using business card <port=3365 description="152.78.60.170 192.168.60.175 sb-node027 " shm_host=sb-node027 shm_queue=10BD71FC-2CBD-4ac0-97A4-9C3646F5EDFA >
    MPIDU_Sock_post_connect_filter(1258): unable to connect to 152.78.60.170 192.168.60.175 sb-node027  on port 3365, no endpoint matches the netmask 192.168.160.0/255.255.255.0
    13: SB-NODE031: fatal error: Fatal error in MPI_Ssend [same error stack as rank 12, tag=13]
    14: SB-NODE031: fatal error: Fatal error in MPI_Ssend [same error stack as rank 12, tag=14]
    15: SB-NODE031: fatal error: Fatal error in MPI_Ssend [same error stack as rank 12, tag=15]

    ---- error analysis -----

    4: mpi has detected a fatal error and aborted sotoncaa.exe run on SB-NODE025
    5: mpi has detected a fatal error and aborted sotoncaa.exe run on SB-NODE025
    6: mpi has detected a fatal error and aborted sotoncaa.exe run on SB-NODE025
    7: mpi has detected a fatal error and aborted sotoncaa.exe run on SB-NODE025
    12: mpi has detected a fatal error and aborted sotoncaa.exe run on SB-NODE031
    13: mpi has detected a fatal error and aborted sotoncaa.exe run on SB-NODE031
    14: mpi has detected a fatal error and aborted sotoncaa.exe run on SB-NODE031
    15: mpi has detected a fatal error and aborted sotoncaa.exe run on SB-NODE031

    ---- error analysis -----


    Kenji
    Wednesday, October 1, 2008 2:10 PM

Answers

  • Hi, Kenji. 

    As you've surmised, this part of the error spew:  
        "MPIDU_Sock_post_connect_filter(1258): unable to connect to 152.78.60.170 192.168.60.175 sb-node027
         on port 3365, no endpoint matches the netmask 192.168.160.0/255.255.255.0" 
    means that MS-MPI is trying to use the 192.168.160.x subnet, but only 152.78.xx.xx and 192.168.60.xx endpoints were found on the target node. So, as you suggest, we need to steer the MPI traffic to the 192.168.60.xx subnet for your GigE test.
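
    Roughly, the socket connect filter ANDs each of the target node's advertised addresses with the mask and compares the result against the requested network. For sb-node027's two addresses, with mask 255.255.255.0 (keep the first three octets):

        152.78.60.170   ->  152.78.60.0     (not 192.168.160.0, rejected)
        192.168.60.175  ->  192.168.60.0    (not 192.168.160.0, rejected)

    No endpoint passes the filter, so the connection from rank 4 back to rank 0 is refused.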

    The answer to your question, "Does the cluster keep the MPICH_NETMASK setting from a job (i.e. from when I set it to Infiniband) and apply it to all future jobs, or does it revert to a default setting?" depends on how you set the environment variable.

    There are three methods (listed from lowest to highest precedence):

    1. Set the MPI network in the Networking portion of the Administrator's Console  [this setting is persistent]
    2. Set the CCP_MPI_NETMASK environment variable using the cluscfg command line tool [this setting is persistent]
          cluscfg setenvs CCP_MPI_NETMASK=xxx.xxx.xxx.xxx/yyy.yyy.yyy.yyy
    3. Set the MPICH_NETMASK environment variable in the mpiexec command for your application [this setting exists only in the context of the running job and dies when the job is complete]
          job submit /numcores:64 mpiexec -env MPICH_NETMASK 157.59.0.0/255.255.0.0 myApp.exe
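
    By the way, if you want to check whether a persistent cluster-wide value (method #2) is already set and quietly overriding what you expect, you can list the cluster-wide environment variables; if I remember the v1 syntax correctly, that's:
          cluscfg listenvs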

    I suggest you use #3 as it's easiest to change from job to job, and you don't have to be a cluster admin to do it. 
    PLEASE NOTE: It's easy to get the syntax wrong on this command. Note the space character between MPICH_NETMASK and its value, rather than the equals sign ("=") normally used when setting environment variables at a Windows command line (we kept the space syntax for maximum compatibility with the MPICH MPI stack from ANL).
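
    For your GigE test, method #3 would look something like the line below (the executable name and core count are taken from your 16-proc case; adjust as needed):
          job submit /numcores:16 mpiexec -env MPICH_NETMASK 192.168.60.0/255.255.255.0 sotoncaa.exe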

    Does that help? 

    Thanks,
    Eric



    Eric Lantz (Microsoft)
    • Marked as answer by Don Pattee Thursday, March 26, 2009 12:48 AM
    Wednesday, October 1, 2008 9:23 PM