WHPCS2008 MPI diagnostics failing randomly<br/>Hi, <div><br/></div> <div>I'm trying to help a customer who is running WHPCS 2008 on an SGI cluster of Altix XE310 servers. WHPCS 2008 passes all the built-in diagnostic tests except the MPI ping-pong and lightweight throughput tests, which fail on random nodes. The cluster is set up with a private GigE network for management and an Infiniband (IB) network for MPI traffic. The IB network is connected to a Cisco SFS7000d switch, which is running the subnet manager. </div> <div>When running the MPI ping-pong and lightweight throughput tests, multiple nodes fail randomly. For example, when the tests are run on a subset of the cluster they work OK, but once the subset gets up to around 14 nodes, the test fails on a random node. (The cluster has about 20 nodes in total.) However, a &quot;clusrun&quot; of simple commands works OK across all nodes. The log output from the MPI tests is as follows:</div> <div><br/></div> <div> <p class=MsoPlainText>Time<span style="">      </span>Message</p> <p class=MsoPlainText>18/06/2009 12:32:06 PM<span style="">                </span>Reverted</p> <p class=MsoPlainText>18/06/2009 12:32:05 PM<span style="">                </span>The operation failed due to errors during execution.</p> <p class=MsoPlainText>18/06/2009 12:32:05 PM<span style="">                </span>The operation failed and will not be retried.</p> <p class=MsoPlainText>18/06/2009 12:32:05 PM<span style="">                </span>---- error analysis -----</p> <p class=MsoPlainText>18/06/2009 12:32:05 PM<span style="">                </span></p> <p class=MsoPlainText>18/06/2009 12:32:05 PM<span style="">                </span>mpi has detected a fatal error and aborted mpipingpong.exe</p> <p class=MsoPlainText>18/06/2009 12:32:05 PM<span style="">                </span>[5] on MICRO15</p> <p class=MsoPlainText>18/06/2009 12:32:05 PM<span style="">                </span></p> <p class=MsoPlainText>18/06/2009 12:32:05 PM<span style="">         
       </span>---- error analysis -----</p> <p class=MsoPlainText>18/06/2009 12:32:05 PM<span style="">                </span></p> <p class=MsoPlainText>18/06/2009 12:32:05 PM<span style="">                </span>[6-19] terminated</p> <p class=MsoPlainText>18/06/2009 12:32:05 PM<span style="">                </span></p> <p class=MsoPlainText>18/06/2009 12:32:05 PM<span style="">                </span>CH3_ND::CEndpoint::ConnReqFailed(407): [ch3:nd] INDConnector::Connect to 192.168.2.111:1 failed with 0x80070043</p> <p class=MsoPlainText>18/06/2009 12:32:05 PM<span style="">                </span>CH3_ND::CEndpoint::Connect(236)......:</p> <p class=MsoPlainText>18/06/2009 12:32:05 PM<span style="">                </span>CH3_ND::CEnvironment::Connect(400)...:</p> <p class=MsoPlainText>18/06/2009 12:32:05 PM<span style="">                </span>MPIDI_CH3I_VC_post_connect(426)......: MPIDI_CH3I_Nd_connect failed in VC_post_connect</p> <p class=MsoPlainText>18/06/2009 12:32:05 PM<span style="">                </span>MPIDI_CH3_iSendv(239)................:</p> <p class=MsoPlainText>18/06/2009 12:32:05 PM<span style="">                </span>MPIDI_EagerContigIsend(519)..........: failure occurred while attempting to send an eager message</p> <p class=MsoPlainText>18/06/2009 12:32:05 PM<span style="">                </span>MPIC_Sendrecv(120)...................:</p> <p class=MsoPlainText>18/06/2009 12:32:05 PM<span style="">                </span>MPIR_Allgather(487)..................:</p> <p class=MsoPlainText>18/06/2009 12:32:05 PM<span style="">                </span>MPI_Allgather(864)...................: MPI_Allgather(sbuf=0x000000000022F750, scount=128, MPI_CHAR, rbuf=0x0000000000B71100, rcount=128, MPI_CHAR, MPI_COMM_WORLD) failed</p> <p class=MsoPlainText>18/06/2009 12:32:05 PM<span style="">                </span>Fatal error in MPI_Allgather: Other MPI error, error stack:</p> <p class=MsoPlainText>18/06/2009 12:32:05 PM<span style="">                </span>[5] fatal 
error</p> <p class=MsoPlainText>18/06/2009 12:32:05 PM<span style="">                </span></p> <p class=MsoPlainText>18/06/2009 12:32:05 PM<span style="">                </span>[0-4] terminated</p> <p class=MsoPlainText>18/06/2009 12:32:05 PM<span style="">                </span></p> <p class=MsoPlainText>18/06/2009 12:32:05 PM<span style="">                </span>[ranks] message</p> <p class=MsoPlainText>18/06/2009 12:32:05 PM<span style="">             </span><span style="">   </span>job aborted:</p> <p class=MsoPlainText>18/06/2009 12:32:05 PM<span style="">                </span></p> <p class=MsoPlainText>18/06/2009 12:31:50 PM<span style="">                </span>Connecting to scheduler service on node micro.</p> <p class=MsoPlainText> </p> <p class=MsoPlainText>Has anyone seen this type of problem? Any suggestions for resolving it?</p> <p class=MsoPlainText>Thanks very much.</p> <p class=MsoPlainText>Regards,</p> <p class=MsoPlainText>David</p> </div>© 2009 Microsoft Corporation. All rights reserved.<br/>Fri, 17 Jul 2009 17:55:14 Z<br/>http://social.microsoft.com/Forums/en-US/windowshpcmpi/thread/fc113c32-0d71-46ce-8ca1-119252d042f4#fc113c32-0d71-46ce-8ca1-119252d042f4<br/>kellydavid (http://social.microsoft.com/Profile/en-US/?user=kellydavid)<br/>Fri, 19 Jun 2009 06:10:44 Z<br/><br/>http://social.microsoft.com/Forums/en-US/windowshpcmpi/thread/fc113c32-0d71-46ce-8ca1-119252d042f4#23be5e04-9b72-4757-84f7-5af2a934aa21<br/>Fab Tillier [MS] (http://social.microsoft.com/Profile/en-US/?user=Fab%20Tillier%20%5bMS%5d)<br/>Hi David,<br/><br/>The error you're getting looks like it is related to the IP-to-IB address translation in the IB drivers.  A few questions:<br/><br/>1. What version of the IB drivers are you running?<br/>2. Is the firmware up to date on both the switch and the HCAs?<br/><br/>Thanks,<br/>-Fab<br/>Wed, 24 Jun 2009 00:47:51 Z<br/><br/>http://social.microsoft.com/Forums/en-US/windowshpcmpi/thread/fc113c32-0d71-46ce-8ca1-119252d042f4#ad5e1af1-0168-4a37-855e-ba9bea1891b7<br/>kellydavid (http://social.microsoft.com/Profile/en-US/?user=kellydavid)<br/>Thanks Fab. 
<div> <p class=MsoPlainText>The software installed is WinOF_2_0_5, and according to Device Manager, the Driver Assembly Version is 4335.</p> <p class=MsoPlainText>I've also got &quot;vstat&quot; and &quot;ibdiagnet&quot; output if you'd like to see it - it's too big to include inline here.</p> <p class=MsoPlainText>Thanks for your help.</p> <p class=MsoPlainText>Regards,</p> <p class=MsoPlainText>David</p> </div>Thu, 25 Jun 2009 23:16:09 Z<br/><br/>http://social.microsoft.com/Forums/en-US/windowshpcmpi/thread/fc113c32-0d71-46ce-8ca1-119252d042f4#99493faf-3f8f-4c59-bebc-10570279c31d<br/>Fab Tillier [MS] (http://social.microsoft.com/Profile/en-US/?user=Fab%20Tillier%20%5bMS%5d)<br/><blockquote>The software installed is WinOF_2_0_5, and from device manager, the Driver Assembly Version is 4335.</blockquote> That looks like the latest. <blockquote>I've also got &quot;vstat&quot; and &quot;ibdiagnet&quot; output if you like - it's too big to include inline here.</blockquote> vstat output would be nice to see.  Can you ping between the machines?<br/><br/>Does the error only occur under stress?<br/><br/>We've seen problems with connectivity when firmware versions were out of date.  
While vstat shows the FW version on the HCA, I don't know what tool will show the FW version of the switch - perhaps you can log in to the switch and find out?<br/><br/>Thanks,<br/>-Fab<br/>Mon, 29 Jun 2009 00:52:05 Z<br/><br/>http://social.microsoft.com/Forums/en-US/windowshpcmpi/thread/fc113c32-0d71-46ce-8ca1-119252d042f4#36a382b0-8184-4db0-9924-8fe8ee86934e<br/>kellydavid (http://social.microsoft.com/Profile/en-US/?user=kellydavid)<br/>Hi Fab, <div>Our guys appear to have identified the problem: a bad Infiniband card in one of the servers that seems to be affecting the network as a whole (somehow). Disconnecting the IB card from the IB switch seems to fix the problem. They're going to replace the whole server. Thanks very much for your help. </div> <div><br/></div> <div>Regards,</div> <div><br/>David</div>Wed, 01 Jul 2009 00:07:54 Z
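<div><br/></div> <div>A note on the error code in the log at the top of the thread: the value 0x80070043 reported by INDConnector::Connect follows the standard Windows HRESULT layout (a severity bit, an 11-bit facility, and a 16-bit code). The sketch below splits it into those fields. The helper name decode_hresult and the one-entry lookup table are illustrative only and are not part of MS-MPI or WinOF; the table lists just the single Win32 code that appears in this thread.</div>

```python
FACILITY_WIN32 = 7  # facility value used when an HRESULT wraps a Win32 error

# Hand-written lookup for illustration; only the code seen in this thread.
WIN32_ERRORS = {
    67: "ERROR_BAD_NET_NAME (The network name cannot be found.)",
}


def decode_hresult(hresult: int) -> dict:
    """Split a 32-bit HRESULT into its severity, facility and code fields."""
    hresult &= 0xFFFFFFFF
    return {
        "failure": bool(hresult >> 31),       # top bit set means failure
        "facility": (hresult >> 16) & 0x7FF,  # 11-bit facility field
        "code": hresult & 0xFFFF,             # low 16 bits: the error code
    }


fields = decode_hresult(0x80070043)
print(fields)
if fields["facility"] == FACILITY_WIN32:
    print(WIN32_ERRORS.get(fields["code"], "Win32 error not in this table"))
```

<div>Here the facility is 7 (Win32) and the code is 67, the Win32 "network name cannot be found" error. That would at least be consistent with Fab's reading that the connect call failed while resolving the peer's address, though it does not by itself point to which node or card is at fault.</div>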