Slowdown over IB using HPC 2012R2 and Ansys Fluent

  • Question

  • I have recently installed InfiniBand hardware in our small cluster, and while initial performance metrics through HPC Pack 2012 R2 (Update 3) are good (consistently low latency, high bandwidth), I am seeing some strange behaviour from our primary simulation tool, Ansys Fluent.

    As soon as I go over about 3 nodes (16 cores each), my simulation stalls every 10 or so iterations. The performance is great between these stalls, much faster than over GbE, for example. Over 8 nodes, the simulation proceeds with extreme speed, and then frustratingly will sit for 10 seconds or so doing nothing, before it picks up again. Over time, the overall simulation time is about the same as over GbE, purely due to these waiting periods. 

    I've looked at all the metrics I can think of through perfmon, and nothing stands out, although it's hard to know what to look for. During these stalls, there is 0 CPU activity across the cluster, no hard faults, no disk I/O or anything like that, and nothing unusual over either the Enterprise or Private (IB) networks. One clue is that the RDMA traffic goes to 0 during the stall, and there's a slight increase in GbE traffic, although it's barely perceptible and may just be noise.

    Does anyone have any ideas of what may be causing this? ANSYS support are out of ideas, and I have tried pretty much every possible firmware and driver combination available. They claim no one has had this issue, but I think the HPC users of Fluent running on Windows could probably be counted on both hands. We have a Linux cluster running the same hardware that doesn't have these issues.

    I'm very curious to know if anyone has seen this kind of behaviour before. If not, perhaps the symptoms suggest something specific I should check?
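
To put numbers on the stalls described above, one option is to record a wall-clock timestamp at the end of each iteration and flag gaps that are far longer than the typical iteration time. The following is only a minimal sketch: the function name `find_stalls`, the `factor` threshold and the sample timestamps are illustrative assumptions, not something taken from Fluent's output.

```python
from statistics import median


def find_stalls(timestamps, factor=5.0):
    """Return (iteration_index, gap_seconds) for unusually long gaps.

    timestamps: wall-clock times (seconds) at the end of each iteration.
    factor: hypothetical threshold; a gap counts as a stall when it is
            more than `factor` times the median gap.
    """
    # Time elapsed between consecutive iterations.
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    if not gaps:
        return []
    typical = median(gaps)
    # Index i + 1 is the iteration that finished late.
    return [(i + 1, gap) for i, gap in enumerate(gaps) if gap > factor * typical]


# Illustrative data: roughly 1 s per iteration with one 10 s pause.
print(find_stalls([0, 1, 2, 3, 13, 14, 15]))  # [(4, 10)]
```

Comparing the flagged times against perfmon counters from the same period may make it easier to see what else changes when a stall begins.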

    Wednesday, May 9, 2018 10:30 AM

All replies

  • I've forwarded the issue to our MSMPI team.

    In the meantime, you could check whether this is specific to Ansys Fluent or whether other MPI applications have the same issue, for example by running a Linpack benchmark. That would isolate whether it is an application issue or a platform issue.

    A second thing to try is a different Ansys Fluent workload, to see whether the issue is workload specific.

    Also note that HPC Pack supports Linux.

    Qiufang Shi

    Friday, May 11, 2018 3:20 AM
  • Thank you Qiufang.

    The problem is not Fluent workload specific; it occurs for all case types.

    As far as I can tell, it also occurs for Linpack (I used Lizard). 

    I have since re-imaged the entire cluster, and the problem disappeared, but not for long. I have noticed that if I restart the OpenSM service, the problem goes away. Over about 12-24 hours, it re-occurs, until the service is restarted. This is the version of opensm that is provided with Mellanox's Win-OF 5.35 driver, the latest available for my hardware. A scan of the subnet doesn't report any unusual errors or problems, at any time. 
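
Since restarting the OpenSM service clears the problem for a while, a stopgap until the root cause is found could be to script that restart and run it between jobs. The sketch below is hedged accordingly: it relies on the standard Windows `net stop` and `net start` commands, and the service name `OpenSM` is an assumption that should be checked against the name the driver package actually registers. Restarting the subnet manager while a job is running may disturb it, so this is not meant for use mid-simulation.

```python
import subprocess

# Assumption: the real service name may differ; check it in services.msc.
SERVICE_NAME = "OpenSM"


def restart_commands(service=SERVICE_NAME):
    """Build the Windows commands that stop and then start a service."""
    return [["net", "stop", service], ["net", "start", service]]


def restart_service(service=SERVICE_NAME, run=subprocess.run):
    """Restart a Windows service; needs an elevated prompt on the head node.

    `run` is injectable so the command sequence can be checked without
    touching a real service.
    """
    stop_cmd, start_cmd = restart_commands(service)
    # Stopping fails harmlessly if the service is not running, so ignore errors.
    run(stop_cmd, check=False)
    # A failed start is a real problem, so let it raise.
    run(start_cmd, check=True)


# Example (on the machine that hosts the subnet manager):
# restart_service()
```

Calling `restart_service()` from a scheduled task would automate the manual restart, although it only hides the underlying issue.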

    Kind regards,

    Christian Rohr. 

    Wednesday, May 16, 2018 8:38 AM
  • From your description, it sounds like a network driver or network issue. Have you checked with Mellanox for known issues?

    Qiufang Shi

    Friday, May 18, 2018 8:02 AM