Hi,
I use MS RDP to connect to the headnode and use the compute management MMC to manage a compute cluster of two compute nodes plus one headnode. However, after operating for a while, the headnode becomes unreachable and my MMC disconnects. Then, after about 5 minutes, the headnode becomes reachable again. Can anyone advise what went wrong?
I have three networks:
Public network
MPI network
Private network
1. What should the network binding order be? Currently, mine is public, MPI, and private. Could this be the cause of the problem?
2. All three of my nodes (1 HN and 2 CN) have 8 cores each, for a total of 24 cores. When I submit a job and choose 16 cores, everything runs in less than 1 minute. When it goes beyond 16 cores, it seems to run forever. Is there any way to tell where the compute cluster is hung?