Friday, March 18, 2011 4:58 AM
I have been experiencing an erratic and frustrating job execution delay. The same job that normally takes 0.5-7 secs will sometimes take 21-22 secs. The job is a simple matrix inversion (for testing) that takes 4 input parameters on the command line and writes out about a dozen result values. I have submitted the job both through the cluster manager and from a C# console program, with the same results either way.
The cluster has multiple nodes with multiple cores, but this is the only job being run, one at a time for testing (no queue). For 4-6 runs in a row the program finishes in the normal 0.5-7 secs range, and then, at random, a run takes 21-22 secs (pretty consistently 21 secs). The job uses only one core on one node. In all cases the job finishes successfully; the only problem is the run time.
I am using HPC 2008 R2 SP1 on a blade system with both a private and an enterprise network. Is there something going on in HPC that periodically causes a random job to take longer to execute? Is a 20-21 sec delay significant? I tried changing various job scheduling parameters (isExclusive, etc.) and nothing seems to have any effect.
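To pin down how often the slow runs occur, the pattern described above could be measured with a small timing harness. This is only a sketch, written in Python rather than the C# program mentioned in the post: the function names, the 15-second threshold (chosen to fall between the 0.5-7 sec and 21-22 sec ranges), and the example command are all assumptions for illustration.

```python
import subprocess
import time


def time_runs(cmd, runs=20):
    """Run `cmd` sequentially `runs` times and return wall-clock durations in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        # check=True raises if the job exits with a non-zero status.
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        durations.append(time.perf_counter() - start)
    return durations


def flag_slow(durations, threshold=15.0):
    """Return (run_index, seconds) pairs for runs at or above `threshold` seconds."""
    return [(i, d) for i, d in enumerate(durations) if d >= threshold]


# Example use (hypothetical executable name and arguments):
#   durations = time_runs(["invert.exe", "1", "2", "3", "4"], runs=30)
#   print(flag_slow(durations))
```

Logging the index of each slow run alongside a timestamp makes it easier to see whether the delays are periodic or tied to a particular node.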
Any comment or idea would be appreciated. Thanks. Kurt
Tuesday, March 22, 2011 1:43 AM
Update on the delay issue. I set up another small cluster (4 compute nodes) using a different network topology. Previously it was topo 2 (private and enterprise networks). The new one is topo 5 (enterprise only). The system was set up similarly, but a few roles were not needed since it uses the enterprise network only. I still used HPC 2008 R2 SP1 on all compute nodes and the head node.
There have been no delays on any compute node with the new topo 5 setup. I still do not know the source of the delay on the first cluster; maybe an installation problem(?). For my application, topo 5 will work (no MPI data passing), so I am just going to stay with the enterprise-only network.
Tuesday, May 10, 2011 1:35 AM
The only cause I can think of for the delay is signing/certificate verification. Does your application require a cert? If so, during execution the cert manager may try to validate the cert over the internet, and somehow the check times out.
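One way to test the timeout theory from a compute node is to time an outbound TCP connection attempt and see whether it stalls. As far as I know, a blocked connection attempt on Windows of that generation gives up after roughly 21 seconds by default (an initial 3-second wait, doubled over two retries: 3 + 6 + 12), which would match the reported delay, but treat that as a hypothesis rather than a diagnosis. The sketch below is in Python; the function name is made up for illustration, and the host and port to probe are for you to choose (none are given in this thread).

```python
import socket
import time


def time_connect(host, port, timeout=30.0):
    """Attempt one TCP connection and return (succeeded, elapsed_seconds)."""
    start = time.perf_counter()
    try:
        # create_connection resolves the host and opens a TCP connection.
        with socket.create_connection((host, port), timeout=timeout):
            ok = True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        ok = False
    return ok, time.perf_counter() - start


# Example use (hypothetical host; substitute a server the node would need to reach):
#   print(time_connect("example.com", 80))
```

If the attempt fails only after about 21 seconds on the private/enterprise cluster but returns quickly on the enterprise-only one, that would point toward blocked outbound traffic rather than HPC itself.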