I have been running into a frustrating, erratic delay in job execution. The same job that normally takes 0.5-7 secs will sometimes take 21-22 secs. The job is a simple matrix inversion (for testing) that takes 4 input parameters at the command line
and writes out about a dozen result values. The job has been submitted both through the cluster manager and from a C# console exe program, with the same results for both methods of job scheduling.
The cluster has multiple nodes with multiple cores, but this is the only job being run. The job is run only one at a time for testing (no queue). For 4-6 runs in a row the program finishes in the normal 0.5-7 secs range, and then, seemingly at random, a job will
take 21-22 secs (pretty consistently 21 secs). The job uses only one core on one node. In all cases the job finishes successfully; the only problem is the run time.
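
For anyone who wants to reproduce the pattern, a timing loop along these lines could be used to log each run and flag the slow ones. This is only a sketch under assumptions: the command shown is a placeholder (the real job would be whatever submits the matrix inversion and waits for it to finish), the function names are made up for illustration, and the "3x the median" threshold is an arbitrary choice, not anything HPC defines.

```python
import statistics
import subprocess
import sys
import time


def time_runs(cmd, runs=10):
    """Run cmd sequentially `runs` times; return wall-clock seconds per run."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        # check=True raises if the job exits with a non-zero code
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return timings


def find_outliers(timings, factor=3.0):
    """Return (run_index, seconds) for runs slower than factor * median."""
    median = statistics.median(timings)
    return [(i, t) for i, t in enumerate(timings) if t > factor * median]


if __name__ == "__main__":
    # Placeholder command (assumption): replace with the actual
    # submit-and-wait command for the job being tested.
    cmd = [sys.executable, "-c", "pass"]
    timings = time_runs(cmd, runs=5)
    for i, t in enumerate(timings):
        print(f"run {i}: {t:.2f} s")
    print("slow runs:", find_outliers(timings))
```

Logging the run index together with the time should show whether the slow runs fall at a regular interval (every Nth job, or every so many minutes) or are truly random.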
I am using HPC 2008 R2 SP1 on a blade system with both a private and an enterprise network. Is there something going on in HPC that periodically causes a random job to take longer to execute? Is a 20-21 sec delay significant? I have tried changing various job scheduling parameters
(isExclusive, etc.) and nothing seems to have an effect.
Any comments or ideas would be appreciated. Thanks. Kurt