which returns the same thing. Except I've checked the profile directories and there are no .etl files. The compute nodes are dual quad-cores, so I'm assuming MPI is launching 8 instances, but shouldn't it only be tracing from one? It shouldn't be trying to write to the file from 8 sources at once, should it? It's especially confusing because it worked fine on a lab cluster with 8 dual-core servers (the lab cluster is on Beta R2, though).
OK, so here's what I'm seeing: any time the job gets submitted but doesn't complete (invalid command string, missing data, or the job was canceled halfway through), the %userprofile%\mpi_trace_(jobnumber).(tasknumber).x.etl file doesn't get deleted. More than that, the file stays locked open, so I can't delete it either through clusrun or by RDP'ing onto the server. I've also tried restarting the msmpi service, no luck. I restarted the node group, and that seemed to help the first time, but when I reran the job it hit the same problem. Restarting the node group after every failed job when tracing is desired is obviously a little less than helpful...
What am I missing here? Why is the trace file staying locked open? Maybe more importantly, why is mpiexec -trace trying to use the SAME trace file, rather than creating mpi_trace_(jobnumber+1).(tasknumber).x.etl for the next job submitted? I'm definitely confused on this one, especially since it seems to work fine under Beta R2.
You are not missing anything. When a job gets canceled, the smpd's don't get a chance to clean up the trace session (it's not really about the file; msmpi is actually using another one by default).
Run: clusrun logman stop msmpi -ets
to stop the running session. You should then be able to run msmpi with tracing enabled in the next job.
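If this cleanup has to happen after every canceled or failed job, it could be scripted. The following is only a rough sketch, not part of HPC Pack: the helper names are made up for illustration, it assumes the leftover session is named msmpi as in the command above, and it assumes clusrun and logman are on the PATH of the machine running it. Note that logman's switch for acting on event trace sessions directly is -ets.

```python
import subprocess

# Session name taken from the reply above; assumed to be the MS-MPI default.
SESSION = "msmpi"


def query_trace_command():
    """Command line that lists the running event trace sessions on the nodes."""
    return ["clusrun", "logman", "query", "-ets"]


def stop_trace_command(session=SESSION):
    """Command line that stops a leftover trace session on the nodes."""
    return ["clusrun", "logman", "stop", session, "-ets"]


def cleanup_stale_session(run=subprocess.run):
    """List the running sessions, then stop the stale one.

    `run` is injectable so the sequence can be checked without a cluster.
    check=False: a node with no leftover session is not treated as an error.
    """
    run(query_trace_command(), check=False)
    return run(stop_trace_command(), check=False)


if __name__ == "__main__":
    cleanup_stale_session()
```

Running this from the head node after a canceled job should have the same effect as typing the clusrun command by hand; the query step is only there so you can see which sessions were still open.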
.Erez
Marked As Answer by the3dge, Friday, March 26, 2010 12:08 AM