which returns the same thing. However, I've checked the profile directories and there are no .etl files. The compute nodes are dual quad-core machines, so I'm assuming MPI is launching 8 instances, but shouldn't it be tracing from only one? It shouldn't be trying to write to the file from 8 sources at once, should it? It's especially confusing because it worked fine on a lab cluster with 8 dual-core servers (although the lab cluster is running Beta R2).
OK, so here's what I'm seeing: any time a job gets submitted but doesn't complete (invalid command string, missing data, or the job was canceled halfway through), the %userprofile%\mpi_trace_(jobnumber).(tasknumber).x.etl file doesn't get deleted. More than that, the file stays locked open, so I can't delete it either through clusrun or by RDP'ing onto the server. I've also tried restarting the msmpi service, with no luck. Restarting the node group seemed to help the first time, but when I reran the job it hit the same problem. Restarting the node group after every failed job whenever tracing is desired is obviously less than helpful...
What am I missing here? Why is the trace file staying locked open? Maybe more importantly, why is mpiexec -trace trying to use the SAME trace file, rather than creating mpi_trace_(jobnumber+1).(tasknumber).x.etl for the next job submitted? I'm definitely confused on this one, especially since it seems to work fine under Beta R2.
You are not missing anything. When a job gets canceled, the smpd's don't get a chance to clean up the trace session (it's not really about the file; msmpi is actually using another one by default).
Run: clusrun logman stop msmpi -ets
to stop the running session. You should then be able to run msmpi with tracing enabled in the next job.
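If this cleanup has to happen after every failed or canceled job, it can be wrapped in a small script. The following is only a minimal sketch, assuming Python is available on the head node and that clusrun and logman are on the PATH. The session name msmpi comes from the command above, and -ets is logman's switch for acting on event trace sessions directly. The function names and the use_clusrun parameter are made up for illustration.

```python
import subprocess

# Session name taken from the reply above; adjust if your setup differs.
SESSION = "msmpi"


def stop_trace_command(session=SESSION, use_clusrun=True):
    """Build the command line that stops a leftover event trace session."""
    cmd = ["logman", "stop", session, "-ets"]
    # Prefix with clusrun, as in the reply above, to run it across the nodes.
    return ["clusrun"] + cmd if use_clusrun else cmd


def query_trace_command(session=SESSION, use_clusrun=True):
    """Build the command line that shows whether the session is still active."""
    cmd = ["logman", "query", session, "-ets"]
    return ["clusrun"] + cmd if use_clusrun else cmd


def cleanup(run=subprocess.run):
    """Run the stop command and return its exit code.

    `run` is injectable so the function can be exercised without a cluster.
    """
    result = run(stop_trace_command(), check=False)
    return result.returncode

# Example (on the head node, after a canceled job):
#     cleanup()
```

Running the query command first is a cheap way to confirm that a stale session is actually what is blocking the next traced job before stopping anything.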
.Erez
Marked as answer by the3dge, Friday, March 26, 2010 00:08