mpiexec -trace - can't overwrite file?

  • Question

  • Hi folks -

    I'm trying to trace an MPI app.  The command line I'm using is:

    PS C:\> job submit /jobname:MyApp /nodegroup:ComputeNodes /numnodes:16 mpiexec -trace \\headnode\shared\myapp {arguments}

    The job returns:

    Exit Code                       : -1
    Error Message                   : Task failed during execution with exit code -1. Please check task's output for error details.
    Output                          :
    Aborting: failed to start tracing on COMPUTENODE01
    Error (183) Cannot create a file when that file already exists.

    I've also tried specifying -wdir on a shared writable drive, and:

    PS C:\> job submit /jobname:MyApp /nodegroup:ComputeNodes /numnodes:16 mpiexec -trace -tf %userprofile%\%computername%_%CCP_JOBID%.%CCP_TASKID%.%CCP_TASKINSTANCEID%.etl \\headnode\shared\myapp

    which returns the same thing.  Except I've checked the profile directories - there are no .etl files.  The compute nodes are dual quad-core, so I'm assuming MPI is launching 8 instances, but shouldn't it be tracing from only one?  It shouldn't be trying to write to the file from 8 sources at once, should it?  It's especially confusing because it worked fine on a lab cluster with 8 dual-core servers (the lab cluster is running the R2 Beta, though). 

    Open to any ideas.  Thanks. 
    Friday, February 12, 2010 7:08 PM
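
    For context, Win32 error 183 is ERROR_ALREADY_EXISTS, so the failure is about something trace-related already existing on the compute node rather than about the output path as such.  A quick way to see what is actually there - assuming the ComputeNodes node group from the command above - is to query the nodes for active event trace sessions:

    clusrun /nodegroup:ComputeNodes logman query -ets

    A leftover msmpi session showing up in that list would explain why a new trace can't be started.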

Answers

  • You are not missing anything. When a job gets canceled, the smpd processes don't get a chance to clean up the trace session (it's not really about the file - msmpi is actually using another one by default).

    Run:
    clusrun logman stop msmpi -ets

    to stop the running session. You should be able to run msmpi with tracing enabled in the next job.

    .Erez
    • Marked as answer by the3dge Friday, March 26, 2010 12:08 AM
    Monday, February 15, 2010 9:29 AM
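
    Putting the answer together with the original command, a rough recovery sequence - reusing the ComputeNodes node group and paths from the first post - would look like:

    PS C:\> clusrun /nodegroup:ComputeNodes logman stop msmpi -ets
    PS C:\> job submit /jobname:MyApp /nodegroup:ComputeNodes /numnodes:16 mpiexec -trace \\headnode\shared\myapp {arguments}

    The first line stops the stale msmpi event trace session on every compute node; the second resubmits the traced job.  Running logman query -ets on the nodes afterwards should show no msmpi entry if the stop succeeded.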

All replies

  • Ok, so here's what I'm seeing: any time the job gets submitted but doesn't complete (invalid command string, missing data, or the job was canceled halfway through), the %userprofile%\mpi_trace_(jobnumber).(tasknumber).x.etl file doesn't get deleted.  More than that, the file stays locked open, so I can't delete it either through clusrun or by RDP'ing onto the server.  I've also tried restarting the msmpi service - no luck.  Restarting the node group seemed to help the first time out, but when I reran the job it hit the same problem.  Restarting the node group after every failed job when tracing is desired is obviously a little less than helpful...

    What am I missing here?  Why is the trace file staying locked open?  Maybe more importantly, why is mpiexec -trace trying to use the SAME trace file, rather than creating mpi_trace_(jobnumber+1).(tasknumber).x.etl for the next job submitted?  I'm definitely confused on this one, especially since it seems to work fine under the R2 Beta.

    Thanks -
    Sunday, February 14, 2010 9:29 AM