
Answered: mpiexec -trace - can't overwrite file?

  • Friday, February 12, 2010 19:08
     
     
    Hi folks -

    I'm trying to trace an MPI app. The command line I'm using is:

    PS C:\> job submit /jobname:MyApp /nodegroup:ComputeNodes /numnodes:16 mpiexec -trace \\headnode\shared\myapp {arguments}

    The job returns:

    Exit Code                       : -1
    Error Message                   : Task failed during execution with exit code -1
    . Please check task's output for error details.
    Output                          :
    Aborting: failed to start tracing on COMPUTENODE01
    Error (183) Cannot create a file when that file already exists.

    I've also tried specifying the -wdir on a shared writable drive, and:

    PS C:\> job submit /jobname:MyApp /nodegroup:ComputeNodes /numnodes:16 mpiexec -trace -tf %userprofile%\%computername%_%CCP_JOBID%.%CCP_TASKID%.%CCP_TASKINSTANCEID%.etl \\headnode\shared\myapp

    which returns the same error. However, I've checked the profile directories and there are no .etl files. The compute nodes are dual quad-cores, so I'm assuming MPI is launching 8 instances, but shouldn't it be tracing from only one? It shouldn't be trying to write to the file from 8 sources at once, should it? It's especially confusing because it worked fine on a lab cluster with 8 dual-core servers (the lab cluster is running the Beta R2, though).

    Open to any ideas.  Thanks. 

All replies

  • Sunday, February 14, 2010 09:29
     
     
    OK, so here's what I'm seeing: any time a job is submitted but doesn't complete (invalid command string, missing data, or the job was canceled halfway through), the %userprofile%\mpi_trace_(jobnumber).(tasknumber).x.etl file doesn't get deleted. Worse, the file stays locked open, so I can't delete it either through clusrun or by RDP'ing onto the server. I've also tried restarting the msmpi service, with no luck. Restarting the node group seemed to help the first time, but when I reran the job it hit the same problem. Restarting the node group after every failed job when tracing is desired is obviously less than helpful...

    What am I missing here? Why is the trace file staying locked open? Maybe more importantly, why is mpiexec -trace trying to use the SAME trace file, rather than creating mpi_trace_(jobnumber+1).(tasknumber).x.etl for the next job submitted? I'm definitely confused on this one, especially since it seems to work fine under Beta R2.

    Thanks -
  • Monday, February 15, 2010 09:29
     
     Answered
    You are not missing anything. When a job gets canceled, the smpd's don't get a chance to clean up the trace session (it's not really about the file; msmpi is actually using another one by default).

    Run:
    clusrun logman stop msmpi -ets

    to stop the running session. You should then be able to run msmpi with tracing enabled in the next job.
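
As a rough sketch of that cleanup, the helper below builds the clusrun command lines and, by default, only prints them: one to list the event trace sessions still active on the nodes (`logman query -ets`) and one to stop the leftover session (`-ets` is the logman switch that addresses event trace sessions directly). The session name msmpi comes from the command above. The function names and the dry-run flag are made up for illustration, and the sketch assumes clusrun and logman are on the PATH of the machine it runs on.

```python
import subprocess

SESSION = "msmpi"  # trace session name, taken from the command above


def cleanup_commands(session=SESSION):
    """Build the command lines (hypothetical helper, not part of HPC Pack)."""
    return [
        # List the event trace sessions currently running on each node.
        ["clusrun", "logman", "query", "-ets"],
        # Stop the session that the canceled job left behind.
        ["clusrun", "logman", "stop", session, "-ets"],
    ]


def run_cleanup(session=SESSION, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    lines = []
    for cmd in cleanup_commands(session):
        line = " ".join(cmd)
        print(line)
        lines.append(line)
        if not dry_run:
            # check=False: a non-zero exit code does not raise an exception,
            # so the second command still runs if the first one fails.
            subprocess.run(cmd, check=False)
    return lines


if __name__ == "__main__":
    run_cleanup(dry_run=True)  # prints the two commands without running them
```

Calling `run_cleanup(dry_run=False)` would actually execute both commands; it is probably worth looking at the query output first to confirm the session is still listed before stopping it.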

    .Erez
    • Marked as answer by the3dge Friday, March 26, 2010 00:08