Finished Job still reserving a node

  • Question

  • I have a 10 core HPC 2008 cluster that I am testing for use with specific software and it works great.  Much more powerful than CCS2003.  I have however run into an odd behavior that I can't seem to get around: 

    A large multi-day job is currently running, and running fine.  I wanted to test submitting a smaller, high-priority job that would run quickly and then leave the long job to continue on its way.  This worked great, but after the small job completed it seems to have held a single core (its minimum!?) in reserve and will not let the long job go back to using this core.  I ran the small job again out of curiosity and the second run locked up another core.  So right now I have an entire dual-core compute node that I cannot use for the long multi-day job.

    Any assistance here would be greatly appreciated, as resources for HPC2008 are still spotty due to its "beta-ness" and I have not been able to find any mention of this elsewhere.  I hope I am just dense and missed something, but I would appreciate any help I may get.  Thanks!

    Dave

    Monday, June 16, 2008 8:58 PM

Answers

  •  

    Calvin4444,

    Please file a bug using the Feedback link on the left side of the Microsoft Connect site, and provide as many details as you can, especially what you mean by "Locking up a core" . . . does this mean the job is finished with its "Current Allocation" above 0?  With tasks still in the Running state?

     

    Thanks,
    Josh

    Monday, June 23, 2008 6:33 PM

All replies

  •  

    Dave,

    Seems like you've found a scheduler bug :)  I believe this is actually a known issue, but I'd appreciate it if you could open a feedback item on it.

     

    To do so:

    Go to http://connect.microsoft.com

    Go to the Microsoft Windows HPC Server 2008 Beta site

    Select "Feedback" in the left-hand navigation bar

    Click the big green submit feedback item

     

    Post the details (which you provided above), and if possible, include exported XML for all jobs involved.  It would also help if you could provide us with some scripts/other information, which can be done quite easily if you run the following PowerShell script and send us the folder that it creates (or at the very least provide the log files from "%CCP_HOME%\Data\Logfiles"):

     

    Code Snippet

    #Some location information
    $OutputDirName = "ClusterConfig"
    $NetworkInfoFile = "$OutputDirName\NetworkInfo.txt"
    $NodeInfoFile = "$OutputDirName\NodeInfo.txt"
    $HpcLogDir = "$OutputDirName\HpcLogs"
    $LogDir = "$OutputDirName\Logs"

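    #Create directories in which to stash everything
    Echo "Creating directories . . ."
    New-Item -Name $OutputDirName -ItemType Directory
    #The event log exports further down need this folder to exist already
    New-Item -Path $LogDir -ItemType Directory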

    #Get system information
    Echo "Getting system info . . ."
    msinfo32 /report "$OutputDirName\SysInfo.txt"

    #Dump the Network Information to a File
    Echo "Dumping network configuration . . ."
    "Network Topology:" > $NetworkInfoFile
    Get-HpcNetworkTopology >> $NetworkInfoFile
    "" >> $NetworkInfoFile
    "Network Interfaces:" >> $NetworkInfoFile
    Get-HpcNetworkInterface | Format-List >> $NetworkInfoFile

    #Dump the Node Information to a File
    Echo "Dumping node info . . ."
    Get-HpcNode | sort NetBiosName | Format-List >> $NodeInfoFile

    #Copy over the log files
    Echo "Copying HPC logs . . ."
    robocopy $env:CCP_DATA\Logfiles $HpcLogDir /E

    #Get Event Logs
    Echo "Copying system logs . . ."
    wevtutil epl System "$LogDir\System.evtx"
    Echo "Copying application logs . . ."
    wevtutil epl Application "$LogDir\Application.evtx"
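
    The request above for exported XML of the jobs involved could be scripted along these lines. This is only a sketch, assuming it runs in the HPC PowerShell console and that the Export-HpcJob cmdlet is available in your build with -Id and -Path parameters; the function names, the job IDs, and the "ClusterConfig\Jobs" folder are hypothetical placeholders, not something from this thread.

```powershell
# Assumption: run from the HPC PowerShell console so the HPC cmdlets are loaded.

# Build the output file name for one job, e.g. <Directory>\Job101.xml
# (hypothetical helper, introduced only for this sketch)
function Get-JobExportPath([string]$Directory, [int]$JobId) {
    Join-Path $Directory "Job${JobId}.xml"
}

# Export each listed job to its own XML file inside $Directory
# (hypothetical helper; Export-HpcJob usage is an assumption, see above)
function Export-JobsForFeedback([int[]]$Ids, [string]$Directory) {
    # -Force makes this a no-op if the folder already exists
    New-Item -Path $Directory -ItemType Directory -Force | Out-Null
    foreach ($Id in $Ids) {
        Export-HpcJob -Id $Id -Path (Get-JobExportPath $Directory $Id)
    }
}
```

    With placeholder IDs, a call would look like: Export-JobsForFeedback 101,102,103 "ClusterConfig\Jobs". Substitute the IDs of the long job and of each short job that kept a core.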

     

    Thanks!

    Josh

    Monday, June 16, 2008 10:00 PM
  •  

    BTW . . . can you confirm that you've tried canceling the job, and what state the jobs (and their tasks) are in?
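
    One way to collect that state information is sketched below. It assumes the HPC PowerShell console and that Get-HpcJob accepts -Id and Get-HpcTask accepts -JobId; the function name and the job ID in the usage line are hypothetical. It dumps every property rather than naming specific ones, since the exact property names are not confirmed here.

```powershell
# Assumption: run from the HPC PowerShell console so the HPC cmdlets are loaded.

# Print everything the scheduler reports for one job and for its tasks
# (hypothetical helper, introduced only for this sketch)
function Show-JobAndTaskState([int]$JobId) {
    # All properties of the job itself
    Get-HpcJob -Id $JobId | Format-List *
    # All properties of each task that belongs to the job
    Get-HpcTask -JobId $JobId | Format-List *
}
```

    For example, Show-JobAndTaskState 102 > JobState.txt would save the output for a job with the placeholder ID 102 so it can be attached to the feedback item.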

     

    Thanks,
    Josh

    Monday, June 16, 2008 10:09 PM
  • The two jobs that seem to have each locked up a compute core both finished successfully.  I did not cancel them while they were running; I let them run naturally to completion.  They were very quick.  I think I am going to cancel the long job and restart it (it is just there for this sort of testing anyway).  Before I do that, let me re-run the small job and cancel it midway and see what it does...

    Well, not surprising, but informative.  Canceling the job does not change the net effect.  I canceled the short job mid-run and it is still locking up a core.  So now I am down to 7 for the long job.  This is what I wanted to test!

    Thanks for the quick feedback.  I will attempt to report the bug as best I can.
    Tuesday, June 17, 2008 3:20 PM