June 16, 2008 20:58
I have a 10-core HPC 2008 cluster that I am testing for use with specific software, and it works great. It is much more powerful than CCS 2003. I have, however, run into an odd behavior that I can't seem to get around:
A large multi-day job is currently running, and running fine. I wanted to test submitting a smaller, high-priority job that would run quickly and then leave the long job to continue on its way. This worked great, but after the small job completed it seems to have held a single core (its minimum!?) in reserve, and it will not let the long job go back to using this core. I ran the small job again out of curiosity, and the second run locked up another core. So right now I have an entire dual-core compute node that I cannot use for the long multi-day job.
Any assistance here would be greatly appreciated, as resources for HPC 2008 are still spotty due to its "beta-ness" and I have not been able to find any mention of this elsewhere. I hope I am just dense and missed something, but I would appreciate any help I can get. Thanks!
Seems like you've found a scheduler bug. I believe this is actually a known issue, but I'd appreciate it if you could open a feedback item on it.
To do so:
Go to the Microsoft Windows HPC Server 2008 Beta site
Select "Feedback" in the left-hand navigation bar
Click the big green submit feedback item
Post the details (which you provided above) and, if possible, include exported XML for all jobs that are involved. It would also help if you could provide us with some additional information about the cluster. The easiest way is to run the following PowerShell script and send us the folder that it creates (or at the very least provide the log files from "%CCP_HOME%\Data\Logfiles"):
#Some location information
$OutputDirName = "ClusterConfig"
$NetworkInfoFile = "$OutputDirName\NetworkInfo.txt"
$NodeInfoFile = "$OutputDirName\NodeInfo.txt"
$HpcLogDir = "$OutputDirName\HpcLogs"
$LogDir = "$OutputDirName\Logs"
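#Create the directories in which to stash everything
Echo "Creating directories . . ."
New-Item -Name $OutputDirName -ItemType Directory
#Create the event log folder up front, since wevtutil is not expected to create a missing folder
New-Item -Path $LogDir -ItemType Directory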
#Get system information
Echo "Getting system info . . ."
msinfo32 /report "$OutputDirName\SysInfo.txt"
#Dump the Network Information to a File
Echo "Dumping network configuration . . ."
"Network Topology:" > $NetworkInfoFile
Get-HpcNetworkTopology >> $NetworkInfoFile
"" >> $NetworkInfoFile
"Network Interfaces:" >> $NetworkInfoFile
Get-HpcNetworkInterface | Format-List >> $NetworkInfoFile
#Dump the Node Information to a File
Echo "Dumping node info . . ."
Get-HpcNode | Sort-Object NetBiosName | Format-List >> $NodeInfoFile
#Copy over the log files
Echo "Copying HPC logs . . ."
robocopy $env:CCP_DATA\Logfiles $HpcLogDir /E
#Get Event Logs
Echo "Copying system logs . . ."
wevtutil epl System "$LogDir\System.evtx"
Echo "Copying application logs . . ."
wevtutil epl Application "$LogDir\Application.evtx"
BTW . . . can you confirm that you've tried canceling the job, and what state the jobs (and their tasks) are in?
June 17, 2008 15:20
The two jobs that seem to have each locked up a compute core both finished successfully. I did not cancel them while they were running; I let them run naturally to completion. They were very quick. I think I am going to cancel the long job and restart it (it is just there for this sort of testing anyway). Before I do that, let me re-run the small job and cancel it midway to see what it does...
Well, not surprising, but informative. Canceling the job does not change the net effect. I canceled the short job mid-run, and it is still locking up a core. So now I am down to 7 cores for the long job. This is what I wanted to test!
Thanks for the quick feedback. I will attempt to report the bug as best I can.
Please file a bug using the Feedback link on the left side of the Microsoft Connect site, and provide as many details as you can, especially what you mean by "locking up a core" . . . does this mean the job is finished with its "Current Allocation" above 0? With tasks still in the Running state?
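For anyone who wants to answer that question from the shell, here is a minimal sketch of the check being described: find jobs that are in a terminal state but still report cores allocated. It is only an illustration. The function name Find-StuckJob, the property names (Id, State, CurrentAllocation), and the sample records are all assumptions made up for this example and are not taken from the HPC cmdlets; the sample values simply mirror the situation described in this thread. On a real cluster you would feed in whatever job objects the scheduler returns and adjust the property names to match.

```powershell
# Hypothetical helper: returns jobs that are done but still hold cores.
# The property names State and CurrentAllocation are assumptions for illustration.
function Find-StuckJob {
    param([object[]]$Jobs)
    # States treated as terminal in this sketch
    $terminal = 'Finished', 'Canceled', 'Failed'
    $Jobs | Where-Object { ($terminal -contains $_.State) -and ($_.CurrentAllocation -gt 0) }
}

# Made-up sample records mirroring the thread: one long job, three short ones
$sampleJobs = @(
    New-Object PSObject -Property @{ Id = 1; State = 'Running';  CurrentAllocation = 7 }
    New-Object PSObject -Property @{ Id = 2; State = 'Finished'; CurrentAllocation = 1 }
    New-Object PSObject -Property @{ Id = 3; State = 'Finished'; CurrentAllocation = 1 }
    New-Object PSObject -Property @{ Id = 4; State = 'Canceled'; CurrentAllocation = 1 }
)

# List the suspicious jobs and total up the cores they appear to be holding
$stuck = @(Find-StuckJob -Jobs $sampleJobs)
$heldCores = ($stuck | Measure-Object -Property CurrentAllocation -Sum).Sum
$stuck | ForEach-Object { "Job $($_.Id) is $($_.State) but still holds $($_.CurrentAllocation) core(s)" }
"Cores held by completed jobs: $heldCores"
```

Including output like this in the feedback item, along with the state of each job's tasks, should make it clearer what "locking up a core" means in practice.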