HPCJobScheduler crash with invalid job

Question
-
We have a serious problem with the job manager. We submitted a job using the "parametric sweep" link on the right of the HPCJobManager console. Somehow the job details seem to be corrupted, and the console crashes every time we try to look at them. We found that just viewing the job details crashes the "HPC Job Scheduler" service, which is why the console freezes. The "job view" command, given the erroneous job ID, also crashes the service. Sometimes we can view the job details for a short while, but trying to do anything to the job (cancel/modify) crashes the "HPCScheduler" service.
We looked at HPCScheduler.log and found many entries like the one below.
2008/05/13 09:30:01 [5][RC] [Error] Unexpected error when process message for job 369. Detail: Object reference not set to an instance of an object.
It seems like something in the database was corrupted. Is there any way to fix this?
For now we are ignoring the job completely: not clicking on it and not trying to look at its details. Note that the job state is "Running" but the task state is "Failed".
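To gauge how often the scheduler hits this error and for which jobs, the log can be summarized per job ID. The following is only a minimal sketch: it assumes every error entry follows the format of the sample line above, and the function name `Get-SchedulerErrorSummary` is hypothetical (it is not part of HPC Pack). It uses `[pscustomobject]`, so it needs PowerShell 3.0 or later.

```powershell
# Sketch: count scheduler "[Error]" entries per job ID.
# Assumption: entries look like the sample line quoted above.
function Get-SchedulerErrorSummary {
    param(
        [string[]]$Line   # raw log lines, e.g. the output of Get-Content
    )

    # timestamp, anything, "[Error]", anything, "for job <id>. Detail: <text>"
    $pattern = '^(?<time>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) .*\[Error\] .*for job (?<job>\d+)\. Detail: (?<detail>.*)$'

    $Line |
        ForEach-Object {
            if ($_ -match $pattern) {
                [pscustomobject]@{
                    Time   = $Matches['time']
                    JobId  = [int]$Matches['job']
                    Detail = $Matches['detail']
                }
            }
        } |
        Group-Object -Property JobId |
        Select-Object @{ Name = 'JobId';    Expression = { [int]$_.Name } },
                      Count,
                      @{ Name = 'LastSeen'; Expression = { ($_.Group | Select-Object -Last 1).Time } }
}

# Usage (the log location is an assumption; adjust to where HPCScheduler.log lives):
# Get-SchedulerErrorSummary -Line (Get-Content "$env:CCP_DATA\LogFiles\HPCScheduler.log")
```

Lines that do not match the pattern are skipped, so the summary only reflects errors that name a job ID.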
Tuesday, May 13, 2008 2:49 AM
Answers
-
HPC Server 2008 shipped in September 2008, so I'm going through and marking all questions in the beta forum as 'answered'.
- Marked as answer by Don Pattee (Moderator), Wednesday, March 25, 2009 11:57 PM
Wednesday, March 25, 2009 11:57 PM (Moderator)
All replies
-
Now we can't submit new jobs to the system. HPCScheduler crashes (a pop-up appears asking whether we want to debug the application) every time we try to do anything with HPC Job Manager.
Somsak Sriprayoonsakul wrote: We have a serious problem with the job manager. We submitted a job using the "parametric sweep" link on the right of the HPCJobManager console. Somehow the job details seem to be corrupted, and the console crashes every time we try to look at them. We found that just viewing the job details crashes the "HPC Job Scheduler" service, which is why the console freezes. The "job view" command, given the erroneous job ID, also crashes the service. Sometimes we can view the job details for a short while, but trying to do anything to the job (cancel/modify) crashes the "HPCScheduler" service.
We looked at HPCScheduler.log and found many entries like the one below.
2008/05/13 09:30:01 [5][RC] [Error] Unexpected error when process message for job 369. Detail: Object reference not set to an instance of an object.
It seems like something in the database was corrupted. Is there any way to fix this?
For now we are ignoring the job completely: not clicking on it and not trying to look at its details. Note that the job state is "Running" but the task state is "Failed".
Tuesday, May 13, 2008 3:04 AM -
Hi,
Could you please provide the log files to us?
Please run the following PowerShell script on the head node. It should create a folder called ClusterConfig under the directory you run the script from. Please zip that directory and send it to me via email (christc at microsoft dot com).
# Run this from the HPC PowerShell console so the Get-Hpc* cmdlets are available.

# Some location information
$OutputDirName = "ClusterConfig"
$NetworkInfoFile = "$OutputDirName\NetworkInfo.txt"
$NodeInfoFile = "$OutputDirName\NodeInfo.txt"
$HpcLogDir = "$OutputDirName\HpcLogs"
$LogDir = "$OutputDirName\Logs"

# Create a directory in which to stash everything
Echo "Creating directories . . ."
New-Item -Name $OutputDirName -ItemType Directory
# The event-log export below writes into $LogDir, so create it up front
New-Item -Path $LogDir -ItemType Directory

# Get system information
Echo "Getting system info . . ."
msinfo32 /report "$OutputDirName\SysInfo.txt"

# Dump the network information to a file
Echo "Dumping network configuration . . ."
"Network Topology:" > $NetworkInfoFile
Get-HpcNetworkTopology >> $NetworkInfoFile
"" >> $NetworkInfoFile
"Network Interfaces:" >> $NetworkInfoFile
Get-HpcNetworkInterface | Format-List >> $NetworkInfoFile

# Dump the node information to a file
Echo "Dumping node info . . ."
Get-HpcNode | sort NetBiosName | Format-List >> $NodeInfoFile

# Copy over the HPC log files (robocopy creates the destination folder)
Echo "Copying HPC logs . . ."
robocopy $env:CCP_DATA\Logfiles $HpcLogDir /E

# Get event logs
Echo "Copying system logs . . ."
wevtutil epl System "$LogDir\System.evtx"
Echo "Copying application logs . . ."
wevtutil epl Application "$LogDir\Application.evtx"

Thanks,
Christina
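For the zip step, one possible approach on newer systems is the built-in `Compress-Archive` cmdlet. This is a sketch under assumptions: `Compress-Archive` requires PowerShell 5.0 or later (it was not available when this thread was written), and the function name `Compress-ClusterConfig` and the default zip file name are hypothetical.

```powershell
# Sketch: zip the folder produced by the collection script.
# Assumption: PowerShell 5.0+ (Compress-Archive is not available in older versions).
function Compress-ClusterConfig {
    param(
        [string]$SourceDir = 'ClusterConfig',      # folder created by the script above
        [string]$ZipPath   = 'ClusterConfig.zip'   # hypothetical output name
    )

    # Fail early with a clear message if the collection script has not been run
    if (-not (Test-Path -Path $SourceDir -PathType Container)) {
        throw "Folder '$SourceDir' not found; run the collection script first."
    }

    # -Force overwrites an existing archive from a previous run
    Compress-Archive -Path $SourceDir -DestinationPath $ZipPath -Force

    # Return the archive so the caller can see its path and size
    Get-Item -Path $ZipPath
}

# Usage, from the directory that contains ClusterConfig:
# Compress-ClusterConfig
```

On systems without PowerShell 5.0, zipping the folder from Explorer gives the same result.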
Tuesday, May 13, 2008 6:17 AM -
Also, what version of Cluster Manager are you using? Please see Help->About for the version number.
Thank you,
Christina
Tuesday, May 13, 2008 6:24 AM -
Hi,
Thanks for the quick reply.
Our Cluster Manager version is 2.0.1302.0.
I just sent the information to you. Some commands in the PowerShell script failed, though. I included the script's output in the zipped file.
Wednesday, May 14, 2008 6:49 AM -