
[Answered] HPCJobScheduler crash with invalid job

  • Tuesday, 13 May, 2008 2:49 Somsak Sriprayoonsakul
     

    We have a serious problem with Job Manager. We submitted a job using the "parametric sweep" link on the right of the HPCJobManager console. Somehow the job details seem to be corrupted, and the console crashes every time we try to look at them. We found that just viewing the job details crashed the "HPC Job Scheduler" service, which is why the console froze. The "job view" command, given the erroneous job ID, also crashed the service. Sometimes we can view the job details for a short while, but trying to do anything to the job (cancel/modify) crashes the "HPCScheduler" service.

     

    We looked at HPCScheduler.log and found many entries like the one below.

     

    2008/05/13 09:30:01 [5][RC] [Error] Unexpected error when process message for job 369. Detail: Object reference not set to an instance of an object.
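A quick way to see which job IDs such entries refer to is to tally them. The following is a minimal PowerShell sketch, not an official tool: the function name Get-SchedulerErrorCount is made up for illustration, and the pattern is derived only from the single entry quoted above, so other entries may need a looser match.

```powershell
# Sketch: count "Unexpected error" entries per job ID in scheduler log lines.
# Get-SchedulerErrorCount is a hypothetical helper, not part of HPC Pack.
function Get-SchedulerErrorCount {
    param([string[]]$Lines)

    # Pattern based on the one log entry quoted in this thread (assumption:
    # other entries for other jobs follow the same wording).
    $pattern = '\[Error\] Unexpected error when process message for job (\d+)\.'
    $counts = @{}
    foreach ($line in $Lines) {
        if ($line -match $pattern) {
            $jobId = $Matches[1]
            if ($counts.ContainsKey($jobId)) {
                $counts[$jobId]++
            } else {
                $counts[$jobId] = 1
            }
        }
    }
    return $counts
}

# Sample input: the entry quoted above, verbatim.
$sample = @(
    '2008/05/13 09:30:01 [5][RC] [Error] Unexpected error when process message for job 369. Detail: Object reference not set to an instance of an object.'
)
$counts = Get-SchedulerErrorCount -Lines $sample
$counts
```

To run it against the real file, pass the output of Get-Content for HPCScheduler.log as -Lines. If only one job ID shows up, that supports the idea that a single corrupted job is triggering the failures.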

     

    It seems like something in the database was corrupted. Is there any way to fix this?

     

    Right now we are ignoring the job completely: not clicking on it and not trying to look at its details. Note that the job state is "Running" but the task state is "failed".


All replies

  • Tuesday, 13 May, 2008 3:04 Somsak Sriprayoonsakul
     

    Now we can't submit new jobs to the system. HPCScheduler crashes (a pop-up asks whether we want to debug the application) every time we try to do anything with HPC Job Manager.

     


  • Tuesday, 13 May, 2008 6:17 carter_chen (MSFT)
     

    Hi,

     

    Could you please provide the log files to us?

     

    Please run the following PowerShell script on the head node (HN). This should create a folder called ClusterConfig under the directory you're running the script from. Please zip the directory and send it to me via email (christc at microsoft dot com).

     
     

    # Run from HPC PowerShell on the head node so the Get-Hpc* cmdlets are available.

    # Output locations
    $OutputDirName = "ClusterConfig"
    $NetworkInfoFile = "$OutputDirName\NetworkInfo.txt"
    $NodeInfoFile = "$OutputDirName\NodeInfo.txt"
    $HpcLogDir = "$OutputDirName\HpcLogs"
    $LogDir = "$OutputDirName\Logs"

    # Create a directory in which to stash everything, plus the Logs
    # subdirectory that the wevtutil exports below write into
    # (nothing else in the script creates it).
    Echo "Creating directories . . ."
    New-Item -Name $OutputDirName -ItemType directory
    New-Item -Path $LogDir -ItemType directory

    # Get system information
    Echo "Getting system info . . ."
    msinfo32 /report "$OutputDirName\SysInfo.txt"

    # Dump the network information to a file
    Echo "Dumping network configuration . . ."
    "Network Topology:" > $NetworkInfoFile
    Get-HpcNetworkTopology >> $NetworkInfoFile
    "" >> $NetworkInfoFile
    "Network Interfaces:" >> $NetworkInfoFile
    Get-HpcNetworkInterface | Format-List >> $NetworkInfoFile

    # Dump the node information to a file, sorted by node name
    Echo "Dumping node info . . ."
    Get-HpcNode | Sort-Object NetBiosName | Format-List >> $NodeInfoFile

    # Copy over the HPC log files (robocopy creates $HpcLogDir itself)
    Echo "Copying HPC logs . . ."
    robocopy $env:CCP_DATA\Logfiles $HpcLogDir /E

    # Export the System and Application event logs
    Echo "Copying system logs . . ."
    wevtutil epl System "$LogDir\System.evtx"
    Echo "Copying application logs . . ."
    wevtutil epl Application "$LogDir\Application.evtx"

     

     

    Thanks,

    Christina

  • Tuesday, 13 May, 2008 6:24 carter_chen (MSFT)
     

    Also, what version of Cluster Manager are you using? Please see Help -> About for the version number.

     

    Thank you,

    Christina

     

  • Wednesday, 14 May, 2008 6:49 Somsak Sriprayoonsakul
     
    Hi,

    Thanks for the quick reply.
    Our Cluster Manager version is 2.0.1302.0.
    I just sent the information to you. Some commands in the PowerShell script failed, though. I attached the output of the script in the zipped file.
  • Wednesday, 25 March, 2009 23:57 Don Pattee (MSFT, Moderator)
     Answered
     

    HPC Server 2008 shipped in September 2008, so I'm going through and marking all questions in the beta forum as 'answered'.