HPCJobScheduler crash with invalid job

    Question

  • We have a serious problem with Job Manager. We submitted a job using the "parametric sweep" link on the right of the HPCJobManager console. The job details seem to have been corrupted, and the console crashes every time we try to look at them. We found that merely viewing the job details crashes the "HPC Job Scheduler" service, which is why the console freezes. The "job view" command with the erroneous job ID also crashes the service. Sometimes we can view the job details for a short while, but trying to do anything to the job (cancel/modify) crashes the "HPCScheduler" service.

     

    We looked at HPCScheduler.log and found many entries like the one below.

     

    2008/05/13 09:30:01 [5][RC] [Error] Unexpected error when process message for job 369. Detail: Object reference not set to an instance of an object.

     

    It seems that something in the database was corrupted. Is there any way to fix this?

     

    For now we are ignoring the job completely: not clicking on it and not trying to view its details. Note that the job state is "Running" but the task state is "failed".
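
    To see whether job 369 is the only job producing these errors, the log entries can be counted per job ID. This is a minimal sketch, not a fix: the function name Get-JobErrorCount is hypothetical, the pattern is inferred from the single log line quoted above, and it assumes PowerShell 2.0 or later (for the Matches property on Select-String results).

```powershell
# Hypothetical helper: count "[Error] ... job <id>" lines per job ID in a log file.
function Get-JobErrorCount {
    param([string]$Path)

    # The pattern is an assumption based on the one entry quoted in this post.
    Select-String -Path $Path -Pattern '\[Error\].*?job (\d+)' |
        ForEach-Object { $_.Matches[0].Groups[1].Value } |   # keep only the job ID
        Group-Object |                                       # one group per job ID
        Sort-Object Count -Descending |
        Select-Object Name, Count
}

# Example call (the log location is an assumption; adjust it to your installation):
# Get-JobErrorCount -Path (Join-Path $env:CCP_DATA "LogFiles\HPCScheduler.log")
```

    If only one job ID shows up, that supports the idea that a single corrupted job record is involved rather than a wider database problem.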

    Tuesday, May 13, 2008 02:49

Answers

All replies

  • Now we can't submit new jobs to the system. HPCScheduler crashes (a dialog pops up asking whether we want to debug the application) every time we try to do anything with HPC Job Manager.

     

     Somsak Sriprayoonsakul wrote:

    We have a serious problem with Job Manager. We submitted a job using the "parametric sweep" link on the right of the HPCJobManager console. The job details seem to have been corrupted, and the console crashes every time we try to look at them. We found that merely viewing the job details crashes the "HPC Job Scheduler" service, which is why the console freezes. The "job view" command with the erroneous job ID also crashes the service. Sometimes we can view the job details for a short while, but trying to do anything to the job (cancel/modify) crashes the "HPCScheduler" service.

     

    We looked at HPCScheduler.log and found many entries like the one below.

     

    2008/05/13 09:30:01 [5][RC] [Error] Unexpected error when process message for job 369. Detail: Object reference not set to an instance of an object.

     

    It seems that something in the database was corrupted. Is there any way to fix this?

     

    For now we are ignoring the job completely: not clicking on it and not trying to view its details. Note that the job state is "Running" but the task state is "failed".

    Tuesday, May 13, 2008 03:04
  • Hi,

     

    Could you please send us the log files?

     

    Please run the following PowerShell script on the head node. It should create a folder called ClusterConfig under the directory you run the script from. Please zip the directory and send it to me via email (christc at microsoft dot com).

     
     

    # Run this from HPC PowerShell so that the Get-Hpc* cmdlets are available.

    # Output locations
    $OutputDirName = "ClusterConfig"
    $NetworkInfoFile = "$OutputDirName\NetworkInfo.txt"
    $NodeInfoFile = "$OutputDirName\NodeInfo.txt"
    $HpcLogDir = "$OutputDirName\HpcLogs"
    $LogDir = "$OutputDirName\Logs"

    # Create the output directory, plus the subdirectory for the event logs
    # (wevtutil needs its target directory to exist; robocopy creates $HpcLogDir itself)
    Echo "Creating directories . . ."
    New-Item -Path $OutputDirName -ItemType directory
    New-Item -Path $LogDir -ItemType directory

    # Get system information
    # (msinfo32 is a GUI program, so the report may still be being written when the script ends)
    Echo "Getting system info . . ."
    msinfo32 /report "$OutputDirName\SysInfo.txt"

    # Dump the network information to a file
    Echo "Dumping network configuration . . ."
    "Network Topology:" > $NetworkInfoFile
    Get-HpcNetworkTopology >> $NetworkInfoFile
    "" >> $NetworkInfoFile
    "Network Interfaces:" >> $NetworkInfoFile
    Get-HpcNetworkInterface | Format-List >> $NetworkInfoFile

    # Dump the node information to a file
    Echo "Dumping node info . . ."
    Get-HpcNode | Sort-Object NetBiosName | Format-List > $NodeInfoFile

    # Copy over the HPC log files, including subdirectories
    Echo "Copying HPC logs . . ."
    robocopy "$env:CCP_DATA\Logfiles" $HpcLogDir /E

    # Export the System and Application event logs
    Echo "Copying system logs . . ."
    wevtutil epl System "$LogDir\System.evtx"
    Echo "Copying application logs . . ."
    wevtutil epl Application "$LogDir\Application.evtx"
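
    Before zipping the folder, it may help to check which of the expected outputs were actually produced, since any step can fail on its own. The following is a minimal sketch: the function name Get-MissingOutput is hypothetical, and the list of expected items is taken from the file and folder names used in the script above.

```powershell
# Hypothetical helper: list the expected outputs of the collection script that are missing.
function Get-MissingOutput {
    param([string]$Root = "ClusterConfig")

    # File and folder names as written by the collection script
    $Expected = "SysInfo.txt", "NetworkInfo.txt", "NodeInfo.txt",
                "HpcLogs", "Logs\System.evtx", "Logs\Application.evtx"

    foreach ($Item in $Expected) {
        $Path = Join-Path $Root $Item
        # Emit only the paths that do not exist
        if (-not (Test-Path $Path)) { $Path }
    }
}

# Run from the directory that contains ClusterConfig; no output means nothing is missing.
Get-MissingOutput
```

    Mentioning any missing items in the email makes it easier to tell a failed collection step apart from a genuinely empty log.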

     

     

    Thanks,

    Christina

    Tuesday, May 13, 2008 06:17
  • Also, what version of Cluster Manager are you using? Please see Help->About for the version number.

     

    Thank you,

    Christina

     

    Tuesday, May 13, 2008 06:24
  • Hi,

        Thanks for the quick reply.
        Our Cluster Manager version is 2.0.1302.0.
        I have just sent the information to you. Some commands in the PowerShell script failed, though, so I included the script's output in the zip file.
    Wednesday, May 14, 2008 06:49
  •  

    HPC Server 2008 shipped in September 2008, so I'm going through and marking all questions in the beta forum as 'answered'.

    Wednesday, March 25, 2009 23:57