Robustness of HPC Events

  • Question

  • We currently have a requirement to provide an entry point that can run multiple jobs on HPC and not return until all jobs have completed. We were planning to use HPC events; however, we've noticed issues with jobs getting stuck in the "running" state and events not being correctly returned to the client.

    Given that the consequence of not getting an event is a hung application, are events the right choice for this?

    Should we revert to polling HPC instead? At least then we know we will end eventually if we limit the wait by number of checks or by elapsed time.

    What's the best practice in this case?

    Friday, November 4, 2011 1:01 PM
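
The polling fallback raised in the question can be sketched as follows. This is a minimal illustration, not HPC scheduler API code: `wait_for_jobs` and the `get_state` callback are hypothetical names, and the terminal state names are assumptions that would need to match whatever the scheduler actually reports. The point is only that a deadline guarantees the entry point returns (or fails loudly) even if a job never leaves the "running" state.

```python
import time

# Assumed terminal state names for this sketch; adjust to the scheduler's real values.
TERMINAL_STATES = {"Finished", "Failed", "Canceled"}


def wait_for_jobs(job_ids, get_state, timeout_s=600.0, poll_interval_s=5.0,
                  sleep=time.sleep, clock=time.monotonic):
    """Poll until every job reaches a terminal state or the deadline passes.

    get_state is a hypothetical callback: job_id -> state string.
    Returns a dict of job_id -> final state; raises TimeoutError otherwise.
    """
    deadline = clock() + timeout_s
    pending = set(job_ids)
    final = {}
    while pending:
        # Ask the scheduler about each job that has not finished yet.
        for job_id in sorted(pending):
            state = get_state(job_id)
            if state in TERMINAL_STATES:
                final[job_id] = state
        pending.difference_update(final)
        if not pending:
            break
        # Give up instead of hanging if jobs are stuck past the deadline.
        if clock() >= deadline:
            raise TimeoutError(f"jobs still not finished: {sorted(pending)}")
        sleep(poll_interval_s)
    return final
```

The same deadline could also guard an event-based wait: block on the event with a timeout and fall back to one explicit state query when the timeout expires, so a lost event delays the result but cannot hang the application.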

All replies

  • We have this problem too. Did you solve this problem?

    Mato Grenco

    • Edited by mato grenco Monday, November 7, 2011 8:32 AM
    Monday, November 7, 2011 8:23 AM
  • Similar problem here.

    When sending a large number of jobs in parallel, some jobs keep running until they are canceled (each job should finish within seconds). The job status in the upper window (Cluster Manager/Node Management) for these jobs is "Running", while the status in the lower window is "Dispatching". The cluster size seems to have no impact: I could reproduce this problem on a small cluster (~20 cores) as well as on a large cluster (> 500 cores).

    To reproduce this behaviour I used the script below.


    for /L %%i in (1,1,500) do (
     for /L %%j in (1,1,20) do (
      start job submit /jobname:submit hostname.exe
     )
     sleep 3
    )

    hostname.exe is a Windows command that displays the name of the computer.

    Cluster configuration:

    - head node (head node role with domain controller, SQL Server Express version)

    - session node (the script is run from here)

    - client nodes (16 cores per node; number of client nodes: 3 to 35)

    Is there any limitation, such as a minimum time interval between consecutive jobs, needed to ensure the HPC Scheduler works flawlessly? I would like to know what happens when several jobs connect to the HPC Scheduler at nearly the same time (so that the delay between two jobs is as close to 0 ms as possible).

    Daniel Drypczewski
    Monday, December 5, 2011 5:56 AM