Tasks stuck on Dispatching or immediately failing

  • Question

  • VERY URGENT REQUEST HERE!  I am using HPC Pack 2016, which has been working fantastically for months. I finished a large DoE Friday afternoon and started another one within the hour, but now nothing is working! I can't think of anything that changed in this environment; all I know is that tasks are now doing one of four things:

    1. Getting stuck at "Dispatching"

    2. Immediately failing with "Error from node: <node_name>:System.ObjectDisposedException: Cannot access a disposed object. Object name: 'System.ServiceModel.Channels.ServiceChannel'"

    3. Immediately failing with "Error from node: <node_name>:System.ObjectDisposedException: Cannot access a disposed object. Object name: 'WcfReliableClient'. at Microsoft.Hpc.WcfReliableClient`1.<GetWcfProxyAsync>d__10.MoveNext()"

    4. Eventually failing with "Error from node: <node_name>:System.Runtime.Serialization.SerializationException: Unable to find assembly 'Microsoft.Hpc.NodeManager.RemotingExecutor, Version=5.0.0.0, Culture=neutral, PublicKeyToken=null'."

    I have restarted all computers involved, including the head node. I have also uninstalled and reinstalled the HPC Pack components on a Workstation Node and targeted that node exclusively with a new test job; same failure.

    I would appreciate any ideas. I'm under a lot of pressure to get this environment functional again and these studies completed!

    Thanks!
    -MattL

    Saturday, May 6, 2017 8:10 PM

Answers

  • Hi Matt,

    This is a known issue in HPC Pack 2016: we don't show the correct error messages. We will fix it in the next release.

    When you see this exception, the cause is one of the following:

    1. The working directory does not exist or cannot be accessed.

    2. The standard input file cannot be opened on the compute node.
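
    To confirm which case you hit, you can check access from the nodes themselves. A quick sketch using the clusrun tool that ships with HPC Pack (the share path below is a placeholder for your actual working directory):

        rem Runs the command on every node; run it as the same account
        rem that submits the jobs so permissions match.
        clusrun dir \\fileserver\doe\work

        rem Or target only specific nodes:
        clusrun /nodes:NODE01,NODE02 dir \\fileserver\doe\work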


    • Marked as answer by MattManDL Monday, May 8, 2017 1:47 PM
    • Unmarked as answer by MattManDL Monday, May 8, 2017 2:37 PM
    • Marked as answer by MattManDL Friday, May 12, 2017 1:51 PM
    Monday, May 8, 2017 9:34 AM

All replies

  • I'm not 100% sure what fixed it, but after re-running the HpcServer_x64.msi installer on the head node, plus lots of reboots of everything, it is now working. That file is located in the Setup folder of the HPC Pack 2016 install media.
    • Marked as answer by MattManDL Monday, May 8, 2017 1:06 AM
    • Unmarked as answer by MattManDL Monday, May 8, 2017 1:47 PM
    Monday, May 8, 2017 1:06 AM
  • Thank you, Yongjun!  That makes much more sense.  I had created a new batch of jobs, and inadvertently FORGOT to set the working directory and Standard Output/Error paths, as I normally do.  Those jobs are working, whereas the jobs where I HAD set the Standard Output/Error paths are the ones that failed.

    Monday, May 8, 2017 1:47 PM
  • Any guesses why this is just now happening?

    For all jobs in the past, I have set a UNC path as the Working Directory, and all have worked fine. But now, as of Friday afternoon, I see this in the output:

    "CMD.EXE was started with the above path as the current directory.
    UNC paths are not supported.  Defaulting to Windows directory."

    And then numerous steps inside our process fail because the working directory is different than expected. I honestly didn't even realize cmd.exe was involved in any of this, and I'm wondering if I inadvertently changed something recently that brings cmd.exe into the mix?
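
    For reference, cmd.exe rejects a UNC path as its current directory by design and silently falls back to the Windows directory, which matches what we're seeing. A minimal sketch of the behavior and the classic pushd workaround (the share path and executable are placeholders, not our real ones):

        rem Launching cmd.exe with a UNC current directory triggers the
        rem warning, and the working directory becomes C:\Windows:
        start /d "\\fileserver\doe\work" cmd /k cd

        rem Workaround: pushd maps the share to a free drive letter, so the
        rem current directory is a local-style path; popd drops the mapping.
        pushd \\fileserver\doe\work
        ourtool.exe case001.inp
        popd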

    Thanks!
    -Matt

    Monday, May 8, 2017 2:37 PM
  • Hi Matt,

    Did you add any new compute nodes to the HPC cluster, and can all compute nodes access the UNC path?

    Tuesday, May 9, 2017 1:33 AM
  • We have been continually adding new workstation nodes, but no new compute nodes for quite a few months now. All nodes are on the same LAN and can access the UNC path. The account I submit these jobs with also has full access to the UNC path.

    I'm trying to track down what is different on the systems that fail with the "CMD.EXE was started with the above path as the current directory. UNC paths are not supported.  Defaulting to Windows directory." message. I have a number of computers that were all built at the same time: Windows 10 b1607, with no software installed other than .NET 4.6, the C++ 2010 SP1 runtime, and the HPC Pack 2016 workstation node components. In that batch of computers, about 10% fail with the above message about UNC paths not being supported.
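
    One setting I'm comparing across the good and bad machines, since it's the documented switch for exactly this warning (Microsoft KB156276), is the cmd.exe DisableUNCCheck value; this is just a guess on my part:

        rem When set to 1, DisableUNCCheck suppresses the "UNC paths are
        rem not supported" check in cmd.exe. Query both hives on a working
        rem node and on a failing node and compare:
        reg query "HKCU\Software\Microsoft\Command Processor" /v DisableUNCCheck
        reg query "HKLM\SOFTWARE\Microsoft\Command Processor" /v DisableUNCCheck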

    Is there a way to just exclude cmd.exe from being involved in this? The jobs we create have hundreds, possibly thousands, of "Basic Tasks". Those tasks just call an .exe file we have sitting on a file share, with a couple of parameters. We typically set the Working Directory in the Basic Task to the UNC path of the .exe's parent folder, just to make sure all dependencies and temp files land in a known location.

    For now, I have created a Node Prep task that sets up a Working Directory in C:\Windows\Temp, and I am using that new Working Directory in all the Basic Tasks. That will work for now; I'm just curious why this suddenly became an issue.
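
    In case it helps anyone else, here is a rough sketch of the workaround using the job command-line tool; the job ID, paths, and executable names are placeholders (we actually build these jobs in HPC Job Manager):

        rem Create the job; the output reports the new job ID ("Created job, ID: <n>").
        job new /jobname:"DoE batch"

        rem Node Prep runs once per node before any Basic Task: stage the
        rem files locally (xcopy /i creates the destination directory).
        job add 42 /type:NodePrep xcopy /y /i \\fileserver\doe\work C:\Windows\Temp\doe

        rem Basic Tasks use the local folder as the working directory, so
        rem cmd.exe never sees a UNC current directory:
        job add 42 /workdir:C:\Windows\Temp\doe /stdout:case001.out \\fileserver\doe\bin\ourtool.exe case001.inp

        rem Node Release cleans up the staging folder when the job leaves the node:
        job add 42 /type:NodeRelease cmd /c rmdir /s /q C:\Windows\Temp\doe

        job submit /id:42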

    Wednesday, May 10, 2017 2:24 PM
  • Can you try logging on to those failing workstation nodes with the same user and accessing the UNC path, to see whether it has permission?

    I think this should be related to system settings.

    Thursday, May 11, 2017 1:05 AM
  • It is not a permissions issue. I can log on to all nodes with the same user and access the UNC path. For now, using C:\Windows\Temp as the working directory is taking care of the issue. I just had to set up Node Prep and Node Release tasks to stage and clean up the necessary files in the working directory.

    Thanks for your responses!

    -Matt

    Friday, May 12, 2017 1:51 PM