Node with Multiple GPUs Not Functioning

  • Question

  • We have one compute node with 1 socket, 4 cores, 8 logical processors, and 3 NVIDIA GeForce GTX 760 GPUs. Our application is a C#/.NET application that imports a C++ DLL that uses C++ AMP/CUDA for number crunching on the GPU. The application runs fine when launched from the command line on this node, and we can launch 3 instances and use all 3 GPUs in parallel from the command line. It also runs fine as an HPC job with the GPU resource type if we remove two of the GPUs (so only one GPU is installed). But when all 3 GPUs are installed and we launch a job with the GPU resource type, the job stays in the Running state forever. I read on this forum that you need to set the job's resource type to Sockets when the node has multiple GPUs, but with that setting the job also stays in the Running state forever. For us to go forward with MS HPC we need to be able to run HPC jobs that access multiple GPUs on each node.

    Any help with how to get this running would be appreciated.

    Dave Richards

    Wednesday, December 16, 2015 9:57 PM

All replies

  • Hi Dave,

      Which version of HPC Pack are you using now? Since you mentioned the GPU resource type, I assume you are already on HPC Pack 2012 R2 Update 3. Have you read this article: https://technet.microsoft.com/library/mt595856.aspx?

      Also, could you tell me where you got this info: "you need to set the resource type for the job to sockets when the node has multiple GPUs"? If possible, could you share the job XML for the job that stays in the running state forever (through hpcpack@microsoft.com)?


    Qiufang Shi

    Thursday, December 17, 2015 2:49 AM
  • Hi Qiufang,

    I am using Update 3, and I have read the article https://technet.microsoft.com/library/mt595856.aspx. I am able to get the index of the available node that HPC is using by reading the "CCP_GPUIDS" environment variable. If I set up 3 tasks, I can see that the tasks return indices 0, 1, and 2. The job still stays in the running state forever.
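    For reference, here is a minimal sketch of how our C++ DLL could pick its device from that variable. This is illustrative, not our actual code: I am assuming CCP_GPUIDS is a whitespace-separated list of device indices (check the actual format on your own nodes), and the cudaSetDevice call is only indicated in a comment so the snippet stands alone:

    ```cpp
    #include <cstdlib>
    #include <iostream>
    #include <sstream>
    #include <vector>

    // Parse the GPU indices the HPC scheduler assigns to this task via the
    // CCP_GPUIDS environment variable (assumed: whitespace-separated IDs).
    std::vector<int> AssignedGpuIds() {
        std::vector<int> ids;
        const char* raw = std::getenv("CCP_GPUIDS");
        if (raw == nullptr) return ids;  // not running under HPC Pack
        std::istringstream stream(raw);
        int id;
        while (stream >> id) ids.push_back(id);
        return ids;
    }

    int main() {
        std::vector<int> ids = AssignedGpuIds();
        if (ids.empty()) {
            std::cout << "No CCP_GPUIDS set; defaulting to device 0\n";
            return 0;
        }
        // In the real DLL this index would be handed to the CUDA runtime,
        // e.g. cudaSetDevice(ids.front()), before any kernels are launched.
        std::cout << "Using GPU index " << ids.front() << "\n";
        return 0;
    }
    ```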

    Here is the forum link regarding sockets resource type:

    https://social.microsoft.com/Forums/en-US/7bb5fd7c-48d3-4db6-9a5f-52275d308111/node-configuration-single-server-multiple-k80-cards?forum=windowshpcitpros

    I will send you job.xml via email at hpcpack@microsoft.com next. My email is drichards@kcc.us.com

    Thanks for your reply.

    Look forward to hearing from you.

    Dave

    Thursday, December 17, 2015 12:39 PM
  • OK, I just sent the job XML to hpcpack@microsoft.com. Above I meant to say that I am able to get the index of the available GPU, not node.

    Thanks.

    Thursday, December 17, 2015 1:00 PM
  • Hi Qiufang,

    I just realized that we are getting an error back from the HPC job task when the job is cancelled. This is the error we get when our code tries to access the GPU:

    Unhandled Exception: System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.

    We can run our application successfully on this machine from a command line window on all three GPUs, and it also runs successfully via HPC when there is only one GPU (we removed two of the three GPUs for this test). After reading the following article:

    https://technet.microsoft.com/en-us/library/gg247477%28v=ws.10%29.aspx

    we thought we might need to run the HPC job in console mode, so we made the updates described in the article. We now see the following error from the HPC job task:

    Error from node: DTHPC004:Microsoft.Hpc.Activation.NodeManagerException: HPC_CREATECONSOLE set to true but console is in use
       at Microsoft.Hpc.NodeManager.RemotingExecutor.JobEntry.LogonToConsole(SecureString password)
       at Microsoft.Hpc.NodeManager.RemotingExecutor.JobEntry.Init(String userAccount, SecureString password, Byte[] certificate, CreateConsoleConnection createConsole, ConsoleConnection connectConsole, SessionConnection connectSession)
       at Microsoft.Hpc.NodeManager.RemotingExecutor.JobEntryFactory.GetJobEntry(Int32 jobId, String userAccount, Cipher cipher, Byte[] cipherText, Byte[] iv, Byte[] certificate, CreateConsoleConnection createConnection, ConsoleConnection connectConsole, SessionConnection connectSession)
       at Microsoft.Hpc.NodeManager.RemotingExecutor.RemotingNMExecImpl.StartJob(Int32 jobId, String userAccount, Byte[] cipherText, Byte[] iv, Byte[] certificate, CreateConsoleConnection createConnection, ConsoleConnection connectConsole, SessionConnection connectSession)
       at Microsoft.Hpc.NodeManager.RemotingExecutor.RemotingNMExecImpl.StartJobAndTask(Int32 jobId, String userAccount, Byte[] cipherText, Byte[] iv, Byte[] certificate, Int32 taskId, ProcessStartInfo startInfo)

    We confirmed that there are no users logged on to this machine, so we are not sure why it reports that the console is in use. I will email you the job XML for this failed job.

    Any thoughts would be appreciated.

    Dave Richards

    Thursday, December 17, 2015 11:17 PM
  • The error indicates that there is already a console session on the compute node, so a new one could not be created. You can try one of the following:

    1. Log out any sessions from the node, or

    2. Instead of HPC_CREATECONSOLE, use HPC_ATTACHTOCONSOLE=TRUE, which will run the job under the existing console session.
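    These are set as job-level environment variables. As a rough sketch of where option 2 would go in the job XML (the element names here are my assumptions, not a verified schema — compare against a job XML you have exported from your own cluster, since the exact layout can differ between HPC Pack versions):

    ```xml
    <!-- Hypothetical fragment of an HPC Pack job XML file.
         Verify element names against an XML exported from your cluster. -->
    <Job Name="GpuJob">
      <EnvironmentVariables>
        <!-- Attach to the existing console session instead of creating one -->
        <Variable>
          <Name>HPC_ATTACHTOCONSOLE</Name>
          <Value>TRUE</Value>
        </Variable>
      </EnvironmentVariables>
    </Job>
    ```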


    Qiufang Shi

    Monday, December 21, 2015 7:24 AM