How to run a job for a specific program

  • Question

  • Hi, I'm new to HPC and I can't figure out how to run a distributed job. Let's say I have a program running on one machine and it takes a long time to finish processing; I need to use HPC to reduce the processing time. At the moment I have HPC installed for testing, with 1 head node, 1 compute node and 1 client computer to create and submit the job. I tried to create and run jobs in Job Manager but they all failed. Should I install the program that I want to distribute on the client computer where I create and submit the job, or on the head node? What is the command for this when creating a task? I would really appreciate it if anyone could tell me some basic steps to create a job in Job Manager and make it work.

    Thursday, November 25, 2010 6:15 AM

Answers

  • The HPC Pack software will allow you to distribute jobs (run apps) on multiple computers across your cluster. This application needs to either be installed on each machine (compute node) or support running from a file share. The HPC Pack software does not handle application deployment natively.

    For more information on the system management features of the HPC Pack please see http://technet.microsoft.com/en-us/library/ee783547(WS.10).aspx
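
    As a rough illustration only (assuming the HPC Pack PowerShell snap-in is available on the machine you submit from, and using placeholder head node and share names rather than anything from this thread), a job whose single task runs an executable straight from a file share could be created like this:

    # Minimal sketch - the paths and names below are placeholders
    Add-PSSnapin Microsoft.HPC                        # load the HPC Pack cmdlets
    $job = New-HpcJob -Scheduler "MyHeadnode"         # create an empty job on the head node
    Add-HpcTask -Job $job -CommandLine "\\MyHeadnode\app_share\myapp.exe" `
                -Stdout "\\MyHeadnode\app_share\out.txt" -Stderr "\\MyHeadnode\app_share\err.txt"
    Submit-HpcJob -Job $job -Scheduler "MyHeadnode"   # submit; you may be prompted for credentials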

    Wednesday, January 12, 2011 2:44 AM
    Moderator

All replies

  • I have recently started using an HPC cluster as well.

    To start playing around, I just used the Job Scheduler to run a parametric sweep and replaced the default command line, where it says

    Task.exe    with

    echo %computername% *

    After that I moved on to simple batch scripts.

    I put my batch scripts/executables on a local disk on the cluster (a RAID array attached to the head node and shared to the compute nodes).
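
    In case it helps, here is a rough PowerShell equivalent of that first experiment. It assumes the HPC Pack 2008 R2 cmdlets (where Add-HpcTask accepts -Type ParametricSweep), and the head node name is a placeholder:

    # Sketch of a parametric sweep that just echoes which node ran each step
    Add-PSSnapin Microsoft.HPC
    $job = New-HpcJob -Scheduler "headnode"
    # The * in the command line is replaced by the sweep index (1..10 here)
    Add-HpcTask -Job $job -Type ParametricSweep -Start 1 -End 10 `
                -CommandLine "echo %computername% *"
    Submit-HpcJob -Job $job -Scheduler "headnode"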

     

    Hope this helps

    Thursday, November 25, 2010 6:25 AM
  • Hi Steven,

    You just need to make the executable file accessible on all your compute nodes. This means that you can:

    1. Deploy your 'program.exe' (and all related files like DLLs) to all your compute nodes into the same path (like c:\program1_workdir). Then you can use this common path as the task's working directory parameter (there is a rough sketch of this after option 2 below).

    To deploy your application you can use 'clusrun' command: http://technet.microsoft.com/en-us/library/cc947685(WS.10).aspx
    You can also utilize node preparation tasks mechanism: http://technet.microsoft.com/en-us/library/ee783543(WS.10).aspx

    2. Start your application directly from a network share. This share can be located on your head node, so you will just need to deploy your application there and run a task with a command line like this: \\MyHeadnode\app_share\myapp.exe
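
    A rough sketch of option 1, assuming placeholder node names and that the commands are run from a PowerShell prompt on the head node (the xcopy switches and paths are illustrative only):

    # Copy the application from a head node share to the same local path on every compute node
    clusrun /nodes:NODE01,NODE02 xcopy \\MyHeadnode\app_share\* C:\program1_workdir\ /E /I /Y

    # Then point the task's working directory at that common local path
    Add-PSSnapin Microsoft.HPC
    $job = New-HpcJob -Scheduler "MyHeadnode"
    Add-HpcTask -Job $job -WorkDir "C:\program1_workdir" -CommandLine "program.exe"
    Submit-HpcJob -Job $job -Scheduler "MyHeadnode"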

    Regards,
    Łukasz

    Monday, November 29, 2010 3:25 PM
  • Hi,

    Thanks for all the replies. I tried option 2 (starting the application directly from a network share). However, it fails.

    The situation is that I need to run an application, which is a processing worker for another application, on the cluster. In other words, a controller (or manager) on another machine sends jobs to the worker for processing. Since it takes time for the worker to finish a job, I want to use the cluster to distribute the processing. I installed the worker on the head node and tried to create a non-stopping job with one task like \\MyHeadnode\app_share\myapp.exe (non-stopping because the worker doesn't know when the controller will send jobs to it). I did copy the executable file and all related DLLs to the public folder. Did I do it right?

    In the Node Management tab, I can see the CPU usage for the head node is 100% but for the compute node it is nearly 0%. The Heat Map tells me that the head node is processing but the compute node is idle. In the Job Management tab, I got a failure message for the task with an exit code. I just don't know what I did wrong or how to tell if I did something right. I read the manual and the help but it doesn't make sense in my case.

    Regards,

       

    Thursday, December 2, 2010 1:56 AM
  • Hi Steven,

    If you need to run one application which depends on the output of another application, you may consider running them both as tasks of a single job and setting the correct task dependencies for them. Here is a description of how to do that from the GUI: http://technet.microsoft.com/en-us/library/cc972816(WS.10).aspx
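
    As a rough PowerShell sketch of that idea (task names and command lines are placeholders, and this assumes your version of Add-HpcTask supports the -Name and -DependsOn parameters):

    # Two tasks in one job: 'Worker' only starts after 'Controller' finishes
    Add-PSSnapin Microsoft.HPC
    $job = New-HpcJob -Scheduler "MyHeadnode"
    Add-HpcTask -Job $job -Name "Controller" -CommandLine "controller.exe"
    Add-HpcTask -Job $job -Name "Worker" -DependsOn "Controller" -CommandLine "worker.exe"
    Submit-HpcJob -Job $job -Scheduler "MyHeadnode"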

    There can be many reasons for your tasks failing while starting an app from a remote location. Here are a few possibilities:
     - the allocated node doesn't have access to the network location,
     - the user account under which the job is running doesn't have appropriate privileges,
     - in some cases an application or script may not be trusted to run from a network share,
     - the application itself has problems running correctly from the share.

    I suggest trying to run your application manually on the node where it originally failed using an executable from a network share. Please also share all error messages that you are getting.

    Regards,
    Lukasz

    Thursday, December 2, 2010 8:34 PM
  • Hi Lukasz,

    Thanks for your suggestion. I tried to run the application manually from the network share and it's working. About the user account, I used an account belonging to the Administrators group, so I suppose it should be OK. Do you know any way to double-check?

    As for dependent tasks, I just don't think that is my case. The worker's task doesn't run after the controller's task. The scenario is that the user interacts with the controller; when the user executes a job from the controller's GUI, the controller breaks the job into different tasks and assigns them to different workers. Each worker processes its assigned task and returns output to SQL Server.

    As far as I know, tasks in HPC Server are sequential and they run right after submission. My issue is that the worker has nothing to do until the controller assigns a job to it, and the controller has nothing to do until the user asks. What I want is for the worker to just sit on the cluster and wait for a job. When it is assigned a job, it should use resources from different nodes to process the job in order to improve performance.

    Do you know if we can do that with HPC server?

    Below is what I got from error file:

    log4net:ERROR [RollingFileAppender] Unable to acquire lock on file C:\ProgramData\Application Data\IPRO Tech\eCapture\Worker\Worker-log4net.log. The process cannot access the file 'C:\ProgramData\Application Data\IPRO Tech\eCapture\Worker\Worker-log4net.log' because it is being used by another process.

    Unhandled Exception: System.InvalidOperationException: Showing a modal dialog box or form when the application is not running in UserInteractive mode is not a valid operation. Specify the ServiceNotification or DefaultDesktopOnly style to display a notification from a service application.
       at System.Windows.Forms.MessageBox.ShowCore(IWin32Window owner, String text, String caption, MessageBoxButtons buttons, MessageBoxIcon icon, MessageBoxDefaultButton defaultButton, MessageBoxOptions options, Boolean showHelp)
       at System.Windows.Forms.MessageBox.Show(IWin32Window owner, String text, String caption, MessageBoxButtons buttons, MessageBoxIcon icon, MessageBoxDefaultButton defaultButton, MessageBoxOptions options)
       at Microsoft.VisualBasic.Interaction.MsgBox(Object Prompt, MsgBoxStyle Buttons, Object Title)
       at Worker.WorkerApplication.Main(String[] arguments)

    Cheers,

    Steven

     

    Friday, December 3, 2010 6:34 AM
    As far as I know, tasks in HPC Server are sequential and they run right after submission. My issue is that the worker has nothing to do until the controller assigns a job to it, and the controller has nothing to do until the user asks. What I want is for the worker to just sit on the cluster and wait for a job. When it is assigned a job, it should use resources from different nodes to process the job in order to improve performance.

    By the worker sitting on the cluster, do you mean your worker program is running all the time and just waiting for input? Or is it that the nodes are prepared to run your worker program, and a job to start it is submitted every time there is a request from the controller?

    For the second case, I think you can automate the process of creating a job and submitting it for every request by using one of the following options (a rough sketch follows the list):

    • scripting (PowerShell or CLI),
    • the Scheduler API (.NET or COM).
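
    For example, a minimal PowerShell sketch of the scripting option (the share path, head node name and the Submit-WorkerJob helper are hypothetical placeholders for whatever your controller would call):

    # Create and submit one job per incoming request
    Add-PSSnapin Microsoft.HPC
    function Submit-WorkerJob($inputFile) {           # hypothetical helper, one job per request
        $job = New-HpcJob -Scheduler "MyHeadnode" -Name "Worker run"
        Add-HpcTask -Job $job -CommandLine "\\MyHeadnode\app_share\myapp.exe $inputFile"
        Submit-HpcJob -Job $job -Scheduler "MyHeadnode"
    }
    Submit-WorkerJob "\\MyHeadnode\app_share\requests\request1.dat"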

    About the error message you are getting: it looks like your application cannot access the log file located at C:\ProgramData\Application Data\IPRO Tech\eCapture\Worker\Worker-log4net.log because it's already in use. Are you running multiple instances of your app on this single node? Maybe you need to use a different log file for each of them?

    The application is also complaining about not being able to open a modal dialog box, because it's not running in interactive mode. If it is a requirement for your application to run in an interactive session, you may find the following article useful: http://technet.microsoft.com/en-us/library/gg315415(WS.10).aspx

    Thanks,
    Łukasz

    Tuesday, December 7, 2010 5:54 PM
  • By the worker sitting on the cluster, I mean the worker program is running all the time and just waiting for input from the controller.

    I installed the worker on the head node, copied all the files that are in the same folder as the worker's executable to a shared folder on the head node (C:\public), set the working directory to \\headnode\public, and ran a job to execute the worker. Is that the right thing to do for my purpose?

    Tuesday, December 7, 2010 10:33 PM
  • I believe that this is one of the correct approaches. However, whether it will work depends heavily on your worker application's specifics and its support for such a scenario.

    For example, the installation process on the head node may not be limited to exe and dll files only. It may produce registry entries or machine-specific configuration files required by the app to function correctly. If that is the case, the app, when launched on another machine (a compute node) from the head node share, will be missing all these elements. In such a situation it's usually necessary to install the application separately on each of the compute nodes.
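
    If it does turn out that the worker has to be installed on every compute node, something along these lines could automate it from a PowerShell prompt on the head node (the installer path and its silent switch are purely assumptions about your setup):

    # Run the worker's installer unattended on all nodes in the cluster
    clusrun /all \\MyHeadnode\app_share\WorkerSetup.exe /quiet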

    Thursday, December 9, 2010 2:39 PM
  • Hi Lukasz,

    Thanks for your suggestion. I understand that some registry entries or machine-specific configuration files may be required for running the application. However, I thought that the Microsoft HPC solution would help me overcome that issue. The point is that I can only run one instance of the application, so I want to run it on a "super computer" formed from the available resources in the cluster. Can you please confirm whether HPC Server 2008 can do that? If not, is there any way to accomplish that purpose?

    Thank you very much,

    Regards,

    Steven

    Monday, December 13, 2010 10:45 PM
  • Hi,

    Has anyone had the same issue and found an answer for it? I would appreciate it if you could share it with me.

    Cheers,

    Monday, December 20, 2010 11:06 PM