How do I diagnose Node configuration problems?

  • Question

  • I have a new cluster where currently only the head node has my service application files installed. Jobs run fine on that single node, but show failures (State: Failed) for task IDs representing the other nodes in the cluster. I fully expect to see errors in this situation (I'm specifically testing for this condition), but I'm not sure how I can get the detail of the error. I was expecting something like "FileNotFound: C:\...\MyService.dll", but instead all I get in the Cluster Manager "View Job" screen is:

    The task is running on a node which is no longer usable by the task's job.  This could happen because the nodegroups have been changed in the cluster, or because the node has been added to the job's node exclusion list.

    Where could I expect to see more detail on the actual problem (i.e. "You didn't install the service yet, dummy!")?

    I _am_ going to get calls from my customers about nodes they haven't configured correctly.

    Friday, August 26, 2011 2:27 PM

All replies

  • You need to correctly deploy your service.

    1. Put your "myservicename.dll" file on each compute node in the cluster.

    2. Put your service config file "myservicename.config" on each compute node and broker node, in the folder c:\program files\microsoft hpc pack 2008 r2\serviceregistration\, and share it.

    Don't forget to edit the .config file so that it points to the path where the .dll is placed.

    Also check the hosts files for "Private.*" entries.

    I hope this helps.

    Monday, October 3, 2011 2:18 PM
  • Yes, _I_ understand all that, but if my customer has 1,000 nodes and one of them isn't configured correctly because one of their sysadmins didn't drink enough coffee before installing my software on 999 of those nodes, the diagnostics aren't helping him (and by extension, me, usually at 2am on the weekend) to figure out that he's missing some files.

    • Edited by wbradney Monday, October 17, 2011 8:49 PM
    Monday, October 17, 2011 8:43 PM
  • If I understand you correctly, to diagnose an SOA service you could run the "SOA Service loading test" diagnostic in HPC Manager and configure it to check your service by entering your service name in "Configure Test Parameters". You would then see an error on the incorrectly configured nodes.

    Or, if you would like to check that certain files exist, you could write a simple command like "clusrun if not exist MyFilePath (echo Error)".
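
    If there are several files to check, the same idea can be put in a small script that each node runs locally. The following is only a sketch: it assumes Python is available on the nodes, the .dll path is a hypothetical placeholder, and the names REQUIRED_FILES and find_missing are made up for illustration.

```python
import os
import socket
import sys

# Files the service needs on every node.
# The .dll path is a hypothetical placeholder; use your real install path.
# The .config folder is the serviceregistration folder mentioned above.
REQUIRED_FILES = [
    r"C:\MyService\myservicename.dll",
    r"c:\program files\microsoft hpc pack 2008 r2\serviceregistration\myservicename.config",
]


def find_missing(paths):
    """Return the subset of paths that do not exist as regular files."""
    return [p for p in paths if not os.path.isfile(p)]


def main():
    missing = find_missing(REQUIRED_FILES)
    node = socket.gethostname()
    for path in missing:
        # One line per missing file, prefixed with the node name,
        # so the collected output shows which node is misconfigured.
        print("%s: missing %s" % (node, path))
    # A non-zero exit code marks the node as failed.
    return 1 if missing else 0


if __name__ == "__main__":
    sys.exit(main())
```

    Saved under a hypothetical name such as check_service_files.py on a share that every node can read, it could be launched on all nodes with clusrun in the same way as the one-line check above. Each misconfigured node then reports the exact files it is missing.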


    Tuesday, October 18, 2011 9:44 AM
  • My customer's admin guy would love to be able to run that diagnostic, but he can't: http://social.microsoft.com/Forums/en/windowshpcdevs/thread/6f1384f9-fab0-4544-90ea-85a8ffb87331


    Wednesday, October 19, 2011 6:55 PM