Windows 2003 CCS - Jobs failing when running multiple jobs in parallel

    Question

  • I have a small W2003 CCS deployment - one head node and 4 compute nodes. We purchased it as a turnkey system from a hardware vendor in 2007, and it had been running fine until about 10 days ago. After we attempted to install and run some new application software, almost all of our existing batch jobs suddenly began failing, even jobs that we know had run successfully before.

    Uninstalling the software did not resolve the problem; in fact, things seemed to get worse.

    The symptoms were: if there was only one job running on the cluster, it would complete normally. If there was only one compute node running (the other three paused), the jobs would queue up normally, and run one after another. However, if I resumed another node, the jobs that were running would fail, and all subsequent jobs would fail as well.

    It's a long story, but I wound up rebuilding the cluster from scratch using the installation CDs in an attempt to get things working the way they did before. We have "Scenario 3" - compute nodes on a private network using ICS on the head node, an MPI network that connects the head node and the compute nodes, and a NIC pointing to our company network on the head node.

    All connections use Gigabit Ethernet switches - one switch for the MPI network, one for the private network, and the head node's public NIC is plugged into a Gigabit Ethernet switch on the corporate network.

    All machines have the bare essentials - W2003 CCS, CCP with SP1, the minimum hotfixes supplied by the vendor, and Adobe Acrobat Reader.

    Windows Updates is turned off for right now.

    Windows Firewall is turned off for right now.

    As it stands, I seem to have the cluster working - the head node is up, and the compute nodes are recognized and appear in Cluster Admin. I cannot ping or access the compute nodes from the public network, but I can remote desktop to the head node, and remote desktop to the compute nodes from there.

    All of the compute nodes can see the outside world, including AD for authentication. All of the machines have been added to AD.

    The only app installed on the cluster is LS-Dyna, which did not require a Windows install - I just copied the executables and license file over to the compute nodes, and set some system-level environment variables to point to the license file.
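
    (For reference, a machine-wide environment variable can be set from the command line like this - the variable name and path below are only placeholders for whatever your LS-Dyna build expects:)

        REM Placeholder variable name/value; new processes only see it after a restart or logoff/logon.
        reg add "HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Environment" /v LSTC_LICENSE_FILE /t REG_SZ /d "C:\LSDYNA\license.dat" /f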

    However, the same problem occurs: if I pause all the compute nodes, submit multiple jobs, and then resume one node (it doesn't matter which one), everything is fine - the jobs run sequentially and complete. As soon as I resume a second compute node, jobs start to run on the second node, but the jobs already running may or may not finish, and all subsequent jobs in the queue fail, though not always at the same point.

    The jobs we run consist of a template file that submits a single batch file to the scheduler. The batch file reads environment variables and adds tasks to the job to: create a job-specific folder in a work directory on the head node; copy the input files into it from the user's folder; run Dyna (we have templates for running MS-MPI, the vendor's MPI, or a non-MPI (SMP) version); post-process some of the files after Dyna completes; and then move everything back to the user's originating folder.
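
    To give a concrete picture, the batch file does something roughly like the sketch below (the paths and names are placeholders, and the flag names are from memory of the v1 "job" CLI - check "job add /?" for the exact options on your build):

        REM Rough sketch only - %USERDIR% and the paths are placeholders.
        set WORKDIR=\\HEADNODE\Work\job%CCP_JOBID%

        REM 1. Create the job-specific work folder on the head node
        job add %CCP_JOBID% /name:MakeDir cmd /c mkdir "%WORKDIR%"

        REM 2. Copy the input files over from the user's folder
        job add %CCP_JOBID% /name:CopyIn /depend:MakeDir cmd /c xcopy /y "%USERDIR%\*" "%WORKDIR%\"

        REM 3. Run Dyna (MS-MPI template shown; there are also vendor-MPI and SMP variants)
        job add %CCP_JOBID% /name:RunDyna /depend:CopyIn /numprocessors:4 /workdir:%WORKDIR% mpiexec lsdyna.exe i=input.k

        REM 4. Post-process, then move everything back to the user's folder
        job add %CCP_JOBID% /name:PostProc /depend:RunDyna cmd /c postproc.bat "%WORKDIR%"
        job add %CCP_JOBID% /name:CopyBack /depend:PostProc cmd /c move /y "%WORKDIR%\*" "%USERDIR%\"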

    The jobs might fail at any of the steps mentioned above, and MS-MPI, vendor MPI, and SMP jobs all fail. I have tried running the jobs with exclusive allocation of a node, and also requesting specific nodes, but the problem persists. The jobs are small enough to fit on a single node (4 CPUs); none of them actually requires MPI.
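
    (For those runs I used the exclusive and asked-nodes options on the submit, something along these lines - again, flag names from memory, so check "job submit /?":)

        REM Placeholders: NODE1 is the target compute node, myjob.bat the batch file being submitted.
        job submit /exclusive:true /askednodes:NODE1 /numprocessors:4 myjob.bat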

    Any ideas of what could be wrong?

    Also, I had some questions on configuring the hardware. I was not involved in some of the original configuration decisions, so I am trying to understand whether these could be causing the problem (or are problems in general):

    The private network is configured as 192.168.0.x / 255.255.255.0; ICS set up DHCP and DNS for me, and the default gateway is 192.168.0.1, of course.

    The MPI network was manually configured with static IP addresses of 192.168.19.x / 255.255.255.0. The head node's MPI NIC has no default gateway; the compute nodes' MPI NICs have the head node as their default gateway, and all of the MPI NICs have our corporate DNS servers as their DNS servers. Is this correct? Since the MPI network is only for the cluster, it seems like a default gateway and DNS servers on these NICs are unnecessary, but shouldn't harm anything.
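
    (If those gateway/DNS entries should come off, I assume they can be cleared from the command line with something like the following - the connection name and address here are placeholders:)

        REM "MPI" is the placeholder connection name; 192.168.19.11 is the node's MPI address.
        netsh interface ip set address name="MPI" source=static addr=192.168.19.11 mask=255.255.255.0 gateway=none
        netsh interface ip set dns name="MPI" source=static addr=none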

    However, I did notice that the compute nodes appear in the corporate DNS with their MPI network addresses (192.168.19.x), rather than the shared IP address of the head node (which I would have thought would be the case for NATed workstations). These entries were present before I rebuilt the cluster, but I don't know whether a static entry was put into DNS or not.
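
    (My guess is the MPI NICs are registering themselves via dynamic DNS registration; if that's the cause, I believe it can be switched off per connection, e.g.:)

        REM register=none stops this connection from registering itself in DNS.
        REM Any stale A records already in the corporate DNS would still need to be removed by hand.
        netsh interface ip set dns name="MPI" source=static addr=none register=none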

    Also, in the course of trying to resolve this, I discovered that our IT department had hooked up the cluster's mass storage server to the MPI network, not the private network. I disabled the storage server's NIC on the 192.168.19.x network, and for right now I am having the compute nodes go through the ICS gateway to get to it. This did not affect the problem; the jobs continued to fail even after I disabled the NIC. Still, it seems like that setup is not a good idea one way or the other. I tried to plug the storage server into the private network switch, but apparently the open switch ports were disabled by our network group.
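
    (To double-check which way the compute nodes are actually reaching the storage server now, I've been running something like this from a compute node - the server name is a placeholder:)

        REM The first hop should be the head node's ICS address (192.168.0.1)
        REM if the traffic really is going through the ICS gateway.
        tracert -d storage-server
        route print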

    Thanks in advance for any insights!

    • Edited by Drydocked Monday, January 25, 2010 1:34 AM Correct title
    Monday, January 25, 2010 1:25 AM

Answers

  • Hi Dan,

    Thanks so much for replying... Between my original post and yesterday afternoon, I was able to talk to some folks at Microsoft (Frank Chism and Mike Long), and they were able to help me out.

    You were exactly on track with the storage server. The storage server actually had two failed drives, plus one in some sort of pending-failure mode (red blinking LED). One of the failed drives came back after being re-seated, and the IT group replaced the other. The one with the blinking light is still there, however.

    Before I saw your reply, I had tried reconfiguring to eliminate the MPI network; no luck, the problem was still there. I also tried eliminating the MPI network by using the MPI NICs for the private network, again with no luck. I then tried the data pre-population test, taking the storage server out of the picture, and that did in fact correct the problem. However, I had been focused on it being a network routing problem to the storage server, rather than an issue with the storage server not being able to keep up with the cluster.

    Frank and Mike brought up that issue, and subsequently I rewrote my scripts to eliminate any communication with the storage server during application execution, other than transferring data to the cluster at the start and transferring results back from the cluster at the end. The STDOUT and STDERR streams now go to files on the head node, where I created a share for the users to access them. I ran over 150 jobs through the cluster this morning (short LS-DYNA jobs) without a hitch, and the users are back running their analyses.
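
    The reworked flow looks roughly like this (paths and names are placeholders; the important part is that the storage server is only touched at the start and end, and the solver's output goes to the head node share):

        REM Sketch only - stage in once, run against the head node share, stage out once.
        set WORKDIR=\\HEADNODE\Work\job%CCP_JOBID%
        set USERDIR=\\STORAGE\users\%USERNAME%\case1

        job add %CCP_JOBID% /name:StageIn cmd /c xcopy /y "%USERDIR%\*" "%WORKDIR%\"
        job add %CCP_JOBID% /name:RunDyna /depend:StageIn /numprocessors:4 /workdir:%WORKDIR% /stdout:%WORKDIR%\dyna.out /stderr:%WORKDIR%\dyna.err mpiexec lsdyna.exe i=input.k
        job add %CCP_JOBID% /name:StageOut /depend:RunDyna cmd /c move /y "%WORKDIR%\*" "%USERDIR%\"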

    So life is good, and I can go get some sleep. Just have to get that last drive replaced.

    Thanks again for taking the time to try to help, I really appreciate it.
    Wednesday, January 27, 2010 11:45 PM

All replies

  • Hi there
    Do you have any task output from your failed jobs? How about entries in the Windows event logs?
    Just to confirm, some jobs fail during file copy, and some fail during LS-Dyna execution? If you take the storage server out of the equation (i.e. pre-populate the working directory / remove the file copy operations from the submit script), do you see the same behaviour?
    Another worthwhile debug step would be directing MPI traffic over your private network, either by altering the network topology or by using CCP_MPI_NETMASK.
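    Something along these lines should set it cluster-wide from the head node (check cluscfg /? for the exact syntax on your build):
        REM Point MS-MPI at the private (192.168.0.x) network for its traffic.
        cluscfg setenvs CCP_MPI_NETMASK=192.168.0.0/255.255.255.0
        cluscfg listenvs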
    Regards
    Dan
    P.S. It's been a while since I last debugged 2003 CCS, but I can't see much wrong with your config.
    Tuesday, January 26, 2010 3:01 PM
  • Hi
    That's excellent news, I'm glad you managed to get to the bottom of things! :)
    Dan
    Friday, January 29, 2010 8:53 AM