CCS scheduler delay (and some other problems)

  • Question


  • Hello,

      We are setting up a Windows cluster using IBM servers/IBM blades with a Voltaire InfiniBand
    interconnect.  The system seems to work fine, but further tests show some
    problems.  (Each node has 8 CPU cores and is connected to one InfiniBand link and
    one gigabit Ethernet link.  The AD is running on a dedicated server and is not part
    of the CCS.)

    1. When we run a job with fewer than 8 tasks, things work well
    (though the job stays "Queued" for a little under 1 minute before it goes to "Running").
    However, if the number of tasks is larger than 8, so that the job needs more than 1 node,
    it becomes very slow.  According to the Compute Cluster Job Manager, tasks
    start one by one, at intervals of almost exactly 30 seconds.  (If the number of tasks
    is < 8, all tasks start at the same time.)  So it takes minutes between
    the time a job starts and the time all of its tasks have started.  This
    causes problems for some applications (such as gridMathematica, which has a 10-second timeout).

    2. The "clusrun" command shows a similar delay.  Even if I specify only one
    node with the "/nodes:" option, I have to wait about 30 seconds (again).  If
    I choose more nodes, the delay grows proportionally.  In other words, if I want
    to use "clusrun" to issue a command to all nodes of the cluster (which has 14 nodes
    right now and is expected to expand to 58 nodes very soon), it takes
    a very long time.

    3. During the setup of the system, we frequently ran into a problem.  The systems
    (either the AD server or the scheduler server) would randomly reboot due to
    an application error in lsass.exe and mswsock.dll.  (It usually happened
    1 to 2 times each day.)  We found that removing and re-installing the InfiniBand
    device driver sometimes fixes the problem.  However, the problem seems to
    resurface once in a while.
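
    To put the delays in items 1 and 2 into rough numbers, here is a minimal sketch. It assumes the observed 30-second gap applies once per task start and once per node addressed by "clusrun", which is only an extrapolation from the behaviour described above; the function names are illustrative and not part of CCS.

```python
# Rough model of the reported delays. The 30 s value is the poster's
# observation; applying it linearly per task/node is an assumption.
INTERVAL_S = 30


def staggered_start_delay(num_tasks, interval_s=INTERVAL_S):
    """Seconds between the first and last task start when tasks start one by one."""
    return max(num_tasks - 1, 0) * interval_s


def clusrun_delay(num_nodes, interval_s=INTERVAL_S):
    """Seconds spent waiting if the clusrun delay grows proportionally with node count."""
    return num_nodes * interval_s


for nodes in (14, 58):
    print(f"{nodes} nodes: clusrun waits about {clusrun_delay(nodes) / 60:.1f} min")

print(f"16 tasks: last task starts about {staggered_start_delay(16)} s after the first")
```

    Under these assumptions a 14-node "clusrun" waits about 7 minutes and a 58-node one about 29 minutes, and even the second task of a staggered job starts well after a 10-second application timeout.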

      Has anybody had similar problems, and perhaps found some solutions?

    Any help would be appreciated.

    Ting-jen


    Monday, April 21, 2008 3:34 AM

Answers

  • Hello Ting-jen,

     

    I suspect the crashing problem was due to the WinsockDirect (WSD) provider being installed/enabled.  I have seen bugs in a WSD provider take down lsass and other critical system processes before.  It would be worthwhile letting Voltaire know about this issue - perhaps they have a newer release?

     

    Also, with WSD enabled on some nodes but not others (e.g. only the head node and not the compute nodes), you will get longer connection establishment times as the head node will try the WSD path first, and when that times out will try the standard TCP/IP path.
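
    The fallback behaviour described here can be sketched as a toy timing model. This is not how Winsock is implemented; the function name and all timing values (including the 30-second WSD timeout, picked only to match the delay reported in the question) are hypothetical.

```python
def connect_time(local_wsd, remote_wsd,
                 wsd_timeout_s=30.0,   # hypothetical WSD connect timeout
                 wsd_connect_s=0.01,   # hypothetical successful WSD connect time
                 tcp_connect_s=0.01):  # hypothetical plain TCP/IP connect time
    """Toy estimate of connection setup time between two nodes."""
    if local_wsd and remote_wsd:
        # Both ends have the WSD provider: the WSD path succeeds.
        return wsd_connect_s
    if local_wsd:
        # Initiator tries WSD first, waits for the timeout, then falls back to TCP/IP.
        return wsd_timeout_s + tcp_connect_s
    # Initiator has no WSD provider: it goes straight to TCP/IP.
    return tcp_connect_s


# Head node with WSD enabled, compute node without it:
print(f"mismatched: {connect_time(True, False):.2f} s")
# WSD enabled (or disabled) consistently on both ends:
print(f"both WSD:   {connect_time(True, True):.2f} s")
print(f"both TCP:   {connect_time(False, False):.2f} s")
```

    In this model only the mismatched case pays the timeout, which is why making the WSD setting consistent across the head node and compute nodes would be expected to remove the per-connection delay.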

     

    -Fab

    Tuesday, April 29, 2008 4:00 PM

All replies

  •  

    Can you please provide some more details on your configuration?  Namely:

    • What version of CCS are you running (Admin Console -> Help -> About and give us the build number)?
    • What OS and SQL versions?
    • What InfiniBand HCAs and what InfiniBand driver versions?

    Thanks,
    Josh

    Tuesday, April 22, 2008 5:57 PM

  •   The version number is 1.0.0676.14 (from Compute Cluster Management Console).
      The OS is Microsoft Windows Server 2003 R2 Enterprise x64 Edition on the server, and
    Compute Cluster Edition on the compute nodes.
      InfiniBand HCA is Voltaire InfiniBand HCA for PCI Express (MT25208) according to the Device Manager.
      The InfiniBand driver version is 2.5.615.1011.

      Actually, I have managed to fix the problem for now by removing and then re-installing the same InfiniBand driver on all compute nodes, though I still wonder why that works.  It seems to me that the driver is not very stable on Windows.  We have another Linux cluster with exactly the same hardware, and it works quite well.  Has anybody else had problems like this?

    Thanks,

    Ting-jen
    Wednesday, April 23, 2008 12:56 AM
  • Thanks for your information.
      Since I am not very familiar with Windows systems, I cannot say I understand it very well.  However, we did contact Voltaire about this issue, and their reply simply said that their R&D thinks this is an AD configuration error.
      Now, as I said, I am not very familiar with Windows system administration, so I hope somebody can tell me where I might change the AD configuration.  While the cluster seems to be working fine right now, any modification of the InfiniBand network configuration on some of the machines (even on another Linux machine that shares the same InfiniBand switch) still sometimes triggers system errors about lsass.exe on either the AD or the scheduler server.

    -- Ting-jen

     
    Monday, May 5, 2008 1:18 AM