Expiration of logon cache causes jobs to fail

    Question

  • Hello,

    I have a Windows HPC Server 2008 R2 based cluster and the following situation. A couple of months ago a user submitted some jobs to the cluster, and he has not logged in to the cluster since. Today I found that his job had failed for no apparent reason; the error is "Job failed to start on some nodes or some nodes became unreachable." The activity log of the job is:

    16.11.2010 15:35:43 Created by domain\user 
    16.11.2010 15:35:43 Submitted
    16.11.2010 15:36:05 Started
    16.11.2010 15:36:05 Started on BHINOVCL2N3 with 1 cores
    12.01.2011 18:53:58 Ended on BHINOVCL2N3
    12.01.2011 18:53:58 Job Canceled
    12.01.2011 18:54:16 Started
    12.01.2011 18:54:16 Started on BHINOVCL2N4 with 1 cores
    12.01.2011 18:54:18 Ended on BHINOVCL2N4
    12.01.2011 18:54:18 Job Canceled
    12.01.2011 18:54:19 Started
    12.01.2011 18:54:19 Started on BHINOVCL2N4 with 1 cores
    12.01.2011 18:54:19 Ended on BHINOVCL2N4
    12.01.2011 18:54:19 Job Canceled
    12.01.2011 18:54:21 Started
    12.01.2011 18:54:21 Started on BHINOVCL2N3 with 1 cores
    12.01.2011 18:54:21 Ended on BHINOVCL2N3
    12.01.2011 18:54:21 Job Failed

    [By the way, why is it impossible to copy the Job Activity Log? I had to capture it as a picture and OCR it.]

    So, when I looked at the Event Log of node BHINOVCL2N3 I found this:

    Log Name:      System
    Source:        LsaSrv
    Date:          1/12/2011 6:53:56 PM
    Event ID:      45058
    Task Category: Logon Cache
    Level:         Information
    Keywords:      Classic
    User:          N/A
    Computer:      BHINOVCL2N3.domain
    Description:
    A logon cache entry for user user@domain was the oldest entry and was removed. The timestamp of this entry was 11/17/2010 13:09:39.

    It appears to me that as soon as the user's logon cache entry was removed, the job failed. How can I avoid such failures in the future? Can I set the logon cache expiration time somewhere?

    Thursday, January 13, 2011 6:58 AM

All replies

  • Hi Nikita,

    This issue appears to be related to the number of users logging on to the head node as well as the cached login settings in your environment.  I would suggest you open a case with our directory services team.  It appears this is configurable both locally and via group policy, but we'd want a directory services engineer to discuss with you the number of users logging on and the policies currently being pushed out.

    Kevin

    Wednesday, February 16, 2011 6:25 PM
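
As a quick check of the timing described in the question, the two timestamps quoted in the post can be compared directly. The sketch below is in Python and uses only values from the post. It assumes the activity log prints day.month.year in 24-hour time, that the event log prints month/day/year with AM/PM, and that both were recorded against the same clock and time zone.

```python
from datetime import datetime

# Timestamps copied from the post. The format strings are assumptions
# about how each tool printed them (day.month.year vs. month/day/year).
event_removed = datetime.strptime("1/12/2011 6:53:56 PM", "%m/%d/%Y %I:%M:%S %p")
job_ended = datetime.strptime("12.01.2011 18:53:58", "%d.%m.%Y %H:%M:%S")

# Seconds between the logon cache eviction and the task ending on BHINOVCL2N3.
delta = (job_ended - event_removed).total_seconds()
print(f"Task ended {delta:.0f} s after the logon cache entry was removed")
# prints: Task ended 2 s after the logon cache entry was removed
```

A gap of only a couple of seconds is consistent with the reading that the cache eviction and the task ending are related, though on its own it does not prove the cause.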