Hello,
I have a Windows HPC Server 2008 R2 based cluster and the following situation. A couple of months ago a user submitted some jobs to the cluster, and he has not logged in to the cluster since then. Today I found that his job has failed for no apparent reason. The error is "Job failed to start on some nodes or some nodes became unreachable." The activity log of the job is:
16.11.2010 15:35:43 Created by domain\user
16.11.2010 15:35:43 Submitted
16.11.2010 15:36:05 Started
16.11.2010 15:36:05 Started on BHINOVCL2N3 with 1 cores
12.01.2011 18:53:58 Ended on BHINOVCL2N3
12.01.2011 18:53:58 Job Canceled
12.01.2011 18:54:16 Started
12.01.2011 18:54:16 Started on BHINOVCL2N4 with 1 cores
12.01.2011 18:54:18 Ended on BHINOVCL2N4
12.01.2011 18:54:18 Job Canceled
12.01.2011 18:54:19 Started
12.01.2011 18:54:19 Started on BHINOVCL2N4 with 1 cores
12.01.2011 18:54:19 Ended on BHINOVCL2N4
12.01.2011 18:54:19 Job Canceled
12.01.2011 18:54:21 Started
12.01.2011 18:54:21 Started on BHINOVCL2N3 with 1 cores
12.01.2011 18:54:21 Ended on BHINOVCL2N3
12.01.2011 18:54:21 Job Failed
[By the way, why is it impossible to copy the Job Activity Log? I had to copy it as a picture and OCR it.]
So, when I looked at the Event Log of node BHINOVCL2N3 I found this:
Log Name: System
Source: LsaSrv
Date: 1/12/2011 6:53:56 PM
Event ID: 45058
Task Category: Logon Cache
Level: Information
Keywords: Classic
User: N/A
Computer: BHINOVCL2N3.domain
Description:
A logon cache entry for user user@domain was the oldest entry and was removed. The timestamp of this entry was 11/17/2010 13:09:39.
It appears to me that as soon as the user's logon cache entry was removed, the job failed. How can I avoid such incidents in the future? Can I set the logon cache expiration time somewhere?
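
To make the timing explicit, here is a small Python sketch that compares the timestamps quoted above. It assumes that the activity log and the event log both show the same local time, and that the activity log uses day.month.year while the event log uses month/day/year. The variable names are only illustrative.

```python
from datetime import datetime

# Timestamps copied from the logs above. The two logs use different
# date formats; both are assumed to be in the same local time zone.
cache_entry_created = datetime.strptime("11/17/2010 13:09:39", "%m/%d/%Y %H:%M:%S")
cache_entry_removed = datetime.strptime("1/12/2011 6:53:56 PM", "%m/%d/%Y %I:%M:%S %p")
task_ended = datetime.strptime("12.01.2011 18:53:58", "%d.%m.%Y %H:%M:%S")

# Seconds between the cache entry removal and the first "Ended" record.
gap_seconds = (task_ended - cache_entry_removed).total_seconds()

# Age of the cache entry at the moment it was removed.
entry_age = cache_entry_removed - cache_entry_created

print(f"Task ended {gap_seconds:.0f} s after the cache entry was removed")
print(f"The cache entry was {entry_age.days} days old")
```

Under these assumptions the task ended two seconds after the cache entry was removed. Also note that the event text says the entry was removed because it was the oldest one, not because it reached a certain age.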