Thursday, January 13, 2011 6:58 AM
I have a Windows HPC Server 2008 R2 based cluster and the following situation. A couple of months ago a user submitted some jobs to the cluster, and he has not logged in to the cluster since. Today I found that his job had failed for no apparent reason; the error is "Job failed to start on some nodes or some nodes became unreachable." The activity log of the job is:
16.11.2010 15:35:43 Created by domain\user
16.11.2010 15:35:43 Submitted
16.11.2010 15:36:05 Started
16.11.2010 15:36:05 Started on BHINOVCL2N3 with 1 cores
12.01.2011 18:53:58 Ended on BHINOVCL2N3
12.01.2011 18:53:58 Job Canceled
12.01.2011 18:54:16 Started
12.01.2011 18:54:16 Started on BHINOVCl2N4 with 1 cores
12.01.2011 18:54:18 Ended on BHINOVCl2N4
12.01.2011 18:54:18 Job Canceled
12.01.2011 18:54:19 Started
12.01.2011 18:54:19 Started on BHINOVCl2N4 with 1 cores
12.01.2011 18:54:19 Ended on BHINOVCl2N4
12.01.2011 18:54:19 Job Canceled
12.01.2011 18:54:21 Started
12.01.2011 18:54:21 Started on BHINOVCl2N3 with 1 cores
12.01.2011 18:54:21 Ended on BHINOVCl2N3
12.01.2011 18:54:21 Job Failed
[By the way, why is it impossible to copy the Job Activity Log? I had to capture it as a picture and OCR it.]
So, when I looked at the Event Log of node BHINOVCL2N3 I found this:
Log Name: System
Date: 1/12/2011 6:53:56 PM
Event ID: 45058
Task Category: Logon Cache
A logon cache entry for user user@domain was the oldest entry and was removed. The timestamp of this entry was 11/17/2010 13:09:39.
It appears to me that as soon as the user's logon cache entry was removed, the job failed. How can I avoid such incidents in the future? Can I set the logon cache expiration time somewhere?
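To make the timing behind this suspicion explicit, the event log timestamp and the first "Ended" entry in the activity log can be compared with a short Python sketch. The two format strings are assumptions about how each tool printed its dates (month/day/year with AM/PM in the event log, day.month.year in 24-hour time in the activity log), and the comparison assumes both timestamps are in the same time zone.

```python
from datetime import datetime

# Timestamps copied from the post above. The format strings are assumptions
# about each tool's date layout, not something the tools document here.
cache_removed = datetime.strptime("1/12/2011 6:53:56 PM", "%m/%d/%Y %I:%M:%S %p")
job_ended = datetime.strptime("12.01.2011 18:53:58", "%d.%m.%Y %H:%M:%S")

# Difference between the cache entry removal and the first "Ended" entry.
gap_seconds = (job_ended - cache_removed).total_seconds()
print(f"Job ended {gap_seconds:.0f} s after the cache entry was removed")
```

A gap of only a couple of seconds is consistent with the two events being related, although it does not prove that one caused the other.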
Wednesday, February 16, 2011 6:25 PM
This issue appears to be related to the number of users logging on to the head node as well as the cached login settings in your environment. I would suggest you open a case with our directory services team. It appears this is configurable both locally and via group policy, but we'd want a directory services engineer to discuss with you the number of users logging on and the policies currently being pushed out.