Batch file (.cmd) prepared for an HPC job cannot be deleted

Question
-
Hi,
Could you help me?
To run a job, we prepare a batch file, place it in a shared folder on a Windows HPC Server 2008 R2 SP1 server, and submit the job with the path to the batch file.
When our client program receives the "job finished" event, we try to delete the file, but an exception like "the file is being used by another process" is raised.
We assumed that once our client has received the "job finished" event, it is safe to delete the file.
We checked and found that the file can be deleted only 10-30 seconds after the "job finished" event is received.
It seems (I am not sure) that the problem appeared only after the R2 SP1 installation.
Thank you for any help,
Igor.
Wednesday, January 19, 2011 11:56 AM
All replies
-
Hi Igor,
I couldn't reproduce your issue. Could you give some more information like script/code fragments?
Did you consider using NodePrep/Release tasks for this scenario? (http://technet.microsoft.com/en-us/library/ee783543(WS.10).aspx)
Is there a chance that, when you receive the 'job finished' event and try to delete the file manually, another job from the queue is already starting and using the same script file?
Thanks,
Łukasz
Wednesday, January 19, 2011 8:33 PM -
OK.
I am going to prepare a sample and publish it.
It may take several days.
Thank you,
Igor.
Thursday, January 20, 2011 11:28 AM -
Hi Łukasz,
First of all, thank you for the link. We are going to investigate the task types.
We have prepared a sample that shows the problem.
Sample (the main code is in the method Sample_CannotBeDeletedBatchFile):
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading;
using Microsoft.Hpc.Scheduler;
using Microsoft.Hpc.Scheduler.Properties;
using System.IO;

namespace ComputationTeamDemo
{
    class MsHPCDemo
    {
        const string SERVER_NAME = "<SERVER>";
        const string USER_NAME = "<USER>";
        const string USER_PASSWORD = "<PASSWORD>";

        static ManualResetEvent manualEvent = new ManualResetEvent(false);

        static void JobStateCallback(object sender, IJobStateEventArg args)
        {
            Console.WriteLine("JobStateCallback: Job state is " + args.NewState);
            if (JobState.Canceled == args.NewState ||
                JobState.Failed == args.NewState ||
                JobState.Finished == args.NewState)
            {
                manualEvent.Set();
            }
            else
            {
                /*
                try
                {
                    IScheduler scheduler = (IScheduler)sender;
                    ISchedulerJob job = scheduler.OpenJob(args.JobId);
                    // TODO: Do something with the job
                }
                catch (Exception e)
                {
                    Console.WriteLine(e.Message);
                }
                */
            }
        }

        static void TaskStateCallback(object sender, ITaskStateEventArg args)
        {
            Console.WriteLine("TaskStateCallback: State for task {0} is {1}", args.TaskId, args.NewState);
            if (TaskState.Finished == args.NewState || TaskState.Failed == args.NewState)
            {
                try
                {
                    IScheduler scheduler = (IScheduler)sender;
                    ISchedulerJob job = scheduler.OpenJob(args.JobId);
                    ISchedulerTask task = job.OpenTask(args.TaskId);
                    Console.WriteLine("Output from task:\n" + task.Output);
                }
                catch (Exception e)
                {
                    Console.WriteLine(e.Message);
                }
            }
        }

        static void Main(string[] args)
        {
            MsHPCDemo demo = new MsHPCDemo();
            //demo.SimpleDirTask();
            demo.Sample_CannotBeDeletedBatchFile();
        }

        void SimpleDirTask()
        {
            using (IScheduler scheduler = new Scheduler())
            {
                ISchedulerJob job = null;
                ISchedulerTask task = null;

                scheduler.Connect(SERVER_NAME);

                // Create a job and add a task to the job.
                job = scheduler.CreateJob();
                task = job.CreateTask();
                //task.CommandLine = @"dir d:\";
                job.Name = "MsHPC Demo";
                job.Project = "Seminar";
                task.CommandLine = @"dir d:\";
                task.Type = TaskType.ParametricSweep;
                task.StartValue = 1;
                task.EndValue = 100;
                task.IncrementValue = 1;
                job.AddTask(task);

                // Specify the events that you want to receive.
                job.OnJobState += JobStateCallback;
                job.OnTaskState += TaskStateCallback;

                // Start the job.
                scheduler.SubmitJob(job, USER_NAME, USER_PASSWORD);

                // Blocks so the events get delivered. One of your event
                // handlers need to set this event.
                manualEvent.WaitOne();

                Console.Write("\nPress Enter to quit");
                Console.ReadLine();
            }
        }

        //
        // http://social.microsoft.com/Forums/en-US/windowshpcdevs/thread/ca4d2498-54ea-4ce7-9843-a5a0c83f447b#11b73946-0b73-4349-ac0e-10cca558b634
        //
        void Sample_CannotBeDeletedBatchFile()
        {
            using (IScheduler scheduler = new Scheduler())
            {
                ISchedulerJob job = null;
                ISchedulerTask task = null;

                scheduler.Connect(SERVER_NAME);

                // Create a job and add a task to the job.
                job = scheduler.CreateJob();
                scheduler.AddJob(job);
                task = job.CreateTask();

                string jobDataFolder = String.Format(@"\\{0}\Data_Transfer\JobsData\Job{1:0000000}", SERVER_NAME, job.Id);
                string batchFilePath = Path.Combine(jobDataFolder, String.Format("jobCommand{0:0000000}.cmd", job.Id));

                StringBuilder batchFileContent = new StringBuilder();
                batchFileContent.AppendLine(@"dir d:\");

                Directory.CreateDirectory(jobDataFolder);
                File.WriteAllText(batchFilePath, batchFileContent.ToString());

                job.Name = "Sample_CannotBeDeletedBatchFile";
                job.Project = "R2_Investigation";
                task.CommandLine = batchFilePath;
                task.Type = TaskType.ParametricSweep;
                task.StartValue = 1;
                task.EndValue = 10;
                task.IncrementValue = 1;
                job.AddTask(task);

                // Specify the events that you want to receive.
                job.OnJobState += JobStateCallback_CannotBeDeletedBatchFile;
                job.OnTaskState += TaskStateCallback_CannotBeDeletedBatchFile;

                try
                {
                    // Start the job.
                    scheduler.SubmitJob(job, USER_NAME, USER_PASSWORD);

                    // Blocks so the events get delivered. One of your event
                    // handlers need to set this event.
                    manualEvent.WaitOne();
                }
                finally
                {
                    job.OnJobState -= JobStateCallback_CannotBeDeletedBatchFile;
                    job.OnTaskState -= TaskStateCallback_CannotBeDeletedBatchFile;
                }

                Console.WriteLine("{0}: before File.Delete({1}", DateTime.Now, batchFilePath);

                // Retry the delete until it succeeds, logging every failure.
                bool batchFileCannotBeDeleted = true;
                do
                {
                    Console.WriteLine();
                    try
                    {
                        File.Delete(batchFilePath);
                        batchFileCannotBeDeleted = false;
                        Console.WriteLine("{0}: File.Deleted({1})", DateTime.Now, batchFilePath);
                    }
                    catch (Exception exp)
                    {
                        Console.WriteLine("{0}: File.Delete({1}) ==> exp {2}", DateTime.Now, batchFilePath, exp.Message);
                        Thread.Sleep(20);
                    }
                } while (batchFileCannotBeDeleted);

                Console.Write("\nPress Enter to quit");
                Console.ReadLine();
            }
        }

        static void JobStateCallback_CannotBeDeletedBatchFile(object sender, IJobStateEventArg args)
        {
            Console.WriteLine("JobStateCallback: Job state is " + args.NewState);
            if (JobState.Canceled == args.NewState ||
                JobState.Failed == args.NewState ||
                JobState.Finished == args.NewState)
            {
                manualEvent.Set();
            }
        }

        static void TaskStateCallback_CannotBeDeletedBatchFile(object sender, ITaskStateEventArg args)
        {
            Console.WriteLine("TaskStateCallback: State for task {0} is {1}", args.TaskId, args.NewState);
        }
    }
}
I am also attaching the console log of the sample. According to the log, the sample keeps trying to delete the prepared batch file for more than 10 seconds.
JobStateCallback: Job state is Submitted
JobStateCallback: Job state is Validating
JobStateCallback: Job state is Queued
JobStateCallback: Job state is Running
TaskStateCallback: State for task 7671.1.1 is Dispatching
TaskStateCallback: State for task 7671.1.2 is Dispatching
TaskStateCallback: State for task 7671.1.3 is Dispatching
TaskStateCallback: State for task 7671.1.4 is Dispatching
TaskStateCallback: State for task 7671.1.5 is Dispatching
TaskStateCallback: State for task 7671.1.6 is Dispatching
TaskStateCallback: State for task 7671.1.7 is Dispatching
TaskStateCallback: State for task 7671.1.8 is Dispatching
TaskStateCallback: State for task 7671.1.9 is Dispatching
TaskStateCallback: State for task 7671.1.10 is Dispatching
TaskStateCallback: State for task 7671.1.2 is Running
TaskStateCallback: State for task 7671.1.4 is Running
TaskStateCallback: State for task 7671.1.2 is Finished
TaskStateCallback: State for task 7671.1.4 is Finished
TaskStateCallback: State for task 7671.1.8 is Running
TaskStateCallback: State for task 7671.1.1 is Running
TaskStateCallback: State for task 7671.1.3 is Running
TaskStateCallback: State for task 7671.1.10 is Running
TaskStateCallback: State for task 7671.1.5 is Running
TaskStateCallback: State for task 7671.1.3 is Finished
TaskStateCallback: State for task 7671.1.9 is Finished
TaskStateCallback: State for task 7671.1.7 is Finished
TaskStateCallback: State for task 7671.1.6 is Finished
TaskStateCallback: State for task 7671.1.10 is Finished
TaskStateCallback: State for task 7671.1.8 is Finished
TaskStateCallback: State for task 7671.1.5 is Finished
TaskStateCallback: State for task 7671.1.1 is Finished
JobStateCallback: Job state is Finished
1/23/2011 10:45:21 AM: before File.Delete(\\il-winhpc2\Data_Transfer\JobsData\Job0007671\jobCommand0007671.cmd
1/23/2011 10:45:22 AM: File.Delete(\\il-winhpc2\Data_Transfer\JobsData\Job0007671\jobCommand0007671.cmd) ==> exp The process cannot access the file '\\il-winhpc2\Data_Transfer\JobsData\Job0007671\jobCommand0007671.cmd' because it is being used by another process.
1/23/2011 10:45:23 AM: File.Delete(\\il-winhpc2\Data_Transfer\JobsData\Job0007671\jobCommand0007671.cmd) ==> exp The process cannot access the file '\\il-winhpc2\Data_Transfer\JobsData\Job0007671\jobCommand0007671.cmd' because it is being used by another process.
1/23/2011 10:45:24 AM: File.Delete(\\il-winhpc2\Data_Transfer\JobsData\Job0007671\jobCommand0007671.cmd) ==> exp The process cannot access the file '\\il-winhpc2\Data_Transfer\JobsData\Job0007671\jobCommand0007671.cmd' because it is being used by another process.
1/23/2011 10:45:25 AM: File.Delete(\\il-winhpc2\Data_Transfer\JobsData\Job0007671\jobCommand0007671.cmd) ==> exp The process cannot access the file '\\il-winhpc2\Data_Transfer\JobsData\Job0007671\jobCommand0007671.cmd' because it is being used by another process.
1/23/2011 10:45:26 AM: File.Delete(\\il-winhpc2\Data_Transfer\JobsData\Job0007671\jobCommand0007671.cmd) ==> exp The process cannot access the file '\\il-winhpc2\Data_Transfer\JobsData\Job0007671\jobCommand0007671.cmd' because it is being used by another process.
1/23/2011 10:45:27 AM: File.Delete(\\il-winhpc2\Data_Transfer\JobsData\Job0007671\jobCommand0007671.cmd) ==> exp The process cannot access the file '\\il-winhpc2\Data_Transfer\JobsData\Job0007671\jobCommand0007671.cmd' because it is being used by another process.
1/23/2011 10:45:28 AM: File.Delete(\\il-winhpc2\Data_Transfer\JobsData\Job0007671\jobCommand0007671.cmd) ==> exp The process cannot access the file '\\il-winhpc2\Data_Transfer\JobsData\Job0007671\jobCommand0007671.cmd' because it is being used by another process.
1/23/2011 10:45:29 AM: File.Delete(\\il-winhpc2\Data_Transfer\JobsData\Job0007671\jobCommand0007671.cmd) ==> exp The process cannot access the file '\\il-winhpc2\Data_Transfer\JobsData\Job0007671\jobCommand0007671.cmd' because it is being used by another process.
1/23/2011 10:45:30 AM: File.Delete(\\il-winhpc2\Data_Transfer\JobsData\Job0007671\jobCommand0007671.cmd) ==> exp The process cannot access the file '\\il-winhpc2\Data_Transfer\JobsData\Job0007671\jobCommand0007671.cmd' because it is being used by another process.
1/23/2011 10:45:31 AM: File.Delete(\\il-winhpc2\Data_Transfer\JobsData\Job0007671\jobCommand0007671.cmd) ==> exp The process cannot access the file '\\il-winhpc2\Data_Transfer\JobsData\Job0007671\jobCommand0007671.cmd' because it is being used by another process.
1/23/2011 10:45:32 AM: File.Delete(\\il-winhpc2\Data_Transfer\JobsData\Job0007671\jobCommand0007671.cmd) ==> exp The process cannot access the file '\\il-winhpc2\Data_Transfer\JobsData\Job0007671\jobCommand0007671.cmd' because it is being used by another process.
1/23/2011 10:45:33 AM: File.Deleted(\\il-winhpc2\Data_Transfer\JobsData\Job0007671\jobCommand0007671.cmd)

Press Enter to quit
Thank you for any help,
Igor.
Sunday, January 23, 2011 8:54 AM -
Hi Igor,
First of all, thank you very much for all the effort in creating the repro sample!
Unfortunately, when I tried it on my test cluster, I didn't see your problem happening. This is what I got:
JobStateCallback: Job state is Submitted
JobStateCallback: Job state is Validating
JobStateCallback: Job state is Queued
JobStateCallback: Job state is Running
TaskStateCallback: State for task 19.1.1 is Dispatching
TaskStateCallback: State for task 19.1.2 is Dispatching
TaskStateCallback: State for task 19.1.3 is Dispatching
TaskStateCallback: State for task 19.1.4 is Dispatching
TaskStateCallback: State for task 19.1.5 is Dispatching
TaskStateCallback: State for task 19.1.6 is Dispatching
TaskStateCallback: State for task 19.1.7 is Dispatching
TaskStateCallback: State for task 19.1.8 is Dispatching
TaskStateCallback: State for task 19.1.9 is Dispatching
TaskStateCallback: State for task 19.1.10 is Dispatching
TaskStateCallback: State for task 19.1.5 is Running
TaskStateCallback: State for task 19.1.10 is Running
TaskStateCallback: State for task 19.1.8 is Running
TaskStateCallback: State for task 19.1.8 is Finished
TaskStateCallback: State for task 19.1.5 is Finished
TaskStateCallback: State for task 19.1.10 is Finished
TaskStateCallback: State for task 19.1.2 is Running
TaskStateCallback: State for task 19.1.2 is Finished
TaskStateCallback: State for task 19.1.3 is Running
TaskStateCallback: State for task 19.1.6 is Running
TaskStateCallback: State for task 19.1.3 is Finished
TaskStateCallback: State for task 19.1.6 is Finished
TaskStateCallback: State for task 19.1.9 is Running
TaskStateCallback: State for task 19.1.7 is Running
TaskStateCallback: State for task 19.1.1 is Running
TaskStateCallback: State for task 19.1.4 is Running
TaskStateCallback: State for task 19.1.4 is Finished
TaskStateCallback: State for task 19.1.1 is Finished
TaskStateCallback: State for task 19.1.7 is Finished
TaskStateCallback: State for task 19.1.9 is Finished
JobStateCallback: Job state is Finished
1/24/2011 1:14:03 PM: before File.Delete(\\lukasztcluster\\testshare\tests\Job0000019\jobCommand0000019.cmd)
1/24/2011 1:14:03 PM: File.Deleted(\\lukasztcluster\\testshare\tests\Job0000019\jobCommand0000019.cmd)
I think one thing worth trying is to determine which process is holding the file. This can be done with the handle.exe utility, which can be found here: http://technet.microsoft.com/en-us/sysinternals/bb896655.aspx
If the process is on one of the compute nodes, you can use the clusrun command to run handle.exe on all of them, like this:
c:\> clusrun /interleaved \\lukasztcluster\tools\handle.exe -a jobCommand0000019.cmd -accepteula | findstr /i jobCommand0000019.cmd
LUKASZTCN01:LockFile.exe pid: 2280 16C: \Device\Mup\lukasztcluster\testshare\tests\Job0000019\jobCommand0000019.cmd
Please let me know what you've found.
Thank you,
Łukasz
Monday, January 24, 2011 9:46 PM -
Hi Łukasz,
I ran the command you sent (with our paths and names) on the first node (I checked that this node was used to run the job), but unfortunately clusrun did not show anything.
I ran it several times and, suddenly, the first node rebooted.
I have now asked our IT department to check the issue with the first node. Once it has been investigated and solved, I am going to run clusrun again.
Igor.
- Edited by ihar_z Tuesday, January 25, 2011 10:23 AM spelling
Tuesday, January 25, 2011 10:19 AM -
Hi Łukasz,
I tried the Process Monitor utility with a single filter rule, "path contains jobCommand", and found many events on all sides (client, server, node).
I don't want to upload the log files to a public place. How can I send the files (several megabytes) to you?
Igor.
Tuesday, January 25, 2011 11:34 AM -
Hi Igor,
Just for clarification, did you run 'clusrun handle.exe' while your sample was still showing the file as being used by another process? Also, if your sample uses only a single node, you can run 'handle.exe -a <script_name>' directly on that node (while logged on with Remote Desktop) without using 'clusrun'.
Another question: is the network share located on your head node? Is there a chance that some file monitoring software (like antivirus) is running there and locking the file for some reason? Does the problem also occur if you generate the script as a local file (without using a network share)?
You can send me the logs via email (lutom@microsoft.com). However, I am not sure I'll be able to help if you're not seeing anything via handle.exe. Information about which file was locked at the time the logs were captured would also be useful (together with the timestamps of the Delete exceptions).
Regards,
Łukasz
Tuesday, January 25, 2011 4:21 PM -
Hi Łukasz,
I am going to discuss these questions with our IT department and run several additional tests.
I will publish the results here (and may send the logs to your email).
Thank you,
Igor.
Tuesday, January 25, 2011 7:04 PM -
When I ran the program from the server computer (head node), no problems were found: the folder was deleted on the first attempt.
So the problem with deleting occurs only when the program runs from a client computer.
Igor.
Thursday, January 27, 2011 10:54 AM -
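The thread does not identify which process holds the file, so the following is only a workaround sketch, not a fix. It bounds the retry loop from the sample above so that a client cannot spin forever when the lock is never released. `BatchFileCleanup` and `TryDeleteWithRetry` are hypothetical names introduced here, not part of the HPC scheduler API, and the timeout and delay values in the usage note are assumptions based on the 10-30 second delay reported in the question.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Threading;

static class BatchFileCleanup
{
    // Hypothetical helper: keeps trying to delete the file until the delete
    // succeeds or the timeout elapses. Returns true when the file is gone.
    public static bool TryDeleteWithRetry(string path, TimeSpan timeout, TimeSpan delay)
    {
        Stopwatch elapsed = Stopwatch.StartNew();
        while (true)
        {
            try
            {
                // File.Delete does not throw if the file is already absent.
                File.Delete(path);
                return true;
            }
            catch (IOException)
            {
                // "Being used by another process" surfaces as an IOException;
                // fall through and retry after a pause.
            }

            if (elapsed.Elapsed >= timeout)
            {
                return false; // give up and let the caller log or reschedule cleanup
            }
            Thread.Sleep(delay);
        }
    }
}
```

In the sample, the do/while loop after `manualEvent.WaitOne()` could then be replaced by a single call such as `BatchFileCleanup.TryDeleteWithRetry(batchFilePath, TimeSpan.FromSeconds(60), TimeSpan.FromSeconds(1))`, with the return value logged. The 60-second limit is an arbitrary choice that leaves some margin over the observed delay.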