Saturday, November 12, 2011 02:50
We recently upgraded to Windows 2008 HPC Pack R2. Since upgrading we have been having very odd intermittent issues with the compute nodes, which are running Windows HPC Server (not R2) with the R2 Pack installed. All nodes are connected with 1-gig or 10-gig Ethernet to a high-performance NAS system.
What we were seeing is that sometimes jobs would just hang in a zombie state on random compute nodes. After canceling the job in Cluster Manager, the applications that were launched by the scheduler could not be killed with Task Manager, taskkill, Process Explorer, etc. The only fix was to reboot the compute node.

After a lot of testing, we narrowed it down to a tilde in the names of the files being read or written. We tested writing the files both with a .NET call inside a custom application and with a simple PowerShell script. The file names look like this: XXXX~BB~C.10085.19950623.1959-1959.csv. The jobs read and write many thousands of files in this format. If we run the HPC job writing files this way, the job will typically fail on 20 out of 50 compute servers; if we change the tilde to an underscore, we have no issues at all. We originally suspected the storage system we were reading from and writing to, but we see the same effect when reading and writing to a standalone Windows 2008 R2 server.

Can you tell me if there are any known issues with this? We did not experience this with 2008 HPC; we only started seeing it when we upgraded to R2.
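For reference, this is a minimal sketch of the kind of naming test we ran. It is a Python stand-in for the actual PowerShell script (which I don't have in front of me), and the file counts and contents are illustrative, not the real job data:

```python
# Stand-in for the PowerShell test: write files whose names contain
# tildes in the same pattern the jobs use, then read each one back.
# On the affected cluster it was this write/read loop that hung;
# counts and contents here are illustrative only.
import os
import tempfile

def write_and_verify(out_dir, count=100, sep="~"):
    """Write `count` small CSV files named like XXXX~BB~C.<id>...csv,
    then read each one back and confirm the contents round-trip."""
    paths = []
    for i in range(count):
        name = f"XXXX{sep}BB{sep}C.{10000 + i}.19950623.1959-1959.csv"
        path = os.path.join(out_dir, name)
        with open(path, "w") as f:
            f.write("col1,col2\n1,2\n")
        paths.append(path)
    # Read everything back; a hang or failure here is what we were chasing.
    for path in paths:
        with open(path) as f:
            assert f.read() == "col1,col2\n1,2\n"
    return len(paths)

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print("tilde files:", write_and_verify(d, count=50))          # sep="~"
        print("underscore files:", write_and_verify(d, count=50, sep="_"))
```

Swapping `sep="~"` for `sep="_"` is the only change between the failing and succeeding runs.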
Any help is appreciated
All replies
Friday, November 18, 2011 21:22
Just to be clear - do you have a cluster where the head node is Windows 2008 HPC Pack R2 (RTM?, SP1?, SP2? or SP3?) and compute nodes running Windows HPC Server 2008? If this is the case, is there any reason why you didn't upgrade your compute nodes as well?
Or are you saying that now that you have upgraded your entire cluster that jobs that worked on HPC Pack are not working on HPC Pack R2?
Tuesday, November 29, 2011 15:25
Our head node operating system is Windows 2008 R2 HPC Edition with 2008 HPC Pack R2, and our compute nodes' operating systems are Windows 2008 HPC SP2 with 2008 HPC Pack R2 installed on them. We did not upgrade the operating system on the compute nodes for cost reasons, as we were told that we did not need to upgrade anything other than the HPC Pack on the compute nodes.
We were previously operating everything with all OSes being Windows 2008 HPC SP2 and Windows 2008 HPC Pack SP2. We upgraded the head node OS and the HPC Pack on all of the compute nodes, and we have had issues ever since. Not only with the ~, but also with jobs just hanging and never doing any work: you have to manually force the nodes offline, after which the job gets moved to another node and then runs through. The jobs seem to get into a zombie state where all of the cores are allocated but no CPU resources are being used, and they will stay in that state forever unless you "help" them along.
Tuesday, November 29, 2011 16:32
I am seeing the same behavior on our cluster since upgrading to SP2.
Thursday, December 15, 2011 14:01
I had originally thought that SP3 solved this problem, but it looks like we still have issues with jobs just hanging: the job allocates nodes or cores and shows as running in the process tree of the compute nodes, but nothing is actually happening. We still have to force the nodes offline, reboot, or manually cancel the tasks, or the job will just stay "running" forever with no progress.
Tuesday, January 10, 2012 13:39
As for the tilde hanging the compute nodes, this has never been solved, and we moved away from using tildes in the file names. As for the other hanging issues, we found that if we had .NET code that called, say, a C DLL, and that DLL crashed, then even though there was a try/catch statement in the .NET code, Cluster Manager would not recognize that the application or task had actually failed. We disabled WER and that seemed to fix the issue.
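To illustrate why the try/catch never fired: a hard crash in native code (an access violation in the C DLL) kills the process before any managed exception handler can run, so the only signal the outside world gets is the process exit status (and, with WER enabled, the error-reporting dialog/dump collection that holds the process open). This is a hedged Python stand-in for that behavior, not our actual .NET code; it provokes a native-level crash via ctypes in a child process:

```python
# Sketch: a hard crash in native code is not a catchable exception in
# the host language -- the process dies before any handler runs, and
# the parent only learns about it from the exit status. This mirrors
# how our crashed C DLL looked to the scheduler; the .NET/WER details
# are from our cluster, the Python code here is just an illustration.
import subprocess
import sys

# Dereferencing address 0 in native code crashes the interpreter itself;
# a try/except wrapped around this line would never get a chance to run.
CRASH = "import ctypes; ctypes.string_at(0)"

def run_child(code):
    """Run `code` in a child interpreter and return its exit status."""
    proc = subprocess.run([sys.executable, "-c", code])
    return proc.returncode

if __name__ == "__main__":
    # Nonzero (on POSIX, negative = killed by signal, e.g. SIGSEGV);
    # the crash is only visible from outside the crashed process.
    print("crashing child exit status:", run_child(CRASH))
    print("healthy child exit status:", run_child("pass"))
```

In our case, with WER in the picture, the crashed task's process lingered while error reporting did its thing, which is consistent with the task showing as "running" forever.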