Windows 2008 HPC R2, Issues with filenames that contain a tilde ~

    Question

  • We recently upgraded to Windows 2008 HPC Pack R2.  Since upgrading we have been having very odd intermittent issues with the compute nodes, which are running Windows HPC Server (not R2) with the R2 Pack installed.  All nodes are connected with 1 Gb or 10 Gb Ethernet to a high-performance NAS system.

     

    What we saw is that jobs would sometimes hang in a zombie state on random compute nodes.  After canceling the job in Cluster Manager, the applications that the scheduler had started could not be killed with Task Manager, taskkill, Process Explorer, etc.  The only fix was to reboot the compute node.  After a lot of testing we narrowed the cause down to a tilde in the names of the files being read or written.  We tested writing the files with both a .NET call within a custom application and a simple PowerShell script.  The file names look like this:  XXXX~BB~C.10085.19950623.1959-1959.csv.  The jobs read and write many thousands of files in this format.  If we run the HPC job writing files this way, the jobs will typically fail on 20 out of 50 compute servers.  If we change the tilde to an underscore we do not have any issues.  We originally thought the cause was the storage system that we were reading and writing the data to, but we see the same effect when reading and writing to a standalone Windows 2008 R2 server.  Can you tell me if there are any known issues with this?  We did not experience this with 2008 HPC; we only started seeing it when we upgraded to R2.

    Any help is appreciated

    Saturday, 12 November 2011 02:50
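
The write test described in the question can be approximated with a short script. This is only a minimal sketch in Python, not the original .NET or PowerShell code, which was not posted. The helper names `make_name` and `write_batch` are hypothetical, the numeric field that varies per file is an assumption, and a temporary local directory stands in for the NAS share.

```python
import tempfile
from pathlib import Path


def make_name(index: int, sep: str = "~") -> str:
    """Build a name shaped like XXXX~BB~C.10085.19950623.1959-1959.csv.

    Only the separator and one numeric field vary (an assumption made for
    illustration); the other fields are copied from the example in the post.
    """
    return f"XXXX{sep}BB{sep}C.{10085 + index}.19950623.1959-1959.csv"


def write_batch(target: Path, count: int, sep: str) -> int:
    """Write `count` small CSV files into `target`; return how many exist afterwards."""
    target.mkdir(parents=True, exist_ok=True)
    for i in range(count):
        (target / make_name(i, sep)).write_text("a,b,c\n1,2,3\n")
    return sum(1 for _ in target.glob("*.csv"))


if __name__ == "__main__":
    # Replace the temporary directory with the share under test.
    with tempfile.TemporaryDirectory() as tmp:
        for label, sep in (("tilde", "~"), ("underscore", "_")):
            written = write_batch(Path(tmp) / label, 1000, sep)
            print(f"{label}: {written} files written")
```

Running the same batch once with `~` and once with `_` as the separator, as a cluster job on each compute node, would give a like-for-like comparison of the two naming schemes.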

Answers

  • As for the tilde hanging the compute nodes, this was never solved, and we moved away from using tildes in the file names.  As for the other hanging issues, we found that if .NET code called another DLL (say, one written in C) and that DLL crashed, Cluster Manager would not recognize that the application or task had failed, even though the .NET code contained a try/catch statement.  We disabled WER (Windows Error Reporting) and that seemed to fix the issue.
    • Edited by rmag Tuesday, 10 January 2012 13:39
    • Marked as answer by rmag Tuesday, 10 January 2012 13:40
    Tuesday, 10 January 2012 13:39
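
For existing data sets, the workaround of dropping tildes from the file names could be scripted roughly as follows. This is a hedged sketch, not the poster's actual tooling: `without_tildes` and `rename_tilde_files` are hypothetical helpers, and the underscore replacement follows the substitution mentioned in the question.

```python
import tempfile
from pathlib import Path
from typing import List, Tuple


def without_tildes(name: str, replacement: str = "_") -> str:
    """Return `name` with every tilde replaced (underscore by default)."""
    return name.replace("~", replacement)


def rename_tilde_files(directory: Path, replacement: str = "_") -> List[Tuple[str, str]]:
    """Rename files in `directory` whose names contain a tilde.

    Returns (old_name, new_name) pairs. A file is skipped if the new name
    already exists, so nothing is overwritten.
    """
    renamed = []
    # sorted() materializes the listing before any rename happens.
    for path in sorted(directory.iterdir()):
        if not path.is_file() or "~" not in path.name:
            continue
        new_path = path.with_name(without_tildes(path.name, replacement))
        if new_path.exists():
            continue
        path.rename(new_path)
        renamed.append((path.name, new_path.name))
    return renamed


if __name__ == "__main__":
    # Demonstration on a throwaway directory.
    with tempfile.TemporaryDirectory() as tmp:
        folder = Path(tmp)
        (folder / "XXXX~BB~C.10085.19950623.1959-1959.csv").write_text("a,b\n")
        for old, new in rename_tilde_files(folder):
            print(f"{old} -> {new}")
```

New output is better generated without tildes in the first place (for example by passing each name through `without_tildes` before writing), so that a rename pass is only needed for files that already exist.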

All replies

  • Just to be clear: do you have a cluster where the head node runs Windows 2008 HPC Pack R2 (RTM, SP1, SP2 or SP3?) and the compute nodes run Windows HPC Server 2008?  If so, is there any reason why you didn't upgrade your compute nodes as well?

     

    Or are you saying that, now that you have upgraded your entire cluster, jobs that worked on HPC Pack are not working on HPC Pack R2?

    Thanks,

    Mark

     

    Friday, 18 November 2011 21:22
  • Our head node's operating system is Windows 2008 R2 HPC Edition with 2008 HPC Pack R2, and our compute nodes' operating system is Windows 2008 HPC SP2 with 2008 HPC Pack R2 installed.  We did not upgrade the operating system on the compute nodes for cost reasons, as we were told that we did not need to upgrade anything other than the HPC Pack on the compute nodes.

     

    Previously everything ran Windows 2008 HPC SP2 as the OS with Windows 2008 HPC Pack SP2.  We upgraded the head node OS and the HPC Pack on all of the compute nodes, and we have had issues ever since: not only with the ~, but with jobs just hanging and never doing any work.  You have to manually force the nodes offline; the job then gets moved to another node and runs through.  The jobs seem to get into a zombie state where all of the cores are allocated but no CPU resources are being used, and they will stay in that state forever unless you "help" them along.

    Tuesday, 29 November 2011 15:25
  • I am seeing the same behavior on our cluster since upgrading to SP2.  
    • Marked as answer by rmag Thursday, 15 December 2011 14:00
    • Unmarked as answer by rmag Thursday, 15 December 2011 14:00
    Tuesday, 29 November 2011 16:32
  • I had originally thought that SP3 solved this problem, but it looks like we still have issues with jobs just hanging: the job allocates nodes or cores and shows as running in the process tree of the compute nodes, but nothing is actually happening.  We still have to force the nodes offline, reboot, or manually cancel the tasks, or the job will just stay "running" forever with no progress.


    • Marked as answer by rmag Thursday, 15 December 2011 14:01
    • Unmarked as answer by rmag Monday, 19 December 2011 00:46
    • Edited by rmag Monday, 19 December 2011 00:49
    Thursday, 15 December 2011 14:01