Jobs fail for no apparent reason

  • Question

  • Hello,

    I have not seen this before, but all of a sudden some of my jobs are failing for no apparent reason. The first 10 work and the remainder fail. I can't figure out where this came from. I know that 10 is a magic number, but I never had this issue before.

    The error that I see is "Failed to activate task ***** Exception 'Working Directory '\\server\share\dir' does not exist' reported creating a new task for job ID ***, task ID 1."

    I then tested whether I could run: clusrun dir \\server\share\ > test.txt

    It works for ten or so computers, and for the rest it tells me "No more connections can be made to this remote computer at this time because there are already as many connections as the computer can accept".

    I looked at my registry key HKLM\SYSTEM\CurrentControlSet\Services\LicenseInfo\FilePrint\ConcurrentLimit, and it is set to 1028.

    Any ideas what is going on?

    Thanks,

    Ilya

    Friday, May 30, 2008 10:21 PM
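
Editor's note: the two symptoms above (a "does not exist" error from the scheduler, and a "no more connections" error from a direct test) can be told apart on a node with a small probe. The following is a minimal sketch, not anything from the cluster tooling. The function name `probe_working_directory` is hypothetical, and it assumes the refused connection surfaces in Python as an `OSError` carrying Windows error code 71 (ERROR_REQ_NOT_ACCEP), which is the code behind the "No more connections can be made..." message.

```python
import os

# Windows system error code for "No more connections can be made to this
# remote computer at this time..." (ERROR_REQ_NOT_ACCEP).
ERROR_REQ_NOT_ACCEP = 71


def probe_working_directory(path):
    """Classify why a working directory cannot be used (hypothetical helper).

    Returns one of: "ok", "missing", "connection-limit", "error".
    """
    try:
        os.listdir(path)
    except FileNotFoundError:
        # The path really is absent (or the share name is wrong).
        return "missing"
    except OSError as exc:
        # `winerror` only exists on Windows; getattr keeps this portable.
        if getattr(exc, "winerror", None) == ERROR_REQ_NOT_ACCEP:
            return "connection-limit"
        return "error"
    return "ok"
```

Running `probe_working_directory(r"\\server\share\dir")` on each node (for example through clusrun) would then show whether the directory is truly missing or whether the file server is refusing additional connections.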

Answers

  • All my nodes are running Windows 2003 CCS, so they should not be subject to the limit of 10 connections.

    Also, the additional directory beyond \\server\share is unnecessary, but I have it anyway.

    Anyway, I figured out the problem, but I am still not sure how it came about.

    On the shared directory, if I click Properties and open the Sharing tab, there is an option to limit the number of simultaneous connections. I had it set to "Maximum allowed", which was causing my problem. For some reason the maximum allowed is 10, and I am not sure where that maximum is actually set. I then set the number of connections to 1000 and everything worked just fine.

    Thanks,

    Ilya

    Saturday, May 31, 2008 5:07 PM
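
Editor's note: the same change can also be made from the command line with `net share <name> /USERS:<n>` (or `/UNLIMITED`), which may be handy when several shares need fixing. Below is a minimal sketch that builds and optionally runs that command. The helper names are hypothetical, the share name `share` is taken from the \\server\share path in the question, and the command is assumed to run with administrative rights on the machine that hosts the share.

```python
import subprocess


def net_share_limit_command(share_name, max_users=None):
    """Build the `net share` argument list that sets a share's user limit.

    max_users=None requests /UNLIMITED; otherwise /USERS:<max_users>.
    """
    if max_users is None:
        limit = "/UNLIMITED"
    else:
        if max_users < 1:
            raise ValueError("max_users must be a positive integer")
        limit = "/USERS:{}".format(max_users)
    return ["net", "share", share_name, limit]


def apply_share_limit(share_name, max_users=None):
    """Run the command on the file server (Windows only, needs admin rights)."""
    return subprocess.run(net_share_limit_command(share_name, max_users), check=True)
```

For the setup described in this thread, `apply_share_limit("share", 1000)` would correspond to the value entered in the Sharing tab. Running `net share share` with no options prints the current settings, including the maximum number of users, which is a quick way to confirm the change.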

All replies

  • Try running "mkdir \\server\share\dir". I think the job is failing because the "dir" directory doesn't exist, not because the share doesn't exist.

    -Josh

    Saturday, May 31, 2008 8:06 AM
  • Hi Ilya,

    Typically this is caused by \\server being a workstation running Vista or Windows XP, where the number of simultaneous connections is limited to 10. Try using another server, such as one of your compute nodes or the head node, and it should work fine.

    Saturday, May 31, 2008 3:30 PM