I have an HPC cluster with one head node, which is also the only compute node. Every 3 hours I submit the same job to the cluster. It has 89 tasks. At random, one of the tasks fails with error code -1073741629 - The network responded incorrectly.
An example task command line: \\filer-path\...\myapp.exe param1 param2
Can anyone help me understand why?
Thank you very much
Without knowing much more about your application, I can't say for sure why you're seeing this. My best guess is that your network or file server occasionally has trouble when a task accesses it for both the application executable and the data file. Since you're only using one node, you could try copying the app and data to a local disk and running the job again. If the failures stop, the file share is the likely cause.
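As a rough illustration of the suggestion above, here is a minimal Python sketch, not a tested fix for this cluster. It does two things: it renders the signed exit code in hex (-1073741629 is 0xC00000C3, the NTSTATUS value STATUS_INVALID_NETWORK_RESPONSE, which matches the "network responded incorrectly" message), and it copies a file from the share to a local folder with a few retries. The function names, the retry count, and the delay are assumptions made for illustration; nothing here comes from the original job.

```python
import shutil
import time
from pathlib import Path


def ntstatus_hex(exit_code: int) -> str:
    """Render a signed 32-bit exit code as an unsigned hex NTSTATUS value."""
    return f"0x{exit_code & 0xFFFFFFFF:08X}"


def stage_locally(source: Path, local_dir: Path,
                  attempts: int = 3, delay_s: float = 5.0) -> Path:
    """Copy `source` into `local_dir`, retrying on I/O errors.

    The retry count and delay are illustrative defaults, not tuned values.
    Returns the path of the local copy.
    """
    local_dir.mkdir(parents=True, exist_ok=True)
    target = local_dir / source.name
    for attempt in range(1, attempts + 1):
        try:
            shutil.copy2(source, target)  # copies contents and timestamps
            return target
        except OSError:
            if attempt == attempts:
                raise  # out of retries: surface the last error
            time.sleep(delay_s)
    raise ValueError("attempts must be at least 1")
```

One way to use it would be to call stage_locally once per job for the executable and the data file (for example into a hypothetical folder such as C:\hpc-staging), then point each task's command line at the local copies instead of the share.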