Scheduler.Connect throws "could not register with the server. try again later." exception when connecting from compute node
Wednesday, June 15, 2011 4:42 PM
First, some background. I am running two Windows HPC Server 2008 clusters, one for development and one for production workloads. The two clusters are configured similarly, except that the development cluster has a single head node with all nodes (head and compute) on the same network, while the production cluster has two head nodes in a failover cluster and all compute nodes on a private network.
My application is designed so that a job, on completion, may submit one new job to the HPC cluster. On the development cluster this works fine: jobs are able to connect to the head node and kick off new jobs.
However, I am running into the following issue when attempting to connect to the production cluster from a running job:
unhandled exception: microsoft.hpc.scheduler.properties.schedulerexception: could not register with the server. try again later.
   at microsoft.hpc.scheduler.store.storeserver._connect()
   at microsoft.hpc.scheduler.store.storeserver.connect(string server, int32 port)
   at microsoft.hpc.scheduler.store.schedulerstoresvc..ctor(string server, int32 port)
   at microsoft.hpc.scheduler.scheduler.connect(string cluster)
This happens on every connection attempt made from a compute node, so following the 'try again later' advice does not help. However, if the same job is run directly on the head node, it is able to connect to the cluster. Any idea what might be going on?
Thursday, June 30, 2011 1:00 PM
The issue is with the way the compute nodes are configured in the production cluster. Since they are on their own private network, they are unable to resolve the FQDN of the head node on the external network. A quick test confirmed this: when the HPC tasks connect to the head node's "internal" network IP instead of its name to schedule other tasks, the problem goes away.
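For anyone who wants to confirm the same diagnosis on their own cluster, here is a minimal sketch of a check that could be run from a compute node. It only uses Python's standard socket module; the function names and the port argument are my own illustration and are not part of the HPC API, so pass in whatever host name and scheduler port your cluster actually uses.

```python
import socket


def resolve_ipv4(hostname):
    """Return the sorted IPv4 addresses a name resolves to, or [] if lookup fails."""
    try:
        infos = socket.getaddrinfo(hostname, None, socket.AF_INET, socket.SOCK_STREAM)
    except socket.gaierror:
        return []
    return sorted({info[4][0] for info in infos})


def can_connect(address, port, timeout=3.0):
    """Return True if a TCP connection to (address, port) succeeds within the timeout."""
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        return False


def diagnose(hostname, port):
    """Classify the name as 'unresolved', 'unreachable', or 'reachable' from this machine."""
    addresses = resolve_ipv4(hostname)
    if not addresses:
        return "unresolved"
    if any(can_connect(address, port) for address in addresses):
        return "reachable"
    return "unreachable"
```

Running diagnose once with the head node's FQDN and once with its internal IP should show where the two differ: a result of "unresolved" or "unreachable" for the FQDN alongside "reachable" for the internal IP would match what I saw.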
My guess is that the correct fix is to add a DNS entry on the compute node network so that the FQDN of the head node resolves to its internal private network IP rather than its external network IP.
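If a DNS entry like that is added, a quick way to verify it from a compute node might look like the sketch below. The helper name is hypothetical, and the host name and address in the usage comment are placeholders rather than values from my cluster.

```python
import socket


def resolves_to(hostname, expected_ip):
    """Return True if expected_ip is among the IPv4 addresses the name resolves to."""
    try:
        infos = socket.getaddrinfo(hostname, None, socket.AF_INET)
    except socket.gaierror:
        # The name does not resolve at all from this machine.
        return False
    return expected_ip in {info[4][0] for info in infos}


# Example usage with placeholder values (substitute your own):
# resolves_to("headnode.example.com", "10.0.0.1")
```

If this returns False on a compute node, the node is still seeing the external address or no address at all, and jobs connecting by name would presumably keep failing.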