Deleted original Enterprise adapter, and now cannot reconfigure cluster to fix it

  • Question

  • Hi,

    I've made a big mess of our HPC Cluster, and I don't know how to fix it.

    We decided to upgrade the Enterprise connection from 1 Gb to 10 Gb. It was on an adapter named 'Enterprise', so to keep things consistent I decided to give the new connection (on a new card) the same name.

    So we renamed 'Enterprise' to 'Enterprise.old', put the new card in and called it 'Enterprise'. The cluster stopped working (we couldn't even run clusrun commands). In the course of trying to dig ourselves out of this mess, we deleted 'Enterprise.old'. Now we are very stuck, because we can't reconfigure the networking :( The cluster still thinks it should only talk to the old 1 Gb adapter and does not let me change it ...

    I managed to change the topology to 'Enterprise network only' because that did not involve changing the Enterprise adapter. My thinking was that I could then assign the Private adapter as the Enterprise adapter (thus getting rid of the old Enterprise adapter), and then change it all back to how it should be configured. But it errors out, complaining that it can't find the adapter in WMI (probably because, even though I recreated the old Enterprise setup, the adapter now has a different GUID), and refuses to deconfigure it. I think we are now well and truly stuck, as I can neither roll back to the old adapter nor move forward to the new one. Short of restoring the head node from backup, I can't think of a way to fix this.
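
    To illustrate the GUID suspicion above, here is a minimal sketch. It is not HPC Pack code and makes no claim about how the cluster really stores its network configuration. It only assumes that the configuration remembers the adapter's interface GUID rather than its display name. The `Adapter` class, the helper functions, and all GUIDs are hypothetical.

```python
import uuid
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Adapter:
    """Hypothetical model of a network adapter as the OS reports it."""
    name: str  # display name, e.g. 'Enterprise'
    guid: str  # interface GUID, assigned when the adapter is created


def find_by_name(adapters: List[Adapter], name: str) -> Optional[Adapter]:
    """Return the first adapter with the given display name, if any."""
    return next((a for a in adapters if a.name == name), None)


def find_by_guid(adapters: List[Adapter], guid: str) -> Optional[Adapter]:
    """Return the adapter with the given GUID, if any."""
    return next((a for a in adapters if a.guid == guid), None)


# The original 1 Gb adapter; the cluster is ASSUMED to store its GUID.
old_enterprise = Adapter("Enterprise", str(uuid.uuid4()))
stored_guid = old_enterprise.guid

# The old adapter was deleted; a new one reuses the name but gets a new GUID.
current_adapters = [
    Adapter("Enterprise", str(uuid.uuid4())),
    Adapter("Private", str(uuid.uuid4())),
]

name_match = find_by_name(current_adapters, "Enterprise")
guid_match = find_by_guid(current_adapters, stored_guid)

print("Found by name:", name_match is not None)
print("Found by stored GUID:", guid_match is not None)
```

    If that assumption holds, giving a new adapter the old name would not be enough: a lookup by name succeeds, but a lookup by the stored GUID finds nothing, which would match the "can't find the adapter" error.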

    Is there a way...?

    I hope someone out there can help me get the cluster back online this weekend.


    Saturday, October 11, 2014 9:48 PM

All replies

  • So that we can reproduce the issue locally on our test machines and analyze it, I want to confirm the steps you took:

    1. The cluster was in the Enterprise + Private topology.

    2. You added 10G adapters to all the nodes in the cluster.

    3. On HN, you renamed the 1G adapter from Enterprise to Enterprise.old and gave the name Enterprise to the 10G adapter.

    4. The cluster stopped working.

    5. On HN, you deleted Enterprise.old.

    6. You changed the topology to Enterprise Only.

    7. On HN, you renamed Enterprise to Private and then renamed it back.

    And here are several questions:

    1. Were Enterprise and Enterprise.old located in the same subnet?

    2. Did you see any error messages in Cluster Manager? No matter where and when you saw them, please share the details.

    3. Did you perform renaming/deleting adapters only on HN or on all the nodes in the cluster?

    4. Where did you see the WMI error?

    5. Ideally, you only want to upgrade every node's Enterprise adapter from 1G to 10G, right? Do you plan to remove the 1G adapters?

    Monday, October 13, 2014 3:06 AM