Can no longer run Head Node diagnostics: Internal exception happen when deal with run: Logon failure: unknown user name or bad password

    Question

  • I find that all the diagnostics that were working last week are no longer working on my head node. When I run them, State immediately goes to "Failed To Run" and the Message is "Internal exception happen when deal with run: Logon failure: unknown user name or bad password"

    To which username/password is this message referring, please?

    06/Sha'ban/1432 05:39 PM

All replies

  • Here's the full trace from Event Viewer:

    + System
    - Provider
    [ Name] Microsoft-HPC-Diagnostics
    [ Guid] {5FD636C3-5ADD-4564-93A3-C8BAB2527FCA}
    EventID 7
    Version 0
    Level 2
    Task 0
    Opcode 0
    Keywords 0x2000000000000000
    - TimeCreated
    [ SystemTime] 2011-07-07T20:42:35.266159600Z
    EventRecordID 49
    Correlation
    - Execution
    [ ProcessID] 1136
    [ ThreadID] 1220
    Channel Windows HPC Server
    Computer XXXX
    - Security
    [ UserID] S-1-5-18
    - EventData
    Message Logon failure: unknown user name or bad password.
    ExceptionString Exception detail: System.Security.SecurityException: Logon failure: unknown user name or bad password.
        at System.Security.Principal.WindowsIdentity.KerbS4ULogon(String upn)
        at System.Security.Principal.WindowsIdentity..ctor(String sUserPrincipalName, String type)
        at System.Security.Principal.WindowsIdentity..ctor(String sUserPrincipalName)
        at Microsoft.Hpc.Diagnostics.Controller.Utilities.CreateJob(ISchedulerStore store, String requestedBy, StoreProperty[] jobProps)
        at Microsoft.Hpc.Diagnostics.Controller.SubmittedTestHandler.StartPreTask(DiagnosticTestRun testRun, DiagnosticTest testDef)
        at Microsoft.Hpc.Diagnostics.Controller.SubmittedTestHandler.ExecuteInternal(DiagnosticTestRun testRun)
        at Microsoft.Hpc.Diagnostics.Controller.StateHandlerBase.Execute()
    The Zone of the assembly that failed was: MyComputer
    Current stack:
        at Microsoft.Hpc.Diagnostics.DiagnosticsTracing.TraceException(String facility, Exception exception)
        at Microsoft.Hpc.Diagnostics.Controller.StateHandlerBase.Execute()
        at Microsoft.Hpc.Diagnostics.Controller.DiagnosticsController.RunStateHandlers(Object o)
        at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
        at System.Threading._ThreadPoolWaitCallback.PerformWaitCallbackInternal(_ThreadPoolWaitCallback tpWaitCallBack)
        at System.Threading._ThreadPoolWaitCallback.PerformWaitCallback(Object state)
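
    Because Event Viewer flattens the trace into a single line, it can be hard to see where the exception starts. The following is a minimal Python sketch, not part of HPC Pack, that pulls the method names out of such a trace. The names FRAME, frames, first_frame_under and SAMPLE are hypothetical, and SAMPLE is only the first four frames of the trace above.

```python
import re

# Matches ".NET style" frames such as "at Namespace.Type.Method(" and
# captures the fully qualified method name (hypothetical helper pattern).
FRAME = re.compile(r"\bat ([\w.]+)\(")


def frames(trace):
    """Return the method names of all stack frames, innermost first."""
    return FRAME.findall(trace)


def first_frame_under(trace, namespace):
    """Return the innermost frame whose name starts with `namespace`, or None."""
    for name in frames(trace):
        if name.startswith(namespace):
            return name
    return None


# First four frames of the trace posted above, still flattened on one line.
SAMPLE = (
    "System.Security.SecurityException: Logon failure: unknown user name or bad password. "
    "at System.Security.Principal.WindowsIdentity.KerbS4ULogon(String upn) "
    "at System.Security.Principal.WindowsIdentity..ctor(String sUserPrincipalName, String type) "
    "at System.Security.Principal.WindowsIdentity..ctor(String sUserPrincipalName) "
    "at Microsoft.Hpc.Diagnostics.Controller.Utilities.CreateJob("
    "ISchedulerStore store, String requestedBy, StoreProperty[] jobProps)"
)

innermost = frames(SAMPLE)[0]
hpc_caller = first_frame_under(SAMPLE, "Microsoft.Hpc.")
```

    On this sample, the innermost frame is WindowsIdentity.KerbS4ULogon and the first HPC frame is Utilities.CreateJob, so the logon failure is raised while the diagnostics service builds a Windows identity from a user principal name during job creation.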

    06/Sha'ban/1432 08:42 PM
  • Hi

    I'd suggest re-entering your diagnostics credentials, since the cached ones might have expired or been changed.

    To do so, open the UI console, select the Diagnostics pane, then Options from the top menu, and choose 'Clear Diagnostic Test credentials'. Then try to rerun a test; you should be prompted for credentials on the first run.

    14/Sha'ban/1432 07:19 PM
  • Yep, already tried that, with several different admin credentials - no luck.
    14/Sha'ban/1432 08:09 PM
  • I should add that although I am still not able to successfully run ANY of the diagnostics, the EchoService and my own test service seem to operate correctly.
    17/Sha'ban/1432 12:36 PM
  • Could you try to set the password from PowerShell (PSH)? First clear it in the UI, then open the PSH window and type: Set-HpcTestCredential

    Please make sure the admin is a domain user and that you can connect to the domain controller (DC).

    19/Sha'ban/1432 06:33 PM
  • When I run that cmdlet I get:




    Set-HpcTestCredential : User ***\**** doesn't has permission to access.
    At line:1 char:22
    + set-hpctestcredential <<<<
        + CategoryInfo          : NotSpecified: (Microsoft.Compu...tTestCredential
       :SetTestCredential) [Set-HpcTestCredential], DiagnosticException
        + FullyQualifiedErrorId : Microsoft.ComputeCluster.CCPPSH.SetTestCredential




    The user in question is a domain user, is signed into the HPC Head Node (so I assume the connection with the DC is OK), and is an Administrator of the HPC Head Node. Is there anything else required just to run diagnostics?


    21/Sha'ban/1432 12:40 PM
  • By saying "is an Administrator of the HPC Head Node", do you mean he is added to the Administrators group (Configuration Pane/Users will show you the list), or that he is added on the head node only? The user has to be added there as an HPC admin user.

    Please note that the user who runs "Set-HpcTestCredential" has to be an admin as well.

    The error message suggests you are trying to run the command to set the password as a non-admin user.

    There was also a problem prior to SP2 when running on a non-English version (German); is that your case?

    04/Ramadan/1432 04:59 PM
  • The user appears under HPC Cluster Manager->Configuration->Users as Role "Administrator", and also under Start Menu->Administrative Tools->Computer Management->Local Users and Groups->Administrators.

    [I'm guessing these are the same thing anyway]

    04/Ramadan/1432 05:24 PM
  • I'm running into this exact problem. If anyone has any ideas, I would really appreciate it.
    23/Ramadan/1432 10:14 PM
  • Do you have the same username as both a domain account and a local account on the head node (HN)? In the stack trace above (earlier posts) there is a message about the wrong Zone, so perhaps the name resolved to the local account first and couldn't continue.

    Can you perform other administrative operations using this user? E.g. try "Set-HpcNodeState" to change a node online/offline. Try doing so from a remote computer, logged in as the user you are trying to use for diagnostics (use the -Scheduler option in the cmdlet). If it succeeds, you will be sure the user is a cluster admin.

    24/Ramadan/1432 04:23 PM
  • In my case I have no local users except Administrator and Guest (disabled) on the head node - I'm doing everything via my domain account. To what "Zone" message are you referring?
    24/Ramadan/1432 05:10 PM
  • Set-HpcNodeState (when run from an elevated PowerShell) works just fine, both locally and remotely.
    24/Ramadan/1432 06:41 PM
  • Just to make sure: when you cleared the password (Remove-HpcTestCredential) and set it again (Set-HpcTestCredential), did you run it from an elevated PowerShell (PSH)? Does clearing the credential succeed?

    Please try this: restart the HpcDiagnostics service, clear the password, then start the test, and you should see the prompt for the password in the UI.

    If it doesn't work, then something bad must have happened that I find hard to imagine. There was another thread about the same issue that you may want to check (http://social.microsoft.com/Forums/en-US/windowshpcitpros/thread/3f35bb97-29a0-442e-91d0-2b5f467b502d)

    25/Ramadan/1432 05:44 PM
  • Well, that didn't work.

    I can clear the credentials from an elevated powershell, and set them again (from the powershell and from the UI prompt) without error, but the diagnostics still give the same error.

    Whatever the "something bad" is, it's happening consistently on my network -- I've just installed two more clusters from bare metal and I have identical problems with those.



    25/Ramadan/1432 07:31 PM
  • I looked deeper and it seems we were looking in the wrong place. The user/password you provide is not in effect yet at this stage (and clearly the credentials are OK). The service uses the user shown as "Submit By:" to create the scheduler jobs that run the test.

    We need to inspect this a little deeper as this is the source of the problem. First identify this user:

    - start a test

    - go to the results and select failed test

    - go to the 2nd tab (Test details) and check who is: "Submit By:"

    Now please check whether it has the domain\user form or is simply user, and that there are no other '\' characters. There shouldn't be any.

    For some reason, System can't create such an identity. System can act as anyone, so this must be a domain configuration problem. If it happens to be a local user, try using a domain user. This user has to be an admin as well.

    If all this is still true, there is also a possibility that a profile on the HN got into a bad state; the easiest thing to try would be to log out and use a different admin to delete the profile so that it can be recreated.

    One symptom of a bad profile is that you will not be able to remote desktop to other computers from this HN.
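
    The format check on the "Submit By:" value can be sketched in a few lines of Python. This is only an illustration under the assumption that a well-formed value is DOMAIN\user with exactly one backslash and non-empty parts; the function name classify_submit_by and the CONTOSO\alice example are hypothetical and not part of HPC Pack.

```python
def classify_submit_by(value: str) -> str:
    """Classify a "Submit By:" string by its shape only.

    Assumption: a well-formed value is DOMAIN\\user with exactly one
    backslash. This does not check that the account actually exists.
    """
    value = value.strip()
    if not value:
        return "empty"
    backslashes = value.count("\\")
    if backslashes == 0:
        # Plain "user" with no domain part; may resolve to a local account.
        return "bare-user"
    if backslashes > 1:
        return "malformed"
    domain, user = value.split("\\")
    if not domain or not user:
        return "malformed"
    return "domain-user"


# Hypothetical example value, not taken from the thread.
example = classify_submit_by("CONTOSO\\alice")
```

    A result other than "domain-user" would point at the naming problems described above; a "domain-user" result only says the string is well formed, not that the domain can resolve it.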

    26/Ramadan/1432 10:22 PM
  • This is interesting because I'm running into this exact same problem, and I see the correct domain\user in the "Submit By" field when I look at the failed diagnostics. Other than the failed diagnostics, I see no other problems. I can add nodes, reimage nodes, etc., create and submit jobs, and remote desktop into all compute nodes using my domain account.

    For what it's worth, are there any security policies that could interfere with running diagnostics? I attached my head node to our campus domain, and a number of policies were pushed down by the domain controller. When I experimented with the HPC Server software, I started with an isolated dummy domain and had no problems with the diagnostics. Now that I am on our campus domain, the diagnostics don't work.

    Any ideas? I've followed the rest of this thread and my results are the same as wbradney's.

    27/Ramadan/1432 10:48 PM
  • During my original prototyping, I too was running in a "personal" domain, and was able to run the diagnostics. I'm only having these problems when testing in my company's production domain.
    28/Ramadan/1432 12:32 AM
  • Well, domain configuration is a huge topic and it's hard to think of all the different configurations you might have. From what has been said so far, it looks like you could ask your domain administrator about configuration/restrictions for users. I'd be happy to hear whether certain policies turn out to produce these results.

    29/Ramadan/1432 06:49 PM
  • As you said, domain configuration is a huge topic. Before I can productively engage my domain administrator to start troubleshooting whether or not any of the policies are interfering with the diagnostics, it would help to know how the diagnostics service works. Does the HPC Pack use any special mechanisms to launch the diagnostics? Is there anything special in the way that the service is using the credentials to authenticate and launch the diagnostic processes? Does anyone know how these mechanisms work?

    29/Ramadan/1432 09:15 PM
  • Is this significant? Look at the error message above:

    - Security
    [ UserID] S-1-5-18

    This is the system profile, not the user supplied in the diagnostic prompt.
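
    For reference, S-1-5-18 is the well-known SID of the Local System account, which is what one would expect for an event raised by a Windows service. A minimal Python sketch of that lookup follows; the table is a small illustrative subset and the names WELL_KNOWN_SIDS and describe_sid are hypothetical.

```python
# Illustrative subset of well-known Windows service-account SIDs.
WELL_KNOWN_SIDS = {
    "S-1-5-18": "Local System",
    "S-1-5-19": "Local Service",
    "S-1-5-20": "Network Service",
}


def describe_sid(sid: str) -> str:
    """Give a rough description of a SID string from an event record."""
    sid = sid.strip().upper()
    if sid in WELL_KNOWN_SIDS:
        return WELL_KNOWN_SIDS[sid]
    # SIDs for ordinary domain or local machine accounts start with S-1-5-21-.
    if sid.startswith("S-1-5-21-"):
        return "domain or local machine account"
    return "unknown"


print(describe_sid("S-1-5-18"))  # prints: Local System
```

    So the UserID in the event identifies the account the diagnostics service runs under, not the submitting user or the stored test credentials.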

     

    01/Shawwal/1432 02:04 PM
  • There is nothing exceptional there. The service uses a standard API. Let me give you an overview:

    - when a user submits a test to run, the service impersonates this user to create the admin jobs [pre/run/post]

    - when the job is run (by the scheduler), the credentials stored as diagnostics credentials are used to run the job

    The first step, where the service (system) uses the submitter's credentials to create the jobs, is the problem here. The next step is not even reached, but I wouldn't anticipate any issues there, since you say you can run any other jobs and this is the same mechanism in use.

    It is quite uncommon for this to fail (actually only you two report it) and I was unable to reproduce the configuration. There must be something special in your configuration and it's tricky to pinpoint it.

    01/Shawwal/1432 04:08 PM
  • That's expected - the system (service) attempts to perform the operation and fails. Normally system profile can do anything so perhaps some policy is overriding the settings.
    01/Shawwal/1432 04:10 PM
  • After some experimentation with our sys admin, it appears that the only way we can find to submit diagnostics is to be logged in as, and running HPC Cluster Manager as, the "god" domain administrator account. Even when we created a new account and added it to the Domain Admins group, when logged into the head node as that user we could not submit diagnostics.

    Note that it doesn't seem to matter what the "test credentials" are set to in HPC - you have to be logged into the node as a super user to submit the tests.

    We're now trying to figure out what the differences are between the "god" account and the account with domain admin rights. It sure would be a help if we had documentation on exactly what domain privileges are required to submit diagnostics.


    03/Shawwal/1432 04:16 PM
  • We ran into exactly the same situation.
    What is even weirder is that after reinstalling the HPC Pack and setting up new, clean databases, the problem appeared again.

    It crosses my mind that it may have something to do with the databases and permissions.

     

    We are running HPC Pack 2008 SP3.

    Anybody found a solution?

    12/Safar/1433 02:25 PM
  • I just installed SP2 on HPC and am having the same issue. The problem is caused by SP2 (and probably carries over to SP3, since people with SP3 are having the same issue). We did not have this issue before, while running SP1. Here is what we have found:

    It only occurs with accounts from a trusted domain. When I enter a trusted-domain account in the diagnostic test credentials box, it says unknown user name or bad password. (Yes, the user is in the local Administrators group and is therefore an HPC administrator as well.) When I enter an account from the domain that the head node is a member of (the native domain), it accepts the user name and password, BUT the tests still fail IF you logged in to the server with the trusted-domain account (that is how I caught this problem). The reason is that the tests are not run with the credentials you provide in the credentials box, but with the credentials you are logged in with! So basically it ignores the credentials you enter in the test credentials box when it comes to actually running the tests. It just accepts them (if they are in the native domain) and then runs the tests using the account you logged in with anyway...

    The event viewer in the HPC logs generates the following error after the test fails (when you use the trusted-domain ID). We checked everything related to AD and everything is fine.

    "An unexpected exception occurred. For more information about this exception, see the Details tab.

    Additional data:

    The security database on the server does not have a computer account for this workstation trust relationship."

     

    15/Safar/1433 04:07 PM
  • That corresponds with what I've seen. In our AD forest, we have a separate resource domain that all machines live in. All users are in one of the other domains by different organizations on campus. I do not have anything to do with the AD administration, but I assume they have a trust relationship with each other. I wonder if this would help someone from Microsoft better replicate this and see if there is a bug or a workaround.

    26/Safar/1433 10:40 PM