2016U1 REST API Hangs

    Question

  • I'm testing out the new JSON REST API. Unfortunately I think my install is broken, as the API constantly hangs and eventually returns a 500 error. After restarting the service it works for a little while before hanging again.

    Are there any logs I can check to try and figure out what's going on?

    I'm resigned to having to reinstall the Head Node. However, before going down that route I'd like to see if there's any way to fix it, and to do so I'd at a minimum need to see the logs. Any other suggestions for how to debug the problem will be gladly accepted!

    Thanks,

    Dave

    Thursday, 25 January 2018 11:57 AM

All replies

  • Hi Dave,

    The log of the REST service is located at %CCP_LOGROOT_SYS%Scheduler\HpcWebService_*.bin

    You can get the LogViewer app at https://hpconlineservice.blob.core.windows.net/logviewer/LogViewer.UI.application
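
A minimal sketch for locating those log files from Python, assuming the `CCP_LOGROOT_SYS` environment variable is set on the head node as the path above implies. The function name `find_web_service_logs` is hypothetical, and the `.bin` files still need the LogViewer app to be read.

```python
import glob
import os


def find_web_service_logs(log_root=None):
    """Return HpcWebService_*.bin paths under <log_root>/Scheduler, newest first."""
    # Assumption: CCP_LOGROOT_SYS points at the system log root, as in the path above.
    if log_root is None:
        log_root = os.environ.get("CCP_LOGROOT_SYS", "")
    pattern = os.path.join(log_root, "Scheduler", "HpcWebService_*.bin")
    # Sort by modification time so the most recent log comes first.
    return sorted(glob.glob(pattern), key=os.path.getmtime, reverse=True)
```

Calling `find_web_service_logs()` with no argument reads the environment variable; passing a directory explicitly is convenient when the logs have been copied elsewhere.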

    Thanks,
    Zihao

    Tuesday, 30 January 2018 2:12 AM
  • If the problem persists, please paste your code samples (or send them to us at hpcpack@microsoft.com) and we can take a look.

    Qiufang Shi

    Tuesday, 30 January 2018 5:37 AM
  • Thanks for that!

    I restarted the service and submitted a job (which worked), then tried to cancel it. The cancel request hung and returned the error below:

    HTTPError: 500 Server Error: Internal Server Error for url: https://hpc.server/hpc/jobs/79/Cancel
    The logs are available at:

    https://gist.github.com/dhirschfeld/231827bd26f008ccbcfe3362ccac5237

    Doesn't mean much to me, but maybe someone can spot the problem?

    -Dave



    • Edited by dhirschfeld Tuesday, 30 January 2018 5:42 AM
    Tuesday, 30 January 2018 5:41 AM
  • Hi Dave,

    Please collect the related HpcWebService logs and send them to hpcpack@microsoft.com. We'll investigate them.

    Thanks,
    Zihao

    Tuesday, 30 January 2018 7:03 AM
  • After reinstalling HPC Pack 2016U1 on a new machine I'm still observing the JSON REST API hanging.

    After restarting the HPC Web Service the first attempt to run a job succeeds and any subsequent attempts return a 500 error after a couple of minutes:

    HTTPError: 500 Server Error: Internal Server Error for url: https://headnode/hpc/jobs

    The JSON data I’m using to test is as shown below:

    https://headnode/hpc/jobs: [{'Name': 'Name', 'Value': 'Test'}]

    https://headnode/hpc/jobs/20/tasks: [

        {'Name': 'Name', 'Value': 'echo'},

        {'Name': 'CommandLine', 'Value': 'echo %COMPUTERNAME%'}

    ]
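
The same payloads can be built from plain dictionaries. A minimal sketch, assuming only what the post shows: each request body is a JSON list of `Name`/`Value` pairs. The helper name `to_properties` is hypothetical and not part of any HPC Pack client library.

```python
import json


def to_properties(fields):
    """Convert {'Name': 'Test'} into [{'Name': 'Name', 'Value': 'Test'}]."""
    return [{"Name": name, "Value": value} for name, value in fields.items()]


# Body sent to https://headnode/hpc/jobs in the post above.
job_body = json.dumps(to_properties({"Name": "Test"}))

# Body sent to https://headnode/hpc/jobs/20/tasks in the post above.
task_body = json.dumps(
    to_properties({"Name": "echo", "CommandLine": "echo %COMPUTERNAME%"})
)
```

Serializing with `json.dumps` rather than writing the body by hand ensures the request carries double-quoted JSON, not a Python literal with single quotes.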


    I've just now emailed the logs in an email entitled "REST API Hangs". Please let me know if there is any further information you need to debug the problem. Any help tracking down the problem would be greatly appreciated!

    Thanks,

    Dave


    Tuesday, 20 March 2018 2:42 AM
  • Hi Dave,

    We cannot reproduce your issue locally. Can you connect to the rest server using our sample code? e.g.

    HttpClient httpClient = new HttpClient(new HttpClientHandler(){UseDefaultCredentials = true, ServerCertificateCustomValidationCallback = (a, b, c, d) => true });
    var response = await httpClient.GetAsync("https://headnode" + "/hpc/jobs");

    Also, the log you sent shows a lot of reconnect-to-scheduler events, such as:

    [WcfProxy] Channel to net.tcp://dc2thhn02:5802/SchedulerStoreServiceInternal faulted.

    Does HpcScheduler service in your cluster work? Are any other clients having difficulty connecting to the HpcScheduler service?

    Thanks,
    Zihao


    Tuesday, 20 March 2018 6:01 AM
  • Unfortunately I'm observing the same thing in C# - the first call to the service works correctly and subsequent calls hang until I restart the service.

    Does HpcScheduler service in your cluster work?

    I'm not sure how to test that other than by calling the web api? The Cluster Manager application seems to work fine and lists all the jobs.
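
One low-level check that does not go through the web API is whether the scheduler's TCP endpoint accepts connections at all. A minimal sketch, assuming the port 5802 seen in the `net.tcp://dc2thhn02:5802/...` log line quoted earlier in the thread; the function name `can_connect` is hypothetical. A successful connect only shows that something is listening on the port, not that the HpcScheduler service is healthy.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


# Example (host and port taken from the log line in the thread):
#     can_connect("dc2thhn02", 5802)
```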

    Wondering if this can be some bad interaction with the HpcPortal which I also enabled on 443?

    My next avenue will be to test the xml interface to see if that also has the same problem. Will report back...

    My C# test code is below:

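    using System.Net;
    using System.Net.Http;
    using System.Net.Http.Headers;
    using System.Net.Security;
    using System.Security.Cryptography.X509Certificates;
    using System.Threading.Tasks;
    using NUnit.Framework;  // assumed from the [TestFixture]/[Test] attributes

    namespace TestHPC {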
    
        [TestFixture]
        public class TestHPC {
    
            public static bool DisableTrustChainEnforecment(object obj, X509Certificate cert, X509Chain chain, SslPolicyErrors err) {
                return true;
            }
    
            [Test]
            public async Task TestAPI() {
                ServicePointManager.ServerCertificateValidationCallback = DisableTrustChainEnforecment;
    
                string result;
                var handler = new HttpClientHandler() {UseDefaultCredentials = true};
                using (var client = new HttpClient(handler)) {
                    client.DefaultRequestHeaders.Add("api-version", "2016-11-01.5.0");
                    client.DefaultRequestHeaders.Accept.Add(
                        new MediaTypeWithQualityHeaderValue("application/json")
                    );
                    
                    var baseURL = "https://headnode";
                    var response = await client.GetAsync(baseURL + "/hpc/jobs");
                    response.EnsureSuccessStatusCode();
                    result = await response.Content.ReadAsStringAsync();
                }
    
            }
        }
    }

    Tuesday, 20 March 2018 6:49 AM
  • Hi Dave,

    Web portal and web api should work fine together.

    Please do tell us if there is any finding in the test of xml interface.

    Thanks,
    Zihao

    Tuesday, 20 March 2018 7:14 AM
  • After some further testing it appears that the xml endpoint /WindowsHPC/Jobs works correctly and doesn't hang.

    The xml endpoint /WindowsHPC/Jobs can only be connected to with basic authentication, whereas the json endpoint /hpc/jobs can only be connected to with negotiate (SSPI) auth. I therefore suspect there's some problem with the authentication in the HPC Web Service which causes it to hang.
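
One way to confirm which schemes each endpoint offers is to send an unauthenticated request and read the `WWW-Authenticate` headers on the 401 response. A minimal sketch using only the standard library; the function names are hypothetical, and it assumes the server answers unauthenticated requests with 401 rather than hanging. Certificate verification is disabled here only because the C# test code in this thread does the same.

```python
import ssl
import urllib.error
import urllib.request


def scheme_names(header_values):
    """Extract the scheme token from each WWW-Authenticate header value."""
    # Assumption: each challenge arrives in its own header line; several
    # challenges packed into one comma-separated header are not split here.
    return sorted({v.split(None, 1)[0].lower() for v in header_values if v.strip()})


def offered_schemes(url, timeout=10.0):
    """Return the auth schemes advertised for url, or [] if no 401 is returned."""
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    try:
        with urllib.request.urlopen(url, timeout=timeout, context=context):
            return []
    except urllib.error.HTTPError as err:
        if err.code == 401:
            return scheme_names(err.headers.get_all("WWW-Authenticate") or [])
        raise
```

Comparing `offered_schemes("https://headnode/hpc/jobs")` with `offered_schemes("https://headnode/WindowsHPC/Jobs")` would show whether the two endpoints really differ as described.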

    Is there some way to enable basic authentication for the json endpoint so I can proceed with that? If that does in fact work correctly on my system then we will have at least pin-pointed the underlying issue. 

    If anyone has any suggestion of further testing I can do - please let me know!  Thanks, Dave


    • Edited by dhirschfeld Wednesday, 21 March 2018 5:47 AM
    Wednesday, 21 March 2018 5:44 AM
  • The xml test code:

            [Test]
            public async Task TestXML() {
                ServicePointManager.ServerCertificateValidationCallback = DisableTrustChainEnforecment;
    
                string result;
                var handler = new HttpClientHandler() {UseDefaultCredentials = true};
                using (var client = new HttpClient(handler)) {
                    client.DefaultRequestHeaders.Add("api-version", "2016-11-01.5.0");
                    client.DefaultRequestHeaders.Accept.Add(
                        new MediaTypeWithQualityHeaderValue("application/xml")                    
                    );
                    var username = @"DOMAIN\user";
                    var password = "******";
                    var encoded = Convert.ToBase64String(Encoding.ASCII.GetBytes($"{username}:{password}"));
                    var auth = new AuthenticationHeaderValue("Basic", encoded);
                    client.DefaultRequestHeaders.Authorization = auth;
    
                    var baseURL = "https://headnode";
                    var response = await client.GetAsync(baseURL + "/WindowsHPC/Jobs");
                    response.EnsureSuccessStatusCode();
                    result = await response.Content.ReadAsStringAsync();
                }
    
            }



    • Edited by dhirschfeld Wednesday, 21 March 2018 5:55 AM
    Wednesday, 21 March 2018 5:55 AM
  • Hi Dave,

    For now the json endpoint only supports NTLM/negotiate and AAD authentication. There is no basic authentication for it.

    It is very likely that the authentication part is causing your issue. If you are willing, we can provide some private bits that enable more logging to trace the issue down in your environment.

    Thanks,
    Zihao

    Wednesday, 21 March 2018 6:17 AM
  • Hi Zihao,

    I'd be happy to help track down the problem - feel free to contact me by email to let me know what needs to be done.

    Could I also request that basic authentication be enabled for the JSON endpoint too? It's useful to be able to quickly and easily test the api in the browser as well as developing with clients which don't support SSPI auth - e.g. the VS Code REST Client or recently in Python there was a bug in the SSPI library on py36.

    Thanks,

    Dave

    Wednesday, 21 March 2018 11:15 PM
  • Hi Dave,

    Thank you for your assistance. I've sent you an email about how to track the issue down. We will also investigate adding basic authentication to the new endpoint soon.

    Thanks,
    Zihao

    Thursday, 22 March 2018 6:06 AM
  • Hi Dave,

    We have found the potential cause. It is fixed in the latest QFE for Update 1.

    Please get the QFE from the Download Center and check if your issue is resolved.

    Thanks,
    Zihao

    Tuesday, 29 May 2018 2:22 AM
  • Thanks for the update Zihao! I'll definitely test and report back any findings. It likely won't be until next week though given my current workload...
    Tuesday, 29 May 2018 2:30 AM
  • Unfortunately, after updating to the latest version I'm still seeing the web service hanging. Have emailed further details.

    Happy to debug further if there is any more I can do on my side...

    Cheers,

    Dave

    Thursday, 21 June 2018 3:21 AM