Workbook offloading throws "unreadable content" when the number of cores is increased to 20 or more

  • Question

  • I'm working on a project that is migrating some workbooks onto Windows Server 2008 R2 HPC Enterprise edition with SP3. I use workbook offloading mode because most of the calculation logic lives in the sheets, cells, and links. Some logic is defined in Excel add-ins (.xla files), but it is not complicated, so I decided NOT to move it into an XLL for UDF offloading.

    The calculation is distributed at the workbook level: we fill in or change some data in the workbook as a request, run it on the cluster, and retrieve the results. The calculation requires thousands of such requests, so I think distributing at the workbook level is the best choice.

    Everything worked fine with 6 or 8 cores. To run some tests, I increased the number of cores by raising SubscribedCores/SubscribedSockets, which gave me 32 cores. The Excel job submitted to the cluster worked fine at the beginning of its lifecycle, but after some results had been retrieved (in the calculation responses) the calculation seemed to stop, and after a while (a minute or two) an error was thrown from the server side:

    "System.IO.IOException: Found unreadable content in the workbook. Verify that \\headnodemachine\shared\folder\CalcWorkbook.xlsb can be opened manually. --> System.Runtime.InteropServices.COMException (0x800A03EC)"

    Looking at the logs in Event Viewer under Microsoft\HPC\Excel\HPC Excel Admin Events, I found many errors such as:

    Error while starting Excel process: System.Runtime.InteropServices.COMException (0x80080005): Retrieving the COM class factory for component with CLSID {00024500-0000-0000-C000-000000000046} failed due to the following error: 80080005.
       at Microsoft.Hpc.Excel.ExcelDriver.LaunchExcelProcess()
       at Microsoft.Hpc.Excel.ExcelDriver.OpenWorkbook(String filePath, Boolean updateLinks, String password, String writeResPassword, Nullable`1 lastSaveDate)

    Session 114, Task 1 - ExcelService failed to handle a request on RTG23REM09. The failure was caused by a configuration, installation, or resource accessibility problem: System.Runtime.InteropServices.COMException (0x80080005): Retrieving the COM class factory for component with CLSID {00024500-0000-0000-C000-000000000046} failed due to the following error: 80080005.
       at Microsoft.Hpc.Excel.ExcelDriver.OpenWorkbook(String filePath, Boolean updateLinks, String password, String writeResPassword, Nullable`1 lastSaveDate)
       at Microsoft.Hpc.Excel.ExcelDriver.OpenWorkbook(String filePath, Nullable`1 lastSaveDate)
       at Microsoft.Hpc.Excel.ExcelService.OpenWorkbook(String workbookPath, Nullable`1 lastSaveDate)
       at Microsoft.Hpc.Excel.ExcelService.Calculate(String macroName, Byte[] inputs, Nullable`1 lastSaveDate)

    I googled the above error, and most articles said it was a security issue that setting permissions in Component Services would fix. But that doesn't sound like the issue I encountered (and am still encountering), because some of the calculation requests were handled and their responses/results were returned. For every job using 30 or 60 cores, 80~90% of the calc requests were actually handled.
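To see how the two error codes split across a log export, a small tally can help. The following is a minimal Python sketch, not part of HPC Pack: the function name, the `KNOWN_HRESULTS` table, and the abbreviated sample lines are illustrative assumptions, and the code meanings are given as commonly documented (the thread itself only shows the raw codes).

```python
import re
from collections import Counter

# Meanings as commonly documented; the thread only shows the raw codes.
KNOWN_HRESULTS = {
    "0x80080005": "CO_E_SERVER_EXEC_FAILURE: the COM server (Excel) failed to start",
    "0x800A03EC": "generic Excel automation error (low word 0x03EC = 1004)",
}

# Matches the "COMException (0x........)" form used in the quoted messages.
HRESULT_RE = re.compile(r"COMException \((0x[0-9A-Fa-f]{8})\)")


def tally_hresults(log_text):
    """Count how often each COMException HRESULT appears in log_text."""
    codes = ("0x" + code[2:].upper() for code in HRESULT_RE.findall(log_text))
    return Counter(codes)


# Abbreviated lines taken from the messages quoted above.
SAMPLE_LOG = """\
Error while starting Excel process: System.Runtime.InteropServices.COMException (0x80080005): Retrieving the COM class factory ...
Session 114, Task 1 - ExcelService failed to handle a request on RTG23REM09. ... System.Runtime.InteropServices.COMException (0x80080005): Retrieving ...
System.IO.IOException: Found unreadable content in the workbook. --> System.Runtime.InteropServices.COMException (0x800A03EC)
"""

if __name__ == "__main__":
    for code, count in tally_hresults(SAMPLE_LOG).most_common():
        print(count, code, KNOWN_HRESULTS.get(code, "unknown"))
```

Run against an exported event log, a tally like this would show whether the failures are dominated by Excel failing to start (0x80080005) or by the workbook open itself (0x800A03EC).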

    Is there a limit on the number of cores for HPC Services for Excel?

    Has anyone seen this before? Any ideas?

    Tuesday, June 30, 2015 2:36 AM

Answers

  • Currently there is another way to solve this issue. Instead of under-subscribing the cores on the compute nodes, you can run the Excel workbook under active sessions on the compute nodes, which should work fine with many cores and service hosts. Follow the steps below:

    1. Open HPC Cluster Manager on the head node, click Configuration, and then click Services.

    2. Double-click the service named 'Microsoft.Hpc.Excel.ExcelService' and, in the configuration file that opens ('Microsoft.Hpc.Excel.ExcelService_1.0.xml'), add the following:

        <microsoft.Hpc.Session.ServiceRegistration>
          <service assembly="%CCP_HOME%Bin\Microsoft.Hpc.Excel.ExcelService.dll"
                   contract="Microsoft.Hpc.Excel.IExcelService"
                   type="Microsoft.Hpc.Excel.ExcelService"
                   includeExceptionDetailInFaults="true"
                   maxConcurrentCalls="1"
                   serviceInitializationTimeout="60000"
                   maxMessageSize="3665536">
            <environmentVariables>
              <add name="HPC_ATTACHTOSESSION" value="Try"/>
            </environmentVariables>
          </service>
        </microsoft.Hpc.Session.ServiceRegistration>

    3. From Cluster Manager, click Node Management. In the node list, select all the compute nodes used for Excel, right-click, and choose Remote Desktop from the context menu. RDP to all the nodes using the user credentials under which the Excel workbook runs. This creates an active interactive session in which Excel can launch (you can observe it when the Excel job runs on the nodes).

    Then you can run the Excel workbook on this cluster, and the "unreadable content" error should be gone.
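Before rerunning the job, it may be worth confirming that the edit from step 2 actually landed in the service configuration. The sketch below is a hedged Python example using only the standard library; `attach_to_session_value` is a hypothetical helper name, the fragment is shortened from the one above, and it assumes the elements carry no XML namespace.

```python
import xml.etree.ElementTree as ET


def attach_to_session_value(config_xml):
    """Return the HPC_ATTACHTOSESSION value in config_xml, or None if it is not set.

    Assumes the <service> and <add> elements carry no XML namespace.
    """
    root = ET.fromstring(config_xml)
    for service in root.iter("service"):
        for add in service.iter("add"):
            if add.get("name") == "HPC_ATTACHTOSESSION":
                return add.get("value")
    return None


# Shortened copy of the fragment from step 2 (raw string keeps the backslash).
FRAGMENT = r"""
<microsoft.Hpc.Session.ServiceRegistration>
  <service assembly="%CCP_HOME%Bin\Microsoft.Hpc.Excel.ExcelService.dll"
           contract="Microsoft.Hpc.Excel.IExcelService"
           type="Microsoft.Hpc.Excel.ExcelService"
           maxConcurrentCalls="1">
    <environmentVariables>
      <add name="HPC_ATTACHTOSESSION" value="Try"/>
    </environmentVariables>
  </service>
</microsoft.Hpc.Session.ServiceRegistration>
"""

if __name__ == "__main__":
    print(attach_to_session_value(FRAGMENT))
```

For a real file, the same function could be fed the text read from the configuration file; where that file lives on the head node is not stated in this thread.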

    If you have any further questions, please let us know.

    BR,

    Yutong

    • Marked as answer by MChen19th Tuesday, August 25, 2015 2:08 PM
    Friday, August 7, 2015 4:04 AM

All replies

  • There is no hard limit on the number of cores for the HPC Excel Service.

    Could you try moving the workbook from the head node file share to a local folder on the compute node, setting it to read-only, and retrying?

    BR,

    Yutong

    Tuesday, June 30, 2015 2:11 PM
  • No, it didn't work. I copied the workbook to a folder D:\shared\folder\ on each compute node and set the workbook path to "D:\shared\folder\CalcWorkbook.xlsb" when starting the session.

    The problem remained: some requests were processed and their results were returned to the client side, but some were not, and there were still "unreadable content" errors in the logs on the compute node(s).
    • Edited by MChen19th Wednesday, July 1, 2015 1:24 AM
    Wednesday, July 1, 2015 1:23 AM
  • Has anyone reported this before?

    Even with the "SimpleWorkbook.xls" from Microsoft I can see this issue happen...

    Friday, July 3, 2015 2:41 AM
  • This issue is caused by too many service hosts trying to open the same workbook at once. The number of service hosts equals the number of SOA tasks running on the node, which normally equals the number of cores on the node. If the cores of the node are oversubscribed, e.g. from 8 to 32, then 32 service hosts open the same workbook at the same time instead of 8. Some service hosts may fail with the COMException (0x800A03EC) when opening the workbook.

    It is also not recommended to use the core oversubscription feature with Excel workbook runs, because the workbook is opened by each service host (one per core) to respond to the SOA requests that invoke VBA macros. Because CPU cycles on the cores are limited, oversubscribing may not improve overall request throughput; instead, it consumes additional system resources, which may cause application errors.
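The relationship described above can be made concrete with a tiny model. This is an illustrative Python sketch only: the `Node` class and its fields are hypothetical, and it encodes just the rule stated in the reply (one service host per SOA task, normally one task per subscribed core).

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical description of one compute node."""
    name: str
    physical_cores: int
    subscribed_cores: int

    @property
    def service_hosts(self):
        # Per the reply: one service host per SOA task, normally one task per
        # subscribed core, and each host opens its own copy of the workbook.
        return self.subscribed_cores

    @property
    def oversubscription(self):
        # Ratio above 1.0 means more service hosts than physical cores.
        return self.subscribed_cores / self.physical_cores


if __name__ == "__main__":
    node = Node("node1", physical_cores=8, subscribed_cores=32)
    print(node.service_hosts, "concurrent workbook opens,",
          node.oversubscription, "x oversubscribed")
```

With the numbers from the thread (8 physical cores subscribed as 32), the model gives 32 simultaneous workbook opens on a node whose hardware can only run 8 of them at full speed.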

    My best,

    Yutong

    Thursday, July 9, 2015 4:16 AM
  • Yutong,

    I appreciate your reply.

    I can understand that many service hosts run on the node and open the workbook at the same time. Increasing the subscribed cores on a node was just for testing; I actually tried this on a node that has 24 physical cores and the issue remained. We have a cluster with over 600 physical cores, so I guess for now I cannot use all of the cluster's resources to speed up the Excel workbook calculation.

    Microsoft introduced workbook offloading mode for running Excel workbooks on a cluster, so I originally thought this would not be a problem. Oversubscribing cores may slow down performance, but I did not expect COM exceptions like these.

    Previously I thought this would be managed by the HPC service together with the WCF broker: the broker has a queue that can cache the large number of requests it receives and dispatch them one after another, so if there is a problem opening the Excel workbook, it should wait...

    For now I should probably try limiting the number of requests sent to the cluster at the same time. If a request takes 50 s to calculate, I can send, for example, 1 request per second instead of dozens or hundreds at once.

    Friday, July 10, 2015 3:03 AM
  • Hi MChen19th,

    Limiting the number of requests won't help, because the root cause is too many service hosts being opened on a single node. Each service host tries to open the same Excel workbook under the non-interactive session, which is not stable. To work around this issue, you can under-subscribe the cores on the compute nodes if the number of cores on each compute node exceeds a certain number, e.g. 12, depending on the machine specs. Here are example steps:

    1.    Open a Windows PowerShell window and run Add-PSSnapin Microsoft.HPC
    2.    Run Get-HpcNode -Name <ComputeNodeName> | Set-HpcNodeState -State Offline | Set-HpcNode -SubscribedCores 12 | Set-HpcNodeState -State Online

    If you are using the compute node for other job types besides Excel workbooks, you can use SubscribedSockets instead of SubscribedCores and change the VBA in the workbook, as in the example below, to use socket as the resource type:
    HPCExcelClient.OpenSession headNode:=HPC_ClusterScheduler, remoteWorkbookPath:=HPCWorkbookPath, resourceType:=SessionUnitType.SessionUnitType_Socket, jobTemplate:=HPC_JobTemplate
    For details about how to change the subscribed cores/sockets, please also refer here.
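When many nodes need the same treatment, the command from step 2 can be generated per node rather than typed by hand. The Python sketch below only builds the command text from the cmdlets shown in this reply; the helper names, the node names, and the default cap of 12 (the example figure given above) are assumptions, and the output should be reviewed before it is run in PowerShell.

```python
def target_cores(physical_cores, cap=12):
    """Pick the subscribed core count: never more than the cap or the hardware."""
    if physical_cores < 1 or cap < 1:
        raise ValueError("core counts must be positive")
    return min(physical_cores, cap)


def undersubscribe_command(node_name, cores):
    """Build the PowerShell pipeline from step 2 for one node (text only)."""
    return (
        f"Get-HpcNode -Name {node_name} | Set-HpcNodeState -State Offline | "
        f"Set-HpcNode -SubscribedCores {cores} | Set-HpcNodeState -State Online"
    )


if __name__ == "__main__":
    # Hypothetical inventory: node name -> physical core count.
    inventory = {"NODE01": 24, "NODE02": 8}
    for name, physical in inventory.items():
        print(undersubscribe_command(name, target_cores(physical)))
```

Nodes that already have fewer cores than the cap keep their physical count, so the generated command never oversubscribes them.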

    We are also planning to address this issue in the QFEs for HPC Pack 2012 R2 Update 2. Stay tuned.

    BR,

    Yutong

    Thursday, July 23, 2015 8:36 AM
  • Yutong,

    Your solution fixed the problem on my side. I have a test cluster with the compute node, head node, and broker node on the same machine, so I cannot use RDP from HPC Cluster Manager as you described. Instead, I logged on to the node over RDP with the credentials under which the Excel job runs and then submitted a job with 24 cores: no "unreadable content" errors any more. I'll find a cluster with more cores and give it a try.

    Thanks so much!


    • Edited by MChen19th Monday, August 31, 2015 8:20 AM
    Tuesday, August 25, 2015 8:48 AM