none
jobs stuck in validation stage RRS feed

  • Question

  • We have a problem that jobs are stuck in the validation stage and the SQL database error out with the error message:

    "The query processor ran out of internal resources and could not produce a query plan. This is a rare event and only expected for extremely complex queries or queries that reference a very large number of tables or partitions. Please simplify the query. If you believe you have received this message in error, contact Customer Support Services for more information."

    We experienced the problem for the first time roughly a month after applying the two fixes to HPC Pack 2012 R2 Update 3.  We have tried to configure a completely new headnode and compute nodes with HPC Pack 2016 Update 2 + KB4481650, but the problem persists. We are running Microsoft SQL Server 2016 (64-bit) on the single headnode.

    The resources (CPU, RAM, diskspace) on the headnode do not seem to be over-utilized, and our SQL expert say the same applies to the database instance but that the query is not well designed(?)

    The job we are trying to validate has 1 node preparation task and 65280 basic tasks, and we have previously been running them without issues.

    Monday, June 17, 2019 2:55 PM

Answers

All replies

  • We have now found out that this error is caused by KB3189996 to HPC Pack 2012 R2 Update 3. This seems to have changed the some SQL queries in a way that persists in HPC Pack 2016 Update 2 and is currently preventing us from updating beyond the original HPC Pack 2012 R2 Update 2.

    It seems like the difference is that before KB3189996 it was approaching the SQL Server with small UPDATE statements for each task ID like this:

    UPDATE TaskGroup SET  Total=Total+0 ,Configuring=Configuring+0 WHERE Id=0 
    UPDATE TaskGroup SET  Total=Total+1 ,Configuring=Configuring+1 WHERE Id=184365 
    UPDATE TaskGroup SET  SumMaxCores=SumMaxCores+1 ,SumMaxNodes=SumMaxNodes+1 ,SumMaxSockets=SumMaxSockets+1 WHERE Id=0
    UPDATE TaskGroup SET  SumMaxCores=SumMaxCores+1 ,SumMaxNodes=SumMaxNodes+1 ,SumMaxSockets=SumMaxSockets+1 WHERE Id=184365 

    UPDATE TaskGroup SET  Submitted=Submitted-1 ,Queued=Queued+1 WHERE Id=0 
    UPDATE TaskGroup SET  Queued=Queued+1 ,Submitted=Submitted-1 WHERE Id=184365 
    UPDATE TaskGroup SET  SumMaxCores=SumMaxCores+0 ,SumMaxNodes=SumMaxNodes+0 ,SumMaxSockets=SumMaxSockets+0 WHERE Id=184365 

    UPDATE TaskGroup SET  Total=Total+0 ,Configuring=Configuring+0 WHERE Id=0 
    UPDATE TaskGroup SET  Total=Total+1 ,Configuring=Configuring+1 WHERE Id=184366 
    UPDATE TaskGroup SET  SumMaxCores=SumMaxCores+1 ,SumMaxNodes=SumMaxNodes+1 ,SumMaxSockets=SumMaxSockets+1 WHERE Id=0 
    UPDATE TaskGroup SET  SumMaxCores=SumMaxCores+1 ,SumMaxNodes=SumMaxNodes+1 ,SumMaxSockets=SumMaxSockets+1 WHERE Id=184366 

    UPDATE TaskGroup SET  Submitted=Submitted-1 ,Queued=Queued+1 WHERE Id=0 
    UPDATE TaskGroup SET  Queued=Queued+1 ,Submitted=Submitted-1 WHERE Id=184366 
    UPDATE TaskGroup SET  SumMaxCores=SumMaxCores+0 ,SumMaxNodes=SumMaxNodes+0 ,SumMaxSockets=SumMaxSockets+0 WHERE Id=184366 

    This same pattern then continue for additional ID’s.

    After KB3189996 

    Comparing this to the UPDATE statement where error 8623 encountered the UPDATE was targeted towards several ID’s looked up with “RecordId IN” in the WHERE clause and thus failing when there was too many values (>10000).

    UPDATE TaskGroup SET Queued=Queued+1*Change,Submitted=Submitted-1*Change 
    FROM TaskGroup  INNER JOIN(      SELECT GroupId, COUNT(GroupId) as Change FROM TaskDetail     
    WHERE RecordId IN (114597884,114597885,114597886,114597887,114597888,114597889,114597890,114597891,114597892,114597893,114597894,114597895,114597896,114597897,114597898,114597899,114597900,114597901,114597902,114597903,114597904,114597905,114597906,114597907,114597908,114597909,114597910,114597911,114597912,114597913,114597914,114597915,…


    Tuesday, June 25, 2019 11:47 AM
  • Thanks for the information. The issue is addressed.

    Chenling


    Thursday, June 27, 2019 12:17 PM