Intermittent task failure with "Error from node: 'The parameter is incorrect' reported creating the task."

  • Question

  • I am using HPC Pack 2012 R2 4.2.4400.0 and I've had this intermittent error a few times where a command line task will fail with the following:

    Error from node: [MYCOMPUTENODE]:Microsoft.Hpc.Activation.NodeManagerException: Exception 'The parameter is incorrect' reported creating the task.

    Server stack trace: 
    at Microsoft.Hpc.NodeManager.RemotingExecutor.RemotingNMExecImpl.StartTask(Int32 jobId, Int32 taskId, ProcessStartInfo startInfo)
    at Microsoft.Hpc.NodeManager.RemotingCommunicator.RemotingNMCommImpl.StartTask(Int32 jobId, Int32 taskId, ProcessStartInfo startInfo)
    at System.Runtime.Remoting.Messaging.StackBuilderSink._PrivateProcessMessage(IntPtr md, Object[] args, Object server, Object[]& outArgs)
    at System.Runtime.Remoting.Messaging.StackBuilderSink.SyncProcessMessage(IMessage msg)

    Exception rethrown at [0]
    at System.Runtime.Remoting.Proxies.RealProxy.HandleReturnMessage(IMessage reqMsg, IMessage retMsg)
    at System.Runtime.Remoting.Proxies.RealProxy.PrivateInvoke(MessageData& msgData, Int32 type)
    at Microsoft.Hpc.Scheduler.Communicator.Remoting.NodeController.StartTaskWorker.EndInvoke(IAsyncResult result)
    at Microsoft.Hpc.Scheduler.Communicator.Remoting.NodeController.AsyncContext`1.EndCall(IAsyncResult result)

    It always seems to be the first task in a job, and it seems to fix itself. Subsequent tasks that run on that same node will work OK without any intervention.

    Any idea what the cause might be?



    • Edited by TimJRoberts1 Wednesday, November 4, 2015 10:35 AM
    Wednesday, November 4, 2015 10:31 AM

All replies

  • Hi Tim,

    This is a wrapper error message sent from the compute node. Could you please check the logs on the compute node and share them here? The logs there should contain an error message with a call stack.

    To check the logs on the compute node:

    - Go to the compute node

    - cd to %CCP_HOME%LogFiles\scheduler

    - Run: HPCLog parselog hpcnodemanager_*.bin

    You can select the .bin file that contains the logs. The .bin log files work like this: the latest file, the one with the highest number, is always empty, and the logs are written to the file with the next-highest number.
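The file-selection rule above can be sketched in Python. This is a minimal sketch, not part of HPC Pack: the helper name `pick_active_log` is hypothetical, and it assumes the files are named `hpcnodemanager_<number>.bin`, which is inferred from the wildcard in the command above.

```python
import re

# Assumed naming scheme: hpcnodemanager_<number>.bin (inferred from the wildcard above).
LOG_NAME = re.compile(r"^hpcnodemanager_(\d+)\.bin$", re.IGNORECASE)

def pick_active_log(filenames):
    """Return the log file with the second-highest number, or None.

    Per the description above, the highest-numbered file is always empty
    and the current logs are in the file numbered just below it.
    """
    numbered = []
    for name in filenames:
        match = LOG_NAME.match(name)
        if match:
            numbered.append((int(match.group(1)), name))
    numbered.sort()
    if len(numbered) < 2:
        return None  # nothing below the (empty) highest-numbered file
    return numbered[-2][1]
```

The returned name could then be passed to `HPCLog parselog` in place of the wildcard.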


    Qiufang Shi


    Wednesday, November 4, 2015 11:43 AM
  • See below. In this instance a job had 5 tasks that all executed on the same compute node, and 1 of them (JobId 6553, TaskId 772558) failed with "Error from node: Exception 'The parameter is incorrect' reported creating the task."


    05:14:26.107, SrcFile="HpcNodeManager" SrcFunc="" SrcLine="0" Pid="3836" Tid="13052" TS="0x01d116bfac9e417b" String1="Creating job entry for job ID 6553, user [MYUSER]" 
    05:14:26.107, SrcFile="HpcNodeManager" SrcFunc="" SrcLine="0" Pid="3836" Tid="13052" TS="0x01d116bfac9e417b" String1="Start Task JobId 6553, TaskId 772558." 
    05:14:26.107, SrcFile="HpcNodeManager" SrcFunc="" SrcLine="0" Pid="3836" Tid="13052" TS="0x01d116bfac9e417b" String1="Creating process for JobId 6553, TaskId 772558 with the command line C:\windows\system32\cmd.exe /S /c "[MYAPP.exe] 769994" 
    05:14:30.263, SrcFile="HpcNodeManager" SrcFunc="" SrcLine="0" Pid="3836" Tid="6356" TS="0x01d116bfaf18760b" String1="The head node requested a cancel of job 6553" 
    05:14:30.263, SrcFile="HpcNodeManager" SrcFunc="" SrcLine="0" Pid="3836" Tid="6356" TS="0x01d116bfaf18760b" String1="JobId 6553, TaskId 772558 Being terminated for cancelation or cleanup" 
    05:14:30.372, SrcFile="HpcNodeManager" SrcFunc="" SrcLine="0" Pid="3836" Tid="15836" TS="0x01d116bfaf292696" String1="Sharing existing JobEntry" 
    05:14:30.497, SrcFile="HpcNodeManager" SrcFunc="" SrcLine="0" Pid="3836" Tid="15836" TS="0x01d116bfaf3c3975" String1="Start Task JobId 6553, TaskId 772558." 
    05:14:30.497, SrcFile="HpcNodeManager" SrcFunc="" SrcLine="0" Pid="3836" Tid="15836" TS="0x01d116bfaf3c3975" String1="Exception creating new task_ JobId 6553, TaskId 772558:The job identifier 6553 is invalid." 
    05:14:30.544, SrcFile="HpcNodeManager" SrcFunc="" SrcLine="0" Pid="3836" Tid="6356" TS="0x01d116bfaf4360a4" String1="Start Task JobId 6553, TaskId 772558." 
    05:14:30.544, SrcFile="HpcNodeManager" SrcFunc="" SrcLine="0" Pid="3836" Tid="6356" TS="0x01d116bfaf4360a4" String1="Exception creating new task_ JobId 6553, TaskId 772558:The job identifier 6553 is invalid." 
    05:14:30.560, SrcFile="HpcNodeManager" SrcFunc="" SrcLine="0" Pid="3836" Tid="15836" TS="0x01d116bfaf45c301" String1="The head node requested a cancel of job 6553" 
    05:14:30.685, SrcFile="HpcNodeManager" SrcFunc="" SrcLine="0" Pid="3836" Tid="6356" TS="0x01d116bfaf58d5d3" String1="Creating job entry for job ID 6553, user [MYUSER]" 
    05:14:30.685, SrcFile="HpcNodeManager" SrcFunc="" SrcLine="0" Pid="3836" Tid="6356" TS="0x01d116bfaf58d5d3" String1="Start Task JobId 6553, TaskId 772558." 
    05:14:30.685, SrcFile="HpcNodeManager" SrcFunc="" SrcLine="0" Pid="3836" Tid="6356" TS="0x01d116bfaf58d5d3" String1="Creating process for JobId 6553, TaskId 772558 with the command line C:\windows\system32\cmd.exe /S /c "[MYAPP.exe] 769994" 
    05:14:30.685, SrcFile="HpcNodeManager" SrcFunc="" SrcLine="0" Pid="3836" Tid="6356" TS="0x01d116bfaf58d5d3" String1="Exception creating new task_ JobId 6553, TaskId 772558:The parameter is incorrect" 
    05:14:30.747, SrcFile="HpcNodeManager" SrcFunc="" SrcLine="0" Pid="3836" Tid="15836" TS="0x01d116bfaf625f5f" String1="Start Task JobId 6553, TaskId 772559." 
    05:14:30.747, SrcFile="HpcNodeManager" SrcFunc="" SrcLine="0" Pid="3836" Tid="15836" TS="0x01d116bfaf625f5f" String1="Creating process for JobId 6553, TaskId 772559 with the command line C:\windows\system32\cmd.exe /S /c "[MYAPP.exe] 769995" 
    05:14:34.154, SrcFile="HpcNodeManager" SrcFunc="" SrcLine="0" Pid="3836" Tid="15828" TS="0x01d116bfb16a228d" String1="Notified that NMProxy has exited for JobId 6553, TaskId 772559" 
    05:14:34.154, SrcFile="HpcNodeManager" SrcFunc="" SrcLine="0" Pid="3836" Tid="15828" TS="0x01d116bfb16a228d" String1="JobId 6553, TaskId 772559 Being terminated for cancelation or cleanup" 
    05:14:34.154, SrcFile="HpcNodeManager" SrcFunc="" SrcLine="0" Pid="3836" Tid="15828" TS="0x01d116bfb16a228d" String1="Job 6553, Task 772559, returned Exit Code 0" 
    05:14:34.154, SrcFile="HpcNodeManager" SrcFunc="" SrcLine="0" Pid="3836" Tid="15828" TS="0x01d116bfb16a228d" String1="Cancel JobId 6553, TaskId 772559" 
    05:14:34.201, SrcFile="HpcNodeManager" SrcFunc="" SrcLine="0" Pid="3836" Tid="6356" TS="0x01d116bfb1714998" String1="Start Task JobId 6553, TaskId 772560." 
    05:14:34.201, SrcFile="HpcNodeManager" SrcFunc="" SrcLine="0" Pid="3836" Tid="6356" TS="0x01d116bfb1714998" String1="Creating process for JobId 6553, TaskId 772560 with the command line C:\windows\system32\cmd.exe /S /c "[MYAPP.exe] 769996" 
    05:14:35.716, SrcFile="HpcNodeManager" SrcFunc="" SrcLine="0" Pid="3836" Tid="15836" TS="0x01d116bfb2588ec8" String1="Sharing existing JobEntry" 
    05:14:35.716, SrcFile="HpcNodeManager" SrcFunc="" SrcLine="0" Pid="3836" Tid="15836" TS="0x01d116bfb2588ec8" String1="Start Task JobId 6553, TaskId 772561." 
    05:14:35.716, SrcFile="HpcNodeManager" SrcFunc="" SrcLine="0" Pid="3836" Tid="13052" TS="0x01d116bfb2588ec8" String1="Sharing existing JobEntry" 
    05:14:35.732, SrcFile="HpcNodeManager" SrcFunc="" SrcLine="0" Pid="3836" Tid="15836" TS="0x01d116bfb25af125" String1="Creating process for JobId 6553, TaskId 772561 with the command line C:\windows\system32\cmd.exe /S /c "[MYAPP.exe] 769997" 
    05:14:35.732, SrcFile="HpcNodeManager" SrcFunc="" SrcLine="0" Pid="3836" Tid="13052" TS="0x01d116bfb25af125" String1="Start Task JobId 6553, TaskId 772562." 
    05:14:35.732, SrcFile="HpcNodeManager" SrcFunc="" SrcLine="0" Pid="3836" Tid="13052" TS="0x01d116bfb25af125" String1="Creating process for JobId 6553, TaskId 772562 with the command line C:\windows\system32\cmd.exe /S /c "[MYAPP.exe] 769998" 
    05:14:50.279, SrcFile="HpcNodeManager" SrcFunc="" SrcLine="0" Pid="3836" Tid="4732" TS="0x01d116bfbb06a959" String1="Notified that NMProxy has exited for JobId 6553, TaskId 772562" 
    05:14:50.279, SrcFile="HpcNodeManager" SrcFunc="" SrcLine="0" Pid="3836" Tid="4732" TS="0x01d116bfbb06a959" String1="JobId 6553, TaskId 772562 Being terminated for cancelation or cleanup" 
    05:14:50.279, SrcFile="HpcNodeManager" SrcFunc="" SrcLine="0" Pid="3836" Tid="4732" TS="0x01d116bfbb06a959" String1="Job 6553, Task 772562, returned Exit Code 0" 
    05:14:50.279, SrcFile="HpcNodeManager" SrcFunc="" SrcLine="0" Pid="3836" Tid="4732" TS="0x01d116bfbb06a959" String1="Cancel JobId 6553, TaskId 772562" 
    05:15:02.389, SrcFile="HpcNodeManager" SrcFunc="" SrcLine="0" Pid="3836" Tid="15928" TS="0x01d116bfc23e70f6" String1="Notified that NMProxy has exited for JobId 6553, TaskId 772561" 
    05:15:02.389, SrcFile="HpcNodeManager" SrcFunc="" SrcLine="0" Pid="3836" Tid="15928" TS="0x01d116bfc23e70f6" String1="JobId 6553, TaskId 772561 Being terminated for cancelation or cleanup" 
    05:15:02.389, SrcFile="HpcNodeManager" SrcFunc="" SrcLine="0" Pid="3836" Tid="15928" TS="0x01d116bfc23e70f6" String1="Job 6553, Task 772561, returned Exit Code 0" 
    05:15:02.389, SrcFile="HpcNodeManager" SrcFunc="" SrcLine="0" Pid="3836" Tid="15928" TS="0x01d116bfc23e70f6" String1="Cancel JobId 6553, TaskId 772561" 
    05:15:13.592, SrcFile="HpcNodeManager" SrcFunc="" SrcLine="0" Pid="3836" Tid="11880" TS="0x01d116bfc8ebef61" String1="Notified that NMProxy has exited for JobId 6553, TaskId 772560" 
    05:15:13.592, SrcFile="HpcNodeManager" SrcFunc="" SrcLine="0" Pid="3836" Tid="11880" TS="0x01d116bfc8ebef61" String1="JobId 6553, TaskId 772560 Being terminated for cancelation or cleanup" 
    05:15:13.592, SrcFile="HpcNodeManager" SrcFunc="" SrcLine="0" Pid="3836" Tid="11880" TS="0x01d116bfc8ebef61" String1="Job 6553, Task 772560, returned Exit Code 0" 
    05:15:13.592, SrcFile="HpcNodeManager" SrcFunc="" SrcLine="0" Pid="3836" Tid="11880" TS="0x01d116bfc8ebef61" String1="Cancel JobId 6553, TaskId 772560" 
    05:15:13.608, SrcFile="HpcNodeManager" SrcFunc="" SrcLine="0" Pid="3836" Tid="15836" TS="0x01d116bfc8ee51cc" String1="The head node requested a cancel of job 6553" 
    05:15:17.780, SrcFile="memorylog.cpp" SrcFunc="CheckDumpMemoryLogCounters" SrcLine="52" Pid="3836" Tid="13744" TS="0x01d116bfcb6ae8c6" String1="TimeIntervalMs=60172,EntriesProcessed=43, BytesProcessed=12786, MaxQueuedBytes=3584" 
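To follow a single task through output like the above, the lines can be filtered by task ID. The following is a minimal Python sketch that assumes each line has the layout shown in this log (a leading timestamp, then `Tid="..."`, then `String1="..."`). The function names `parse_line` and `task_events` are hypothetical, not part of any HPC Pack tool.

```python
import re

# Layout assumed from the log excerpt above: time, ... Tid="n" ... String1="message"
LINE_RE = re.compile(
    r'^\s*(?P<time>\d{2}:\d{2}:\d{2}\.\d{3}),'
    r'.*?Tid="(?P<tid>\d+)"'
    r'.*?String1="(?P<msg>.*)"\s*$'
)

def parse_line(line):
    """Return (time, thread_id, message) for a log line, or None if it does not match."""
    match = LINE_RE.match(line)
    if not match:
        return None
    return match.group("time"), int(match.group("tid")), match.group("msg")

def task_events(lines, task_id):
    """Return parsed entries whose message mentions the given task.

    Messages in the excerpt use both "TaskId N" and "Task N", so both are matched.
    """
    wanted = re.compile(rf"\bTask(?:Id)? {task_id}\b")
    events = []
    for line in lines:
        parsed = parse_line(line)
        if parsed and wanted.search(parsed[2]):
            events.append(parsed)
    return events
```

For example, `task_events(open("parsed.txt"), 772558)` would list only the entries for the failing task, where `parsed.txt` is a hypothetical file holding the parsed log text.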



    • Edited by TimJRoberts1 Wednesday, November 4, 2015 3:50 PM
    Wednesday, November 4, 2015 3:43 PM
  • Hi Tim,

    We've noticed this error. It seems that your job is frequently preempted by other jobs, and the task was sent to a node after the job had been removed from it.

    We mitigated this error by simply retrying: the task is sent to other nodes, or the job is created again on this node. So it should recover from that state.

    As long as the task can complete normally, you can treat this error message as benign.

    Evan

    Thursday, November 5, 2015 6:41 AM
  • Hi Evan

    I'm using Scheduling: Balanced, Pre-emption: Graceful. Would you still expect to see this since I'm not using immediate pre-emption?

    Are you suggesting that if I ever receive the error: "Exception 'The parameter is incorrect' reported creating the task" then I should catch this and resubmit the task? Or can it be handled within HPC somehow?
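For illustration, the "catch this and resubmit" option could look like the minimal Python sketch below. Everything in it is an assumption: `submit` stands for whatever hypothetical client code submits the task and returns a `(succeeded, error_text)` pair, and `run_with_retry` is not an HPC Pack API.

```python
TRANSIENT_MARKER = "The parameter is incorrect"

def run_with_retry(submit, max_attempts=3):
    """Call submit() until it succeeds; return the attempt number that succeeded.

    submit is a hypothetical callable returning (succeeded, error_text).
    Only failures containing TRANSIENT_MARKER are retried; any other
    failure, or running out of attempts, raises RuntimeError.
    """
    last_error = ""
    for attempt in range(1, max_attempts + 1):
        succeeded, error_text = submit()
        if succeeded:
            return attempt
        last_error = error_text
        if TRANSIENT_MARKER not in error_text:
            break  # not the intermittent error discussed here, so do not retry
    raise RuntimeError(f"task failed: {last_error}")
```

Limiting the retry to this one error message keeps genuine application failures from being resubmitted repeatedly.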

    Thanks

    Tim

    Thursday, November 5, 2015 9:07 AM
  • Hi Tim,

    Could you let me know the final state of the task? Does it succeed or fail?

    Thanks,
    Evan

    Thursday, November 5, 2015 9:34 AM
  • The task fails.
    Thursday, November 5, 2015 12:30 PM
  • OK, did you see it get automatically retried? If not, please send me the scheduler logs under %CCP_HOME%\Data\LogFiles\Scheduler, and also the job ID.

    You can send a mail to evanc@microsoft.com, so I will let you know how to transfer the logs.

    Thanks,

    Evan

    Thursday, November 5, 2015 2:05 PM
  • Hi Evan,

    We have faced the same issue in our environment, and the task was not automatically retried. Could you please give me the latest update on this issue? If you have a fix, please let me know.

    Regards,

    Thendralvanan.

    Monday, April 4, 2016 2:20 PM
  • Hi Thendralvanan,

    We need to check the logs to investigate the issue. Please send an email to the address mentioned above, and I will let you know how to transfer the logs to us.

    Thanks,
    Evan

    Tuesday, April 5, 2016 6:28 AM