Multiple processes calling NtResumeProcess and CreateToolhelp32Snapshot at the same time may hang the system kernel

  • Question

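  • Our customer's servers have had several incidents in which a KVM virtual machine hung. After investigating, our preliminary suspicion is that the hang can be triggered when multiple processes call NtResumeProcess and CreateToolhelp32Snapshot at the same time.

    The customer's operating system is Windows Server 2012 R2 with the latest patches installed.

    The analysis below uses one of the dumps as an example. The dump file can be downloaded from: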
    https://1drv.ms/u/s!AlkuEARwQ72ygWuXhNPd6dWcQEgn?e=YoCnFY

    Running !pcr for each of processors 0, 1, 2, and 3 shows the thread currently executing on each of the four cores:

    cpu 0: ffffe001ccaf7080
    cpu 1: ffffe001c7f9a500
    cpu 2: ffffe001c808f080
    cpu 3: ffffe001c7788080

    Examine thread ffffe001ccaf7080:

    32.0: kd> !thread ffffe001ccaf7080
    THREAD ffffe001ccaf7080  Cid 4488.1f5c  Teb: 00007ff6c275c000 Win32Thread: 0000000000000000 RUNNING on processor 0
    Not impersonating
    DeviceMap                 ffffc001a340c310
    Owning Process            ffffe001ce3885c0       Image:         TitanMonitor.exe
    Attached Process          N/A            Image:         N/A
    Wait Start TickCount      106035470      Ticks: 70021 (0:00:18:14.078)
    Context Switch Count      275406520      IdealProcessor: 0             
    UserTime                  00:03:42.343
    KernelTime                00:47:23.265
    Win32 Start Address 0x00007ffd549e4a30
    Stack Init ffffd000d4b50dd0 Current ffffd000d4b50110
    Base ffffd000d4b51000 Limit ffffd000d4b4b000 Call 0000000000000000
    Priority 15 BasePriority 15 PriorityDecrement 0 IoPriority 2 PagePriority 5
    Child-SP          RetAddr           : Args to Child                                                           : Call Site
    00000000`00000000 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : 0x0



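    Because the memory dump was captured with qemu, the real stack is not shown and has to be recovered manually: first switch to the thread's owning process, then rebuild the stack.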

    32.0: kd> .process /r /p ffffe001ce3885c0
    32.0: kd> kb = ffffd000`d4b50660 fffff803`74c206b0 20
     # RetAddr           : Args to Child                                                           : Call Site
    00 fffff803`74c1fb4c : fffff803`74c63c70 00000000`0000002f ffffd000`79517350 ffffd000`d4b50700 : hal!HalpApicRequestInterrupt+0x100
    01 fffff803`74dcaddc : ffffe001`c9167880 ffffc001`b48cd330 00000000`4592db88 00000000`00000000 : hal!HalRequestSoftwareInterrupt+0xdd
    02 fffff803`74d64560 : ffffe001`c9167880 ffffd8f0`004efbc4 00000000`00000000 00000000`00000000 : nt!KiInterruptDispatchNoLockNoEtw+0xbc
    03 fffff803`74d41759 : ffffe001`000000fd ffffe001`c9167880 fffff803`74f69100 00000000`00000003 : nt!KiResumeThread+0x230
    04 fffff803`750e0de0 : ffffe001`00000001 ffffe001`ce136900 ffffe001`c9c7d240 00000000`00000000 : nt!KeResumeThread+0x79
    05 fffff803`752866bb : 00000000`00000000 00000000`00000000 ffffd000`d4b50cc0 fffff803`750e5601 : nt!PsResumeProcess+0x54
    06 fffff803`74dd92a3 : ffffe001`ccaf7080 ffffe001`ce136900 00000000`00369e99 ffffe001`ca4bc7d0 : nt!NtResumeProcess+0x4f
    07 00007ffd`65f71d1a : 00007ff6`c276fbb4 00000000`00000000 00007ffd`631f158a 00000000`00369e99 : nt!KiSystemServiceCopyEnd+0x13
    08 00007ff6`c276fbb4 : 00000000`00000000 00007ffd`631f158a 00000000`00369e99 00000000`00000000 : ntdll!NtResumeProcess+0xa
    09 00007ffd`549e4b64 : 00007ff6`c2771850 00000000`00000000 00000000`00000000 00000000`00000000 : TitanMonitor!ProcessManager::LimitAgentCPU+0xa4 [d:\jenkens_slave\workspace\agent-release-release\win_custom\vendor-v3.3.0_v3.58.19\agent-monitor\src\windows\process_limit.cpp @ 252]
    0a 00007ffd`65c213f2 : 00007ffd`549e4a30 00000000`00000000 00000000`00000000 00000000`00000000 : WINMM!timeThread+0x138
    0b 00007ffd`65ef54f4 : 00007ffd`65c213d0 00000000`00000000 00000000`00000000 00000000`00000000 : KERNEL32!BaseThreadInitThunk+0x22
    0c 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : ntdll!RtlUserThreadStart+0x34



    Disassembling the code at nt!KiResumeThread+0x230 shows that it is waiting for the thread's ThreadLock:

      while ( 1 )
      {
        __asm { lock bts qword ptr [rbx+40h], 0 }   // lock _KTHREAD.ThreadLock
        if ( !_CF )
          break;
        do
        {
          v7 += v3;
          if ( v7 & HvlLongSpinCountMask || !(HvlEnlightenments & 0x40) )
          {
            _mm_pause();
          }
          else
          {
            HvlNotifyLongSpinWait(v7);
            v3 = 1;
          }
        }
        while ( *(_QWORD *)(_RBX + 0x40) );         // KiResumeThread+0x230: wait for _KTHREAD.ThreadLock to be released



    The ThreadLock of thread ffffe001`c9167880 is 1, which means the lock is already held elsewhere; the current thread cannot acquire it and spins indefinitely.

    32.0: kd> dt _KTHREAD ffffe001`c9167880
    ntdll!_KTHREAD
       +0x000 Header           : _DISPATCHER_HEADER
       +0x018 SListFaultAddress : (null)
       +0x020 QuantumTarget    : 0x0000002d`eccefaff
       +0x028 InitialStack     : 0xffffd000`d5943dd0 Void
       +0x030 StackLimit       : 0xffffd000`d593e000 Void
       +0x038 StackBase        : 0xffffd000`d5944000 Void
       +0x040 ThreadLock       : 1
       +0x048 CycleTime        : 0x0000002d`da665f0a




    The code at nt!KeResumeThread+0x79 shows that, before calling KiResumeThread, it raises the IRQL to DISPATCH_LEVEL, so the current CPU will not switch to another thread.

      v2 = a1;
      v3 = __readcr8();
      v84 = v3;
      __writecr8(2ui64);                            // raise IRQL to DISPATCH_LEVEL
      _RSI = *MK_FP(__GS__, 32i64);
      v5 = 0;
      v6 = 1;
      if ( _interlockedbittestandset((volatile signed __int32 *)(a1 + 736), 7u) )
      {
        do
        {
          if ( ++v5 & HvlLongSpinCountMask || !(HvlEnlightenments & 0x40) )
            _mm_pause();
          else
            HvlNotifyLongSpinWait(v5);
        }
        while ( (char)*(_DWORD *)(v2 + 736) < 0 || _interlockedbittestandset((volatile signed __int32 *)(v2 + 736), 7u) );
      }
      v7 = *(_BYTE *)(v2 + 644);
      v86 = *(_BYTE *)(v2 + 644);
      if ( *(_BYTE *)(v2 + 644) )
      {
        v8 = *(_BYTE *)(v2 + 644) - 1;
        *(_BYTE *)(v2 + 644) = v8;
        if ( !v8 && !(*(_DWORD *)(v2 + 120) & 0x2000) )
          KiResumeThread(v2, _RSI, 0i64);           // call KiResumeThread
      }


    The same analysis for cpu 1:

    32.0: kd> .process /r /p ffffe001c7f5e900
    32.0: kd> kb = ffffd000`d5779170 fffff803`74c206b0 30
    # RetAddr           : Args to Child                                                           : Call Site
    00 fffff803`74c1fb4c : ffffe001`c64f82d0 00000000`0000002f 00000000`00000000 00000000`00000000 : hal!HalpApicRequestInterrupt+0x100
    01 fffff803`74dcaddc : ffffe001`c9167880 fffff803`74dcecf5 ffffd000`d57796a0 fffff803`74dcadcf : hal!HalRequestSoftwareInterrupt+0xdd
    02 fffff803`74cdb030 : 00000000`04057000 ffffe001`c8ce0a20 ffffe001`ce136900 00000000`00000001 : nt!KiInterruptDispatchNoLockNoEtw+0xbc
    03 fffff803`7504385d : ffffe001`ce136900 ffffe001`c7f9a500 00000000`03b42a90 00000000`00000000 : nt!KeQueryValuesThread+0x220
    04 fffff803`7511e569 : 00000002`00000000 fffff803`0001fc0f 00000000`00000001 00000000`00000000 : nt!ExpGetProcessInformation+0x17d
    05 fffff803`750e38f5 : 00000000`03b310b0 fffff960`000c8911 00000000`00000005 00000000`00000003 : nt!ExpQuerySystemInformation+0x1035
    06 fffff803`74dd92a3 : ffffe001`c7f9a500 00000000`00000000 00000000`7f179000 ffffe001`c7f9e990 : nt!NtQuerySystemInformation+0x49
    07 00007ffd`65f70aba : 00000000`77d5558d feeefeee`feeefeee 00000000`02e50100 00000000`039de9c0 : nt!KiSystemServiceCopyEnd+0x13
    08 00000000`77d5558d : feeefeee`feeefeee 00000000`02e50100 00000000`039de9c0 00000000`00000005 : ntdll!NtQuerySystemInformation+0xa
    09 00000000`77d517ff : 00000000`02fb8768 00000000`02fb8768 00000000`0001f80f 00000000`039dde50 : wow64!whNT32QuerySystemProcessInformationEx+0x8d
    0a 00000000`77d6b225 : 00000000`03a1fb9c 00000000`77dbc00b 00000000`00000000 00000000`7f177000 : wow64!whNtQuerySystemInformation_SpecialQueryCase+0x7f7
    0b 00000000`77d49c7b : 00000000`00021cff 00000000`77d49980 00000000`7f179000 00000000`7f177000 : wow64!whNtQuerySystemInformation+0xa5
    0c 00000000`77d91dc5 : 00000023`77e4c65c 00000000`00000023 00000000`00000002 00000000`0001f80f : wow64!Wow64SystemServiceEx+0xfb
    0d 00000000`77d5235a : 00000000`00000000 00000000`77d91574 00000000`00000000 00000000`77d52540 : wow64cpu!ServiceNoTurbo+0xb
    0e 00000000`77d52292 : 00000000`00000000 00000000`00000000 00000000`039dfd30 00000000`039df0f0 : wow64!RunCpuSimulation+0xa
    0f 00007ffd`65ef8dbb : 00000000`00000000 00000000`77d52120 00000000`00000000 00000000`00000000 : wow64!Wow64LdrpInitialize+0x172
    10 00007ffd`65ef8c9e : 00000000`039df0f0 00000000`00000000 00000000`7f2b4000 00000000`00000000 : ntdll!_LdrpInitialize+0xcb
    11 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : ntdll!LdrInitializeThunk+0xe




    Disassembling the code at nt!KeQueryValuesThread+0x220 shows the same pattern: after raising the IRQL to DISPATCH_LEVEL, it waits for the ThreadLock of thread ffffe001`c9167880.

      v5 = __readcr8();
      __writecr8(2ui64);                            // raise IRQL to DISPATCH_LEVEL
      v6 = 0;
      while ( 1 )
      {
        __asm { lock bts qword ptr [rbx+40h], 0 }   // lock _KTHREAD.ThreadLock
        if ( !_CF )
          break;
        do
        {
          if ( ++v6 & HvlLongSpinCountMask || !(HvlEnlightenments & 0x40) )
            _mm_pause();
          else
            HvlNotifyLongSpinWait(v6);
        }
        while ( *(_QWORD *)(_RBX + 0x40) );         // KeQueryValuesThread+0x220: wait for _KTHREAD.ThreadLock to be released
      }


    cpu3:

    32.0: kd> .process /r /p ffffe001c8266080
    32.0: kd> kb = ffffd000`d948d170 fffff803`74c206b0 30
     # RetAddr           : Args to Child                                                           : Call Site
    00 fffff803`74c1fb4c : ffffe001`c64f9590 00000000`0000002f 00000000`00000000 fffff801`a6507476 : hal!HalpApicRequestInterrupt+0x100
    01 fffff803`74dcaddc : ffffe001`c9167880 fffff803`74dcecf5 ffffd000`d948d6a0 fffff803`74d79845 : hal!HalRequestSoftwareInterrupt+0xdd
    02 fffff803`74cdb034 : 00000000`04057000 ffffe001`c8ce0a20 ffffe001`ce136900 00000000`00000001 : nt!KiInterruptDispatchNoLockNoEtw+0xbc
    03 fffff803`7504385d : ffffe001`ce136900 ffffe001`c7788080 00000053`d0131fd0 00000000`00000000 : nt!KeQueryValuesThread+0x224
    04 fffff803`7511e569 : 00000000`00000001 fffff803`00018000 fffff580`000a7000 00000000`00000000 : nt!ExpGetProcessInformation+0x17d
    05 fffff803`750e38f5 : 00000053`d0120000 00000000`00000002 00000053`d001f9f0 00000053`d001fa00 : nt!ExpQuerySystemInformation+0x1035
    06 fffff803`74dd92a3 : 00000000`00000001 00000000`00000100 00000000`00000001 00007ff6`643389c0 : nt!NtQuerySystemInformation+0x49
    07 00007ffd`65f70aba : 00007ffd`65c3c4c1 00000000`00000003 00007ff6`6428a73e 00000000`00000000 : nt!KiSystemServiceCopyEnd+0x13
    08 00007ffd`65c3c4c1 : 00000000`00000003 00007ff6`6428a73e 00000000`00000000 00000000`00000108 : ntdll!NtQuerySystemInformation+0xa
    09 00007ffd`65c3c257 : 00000000`00000002 00000053`00004708 00000053`d001f9f0 00000053`d001fa00 : KERNEL32!ThpCreateRawSnap+0xd2
    0a 00007ff6`64136bc6 : 00000000`00000002 00000053`00004708 ffffffff`fffffffe 00000053`d01a67a0 : KERNEL32!CreateToolhelp32Snapshot+0x107
    0b 00000000`00000002 : 00000053`00004708 ffffffff`fffffffe 00000053`d01a67a0 00000000`00000000 : titan_guard!CheckAgentServiceIsRun+0x66 [d:\jenkens_slave\workspace\agent-release-release\win_custom\vendor-v3.3.0_v3.58.19\agent-monitor\src\titan_guard\check\check_process_windows.cpp @ 151]
    0c 00000053`00004708 : ffffffff`fffffffe 00000053`d01a67a0 00000000`00000000 00000053`d01af4b0 : 0x2
    0d ffffffff`fffffffe : 00000053`d01a67a0 00000000`00000000 00000053`d01af4b0 00000000`00000000 : 0x00000053`00004708
    0e 00000053`d01a67a0 : 00000000`00000000 00000053`d01af4b0 00000000`00000000 00000000`00000029 : 0xffffffff`fffffffe
    0f 00000000`00000000 : 00000053`d01af4b0 00000000`00000000 00000000`00000029 00000000`0000002f : 0x00000053`d01a67a0




    KeQueryValuesThread+0x224 is in the same place as on cpu 1: it is executing while ( *(_QWORD *)(_RBX + 0x40) );

    cpu 0, 1, and 3 are all waiting for the ThreadLock of thread ffffe001`c9167880, and that ThreadLock was in fact taken by cpu 2.

    cpu2:

    32.0: kd> .process /r /p ffffe001ce3885c0
    32.0: kd> kb = ffffd000`d508a170 fffff803`74c206b0 30
     # RetAddr           : Args to Child                                                           : Call Site
    00 fffff803`74c1fb4c : ffffe001`c64f8c30 00000000`0000002f 00000000`00000000 00000000`00000202 : hal!HalpApicRequestInterrupt+0x100
    01 fffff803`74dcaddc : ffffe001`c9167880 006f0048`00650063 ffffd000`d508a6a0 00001f80`00000000 : hal!HalRequestSoftwareInterrupt+0xdd
    02 fffff803`74cdaf64 : 00000000`04057000 ffffe001`c8ce0a20 ffffe001`ce136900 00000000`00000001 : nt!KiInterruptDispatchNoLockNoEtw+0xbc
    03 fffff803`7504385d : 00000000`00000000 0000004f`02291c00 0000004f`02292110 00000000`00000000 : nt!KeQueryValuesThread+0x154
    04 fffff803`7511e569 : 00000000`00000001 fffff803`00018000 fffff580`0009e000 00000000`00000000 : nt!ExpGetProcessInformation+0x17d
    05 fffff803`750e38f5 : 0000004f`02280000 00000000`00000002 0000004f`02b2f210 0000004f`02b2f220 : nt!ExpQuerySystemInformation+0x1035
    06 fffff803`74dd92a3 : ffffe001`c808f080 0000004f`02b2f658 ffffffff`ff676980 00000000`000003dc : nt!NtQuerySystemInformation+0x49
    07 00007ffd`65f70aba : 00007ffd`65c3c4c1 0000004f`00000000 00000000`00000003 0000004f`ffffff01 : nt!KiSystemServiceCopyEnd+0x13
    08 00007ffd`65c3c4c1 : 0000004f`00000000 00000000`00000003 0000004f`ffffff01 00007ffd`fffffd00 : ntdll!NtQuerySystemInformation+0xa
    09 00007ffd`65c3c257 : 00000000`00000002 00000000`00004488 0000004f`02b2f210 0000004f`02b2f220 : KERNEL32!ThpCreateRawSnap+0xd2
    0a 00007ff6`c276fe00 : 0000004f`00000002 0000004f`00004488 00000000`0000008a 0000004f`0207a3a0 : KERNEL32!CreateToolhelp32Snapshot+0x107
    0b 00007ff6`c27713f3 : 0000004f`02076f00 00000000`00000000 0000004f`020799f0 0000004f`02077220 : TitanMonitor!ProcessManager::PreProcSubProcess+0x50 [d:\jenkens_slave\workspace\agent-release-release\win_custom\vendor-v3.3.0_v3.58.19\agent-monitor\src\windows\process_limit.cpp @ 358]
    0c 00007ff6`c277183e : 00004141`41414545 0000004f`02076f00 0000004f`02077220 0000004f`02077220 : TitanMonitor!ProcessManager::CheckSubProcessCPU+0x93 [d:\jenkens_slave\workspace\agent-release-release\win_custom\vendor-v3.3.0_v3.58.19\agent-monitor\src\windows\process_limit.cpp @ 599]
    0d 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : TitanMonitor!ProcessManager::ThreadCheckSubProcessCPU+0xe [d:\jenkens_slave\workspace\agent-release-release\win_custom\vendor-v3.3.0_v3.58.19\agent-monitor\src\windows\process_limit.cpp @ 617]



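    cpu 2 is currently at nt!KeQueryValuesThread+0x154. In execution order this is past the ThreadLock acquisition (whose spin loop is at KeQueryValuesThread+0x220), which means the thread running on cpu 2 has set the ThreadLock of thread ffffe001`c9167880 to 1, leaving all the other CPUs spinning.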

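    From the analysis above, a preliminary conclusion:
    cpu 0 is executing NtResumeProcess; while walking the threads of the process it reaches thread ffffe001`c9167880 and tries to acquire its thread lock.
    cpu 1 is running the system process WmiPrvSE.exe, which is trying to acquire the thread lock of ffffe001`c9167880.
    cpu 2 is executing CreateToolhelp32Snapshot; it has acquired the thread lock of ffffe001`c9167880 but is stuck in an inner loop, so the lock is never released.
    cpu 3 is executing CreateToolhelp32Snapshot and is trying to acquire the thread lock of ffffe001`c9167880.
    All CPUs are at IRQL 2 (DISPATCH_LEVEL) and cannot switch to another thread.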

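    Comparing several dump files shows the same scenario every time: three CPUs wait for the ThreadLock of one thread while the fourth CPU holds that ThreadLock and never releases it. Analysis of KeQueryValuesThread shows that when the thread's _KTHREAD.WaitRegister.State is 2, the while loop cannot be exited.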

    __int64 __fastcall KeQueryValuesThread(__int64 a1, _DWORD *a2)
    {
      // some code blocks
      __writecr8(2ui64);                            // raise IRQL to DISPATCH_LEVEL
      v6 = 0;
      while ( 1 )
      {
        __asm { lock bts qword ptr [rbx+40h], 0 }   // lock _KTHREAD.ThreadLock
        if ( !_CF )
          break;
        do
        // some code blocks
        while ( *(_QWORD *)(_RBX + 0x40) );         // nt!KeQueryValuesThread+0x220: wait for _KTHREAD.ThreadLock to be released
      }
      // some code blocks
      while ( 1 )
      {
        while ( 1 )
        {
          while ( 1 )
          {
            v10 = *(_BYTE *)(_RBX + 0x184);   // _KTHREAD.State
            if ( v10 == 5 )   // WAIT
            {
              v13 = *(_BYTE *)(_RBX + 0x70) & 7;
              if ( v13 == 1 || (unsigned __int8)(v13 - 3) <= 2u ) // _KTHREAD.WaitRegister.State of 1/3/4/5 can leave the loop
                goto LABEL_6;
              LOBYTE(v10) = 2;
              goto LABEL_25;
            }
            if ( *(_BYTE *)(_RBX + 0x184) == 1 )
              break;
            if ( *(_BYTE *)(_RBX + 0x184) == 2 )
            {
    LABEL_25:
             // some code blocks
            }
            else
            {
              // some code blocks
            }
          }
          // some code blocks
    LABEL_21:
          if ( _ZF )
            goto LABEL_6;
    LABEL_22:
          _InterlockedAnd8((volatile signed __int64 *)(_RBP + 48), 0i64);// nt!KeQueryValuesThread+0x154
        }
        // some code blocks
        *(_QWORD *)(_RBX + 64) = 0i64;   // unlock _KTHREAD.ThreadLock
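
The exit test in the excerpt packs four values into one comparison, which is easy to misread. The sketch below restates just that test in portable C so each value of the 3-bit State field can be checked by hand. wait_state_takes_exit is a hypothetical name, and the function models only the `goto LABEL_6` condition for a thread whose _KTHREAD.State is 5; it does not model the rest of KeQueryValuesThread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Models the condition from the excerpt:
 *   v13 = *(_BYTE *)(_RBX + 0x70) & 7;
 *   if ( v13 == 1 || (unsigned __int8)(v13 - 3) <= 2u ) goto LABEL_6;
 * The unsigned 8-bit subtraction wraps for values below 3, so only
 * 3, 4 and 5 satisfy the second half of the test. */
bool wait_state_takes_exit(uint8_t wait_register)
{
    uint8_t state = wait_register & 7u;   /* WaitRegister.State, bits 0..2 */
    return state == 1u || (uint8_t)(state - 3u) <= 2u;
}
```

Under this model the values 1, 3, 4, and 5 take the exit, while 0, 2, 6, and 7 do not, which matches the comment in the excerpt and the observation that a State of 2 stays in the loop.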



    Checking thread ffffe001`c9167880: _KTHREAD.State=5 and _KTHREAD.WaitRegister.State=2

    32.0: kd> dt _KTHREAD ffffe001`c9167880
    ntdll!_KTHREAD
       +0x000 Header           : _DISPATCHER_HEADER
       +0x018 SListFaultAddress : (null) 
       +0x020 QuantumTarget    : 0x0000002d`eccefaff
       +0x028 InitialStack     : 0xffffd000`d5943dd0 Void
       +0x030 StackLimit       : 0xffffd000`d593e000 Void
       +0x038 StackBase        : 0xffffd000`d5944000 Void
       +0x040 ThreadLock       : 1
       +0x048 CycleTime        : 0x0000002d`da665f0a
       +0x050 CurrentRunTime   : 0
       +0x054 ExpectedRunTime  : 0x134d9
       +0x058 KernelStack      : 0xffffd000`d5943110 Void
       +0x060 StateSaveArea    : 0xffffd000`d5943e00 _XSAVE_FORMAT
       +0x068 SchedulingGroup  : (null) 
       +0x070 WaitRegister     : _KWAIT_STATUS_REGISTER
       +0x071 Running          : 0 ''
       +0x072 Alerted          : [2]  ""
    ...
       +0x184 State            : 0x5 ''
    32.0: kd> dx -id 0,0,ffffe001c8266080 -r1 (*((ntdll!_KWAIT_STATUS_REGISTER *)0xffffe001c91678f0))
    (*((ntdll!_KWAIT_STATUS_REGISTER *)0xffffe001c91678f0))                 [Type: _KWAIT_STATUS_REGISTER]
        [+0x000] Flags            : 0x2 [Type: unsigned char]
        [+0x000 ( 2: 0)] State            : 0x2 [Type: unsigned char]
        [+0x000 ( 3: 3)] Affinity         : 0x0 [Type: unsigned char]
        [+0x000 ( 4: 4)] Priority         : 0x0 [Type: unsigned char]
        [+0x000 ( 5: 5)] Apc              : 0x0 [Type: unsigned char]
        [+0x000 ( 6: 6)] UserApc          : 0x0 [Type: unsigned char]
        [+0x000 ( 7: 7)] Alert            : 0x0 [Type: unsigned char]
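
For readers without the symbols at hand, the bit layout printed by dx can be decoded from the raw Flags byte. The following is a small sketch in portable C that assumes exactly the layout shown in the output above (State in bits 0 to 2, then one bit each for Affinity, Priority, Apc, UserApc, Alert); wait_status_fields and decode_wait_status are hypothetical names, not part of any Windows header.

```c
#include <stdint.h>

/* Decoded view of the _KWAIT_STATUS_REGISTER byte, following the bit
 * positions printed by dx in the dump above. */
typedef struct {
    uint8_t state;     /* bits 0..2 */
    uint8_t affinity;  /* bit 3 */
    uint8_t priority;  /* bit 4 */
    uint8_t apc;       /* bit 5 */
    uint8_t user_apc;  /* bit 6 */
    uint8_t alert;     /* bit 7 */
} wait_status_fields;

/* Split the raw Flags byte into its fields with shifts and masks. */
wait_status_fields decode_wait_status(uint8_t flags)
{
    wait_status_fields f;
    f.state    = flags & 7u;
    f.affinity = (flags >> 3) & 1u;
    f.priority = (flags >> 4) & 1u;
    f.apc      = (flags >> 5) & 1u;
    f.user_apc = (flags >> 6) & 1u;
    f.alert    = (flags >> 7) & 1u;
    return f;
}
```

Decoding the Flags value 0x2 from the dump gives State 2 with every other field zero, the same result dx reports.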

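    By searching the ntoskrnl.exe binary and live debugging, I found that the following places modify a thread's WaitRegister.State. I am not sure whether other places modify this value as well.

        When ResumeThread is called to resume a thread, it first locks the thread, sets WaitRegister.State=5, and then releases the thread lock.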

      while ( 1 )
      {
        __asm { lock bts qword ptr [rbx+40h], 0 }   // lock _KTHREAD.ThreadLock
        if ( !_CF )
          break;
        do
        {
          // some code blocks
        }
        while ( *(_QWORD *)(_RBX + 0x40) );         // KiResumeThread+0x230: wait for _KTHREAD.ThreadLock to be released
      }
      if ( *(_BYTE *)(_RBX + 0x184) != 5 )
        goto LABEL_58;
      v12 = *(_BYTE *)(_RBX + 0x70);
      if ( (*(_BYTE *)(_RBX + 0x70) & 7) != 4 )
        goto LABEL_58;
      if ( v4 )
        goto LABEL_57;
      v13 = 0;
      v14 = 0;
      v29 = 0;
      *(_BYTE *)(_RBX + 0x70) = v12 & 0xFD | 5;     // WaitRegister.State=5
      *(_QWORD *)(_RBX + 0x40) = 0i64;              // unlock _KTHREAD.ThreadLock

        When ExpReleaseResourceForThreadLite is called to release a resource, it first locks the thread; if WaitRegister.State is currently 5, it sets WaitRegister.State=2 and then releases the thread lock.

        v57 = *(_BYTE *)(_RBX + 0x70);              // _KTHREAD.WaitRegister.State
        LOBYTE(vars10) = 0;
        v58 = v57 & 7;
        if ( v58 == 1 || v58 == 4 )                 // _KTHREAD.WaitRegister.State=1,4
        {
          // some code blocks
        }
        else
        {
          if ( v58 )
          {
            if ( v58 == 5 )                         // _KTHREAD.WaitRegister.State=5
            {
              *(_BYTE *)(_RBX + 112) = v55 & 0xFA | 2;// _KTHREAD.WaitRegister.State=2
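
The two mask expressions in these excerpts can be checked with plain byte arithmetic. The sketch below is a portable C model of only the two assignments shown: resume_update and release_update are hypothetical names, the surrounding checks (thread State, the other branches) are left out, and returning the byte unchanged for other State values is an assumption of the model, since the excerpts do not show what happens in those branches.

```c
#include <stdint.h>

/* KiResumeThread excerpt: when the low three bits are 4,
 *   WaitRegister = v12 & 0xFD | 5;
 * clears bit 1 and sets bits 0 and 2, giving State 5. */
uint8_t resume_update(uint8_t wait_register)
{
    if ((wait_register & 7u) != 4u)
        return wait_register;            /* assumption: not modeled */
    return (uint8_t)((wait_register & 0xFDu) | 5u);
}

/* ExpReleaseResourceForThreadLite excerpt: when the low three bits are 5,
 *   WaitRegister = v & 0xFA | 2;
 * clears bits 0 and 2 and sets bit 1, giving State 2. */
uint8_t release_update(uint8_t wait_register)
{
    if ((wait_register & 7u) != 5u)
        return wait_register;            /* assumption: not modeled */
    return (uint8_t)((wait_register & 0xFAu) | 2u);
}
```

Applying resume_update to a byte whose State is 4 and then release_update to the result yields State 2 with the upper five bits untouched, which is the sequence the analysis suspects.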

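    Putting this together with the state at the time of the hang, my suspicion is that after cpu 0 set WaitRegister.State=5 for thread ffffe001`c9167880 in KiResumeThread, some system mechanism triggered ExpReleaseResourceForThreadLite (or another function that can modify a thread's WaitRegister.State) and set the thread's WaitRegister.State to 2. At that point cpu 2 got stuck in the inner loop, and all CPUs hung. What runs on cpu 0, 1, and 2 is application code, with no third-party driver involved, so this looks like a kernel bug that hangs the system.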

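    That is my analysis so far. What we can do on our side for now is reduce how often we call NtResumeProcess and CreateToolhelp32Snapshot (we were indeed calling them quite frequently), but the root cause still appears to lie in the system kernel. I hope someone can follow up on this issue; feel free to contact me for the other dump files or for more information.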
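
One simple way to reduce the call frequency is a minimum-interval throttle in front of each expensive call. The sketch below is a generic portable C illustration, not code from the product: call_throttle and throttle_allow are hypothetical names, and the caller is assumed to supply a monotonic millisecond timestamp from whatever clock it already uses. It only lowers the chance of the calls overlapping and does not remove the underlying kernel behavior described above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Minimum-interval throttle: allows a call only if at least
 * min_interval_ms has passed since the last allowed call. */
typedef struct {
    uint64_t min_interval_ms;
    uint64_t last_call_ms;
    bool     has_called;
} call_throttle;

/* now_ms must come from a monotonic clock supplied by the caller.
 * Returns true and records the time if the call may proceed. */
bool throttle_allow(call_throttle *t, uint64_t now_ms)
{
    if (t->has_called && now_ms - t->last_call_ms < t->min_interval_ms)
        return false;
    t->last_call_ms = now_ms;
    t->has_called = true;
    return true;
}
```

A caller would check throttle_allow before taking a process snapshot or resuming a process and reuse the previous result (or skip the cycle) when it returns false.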





    • Edited by keinvo, July 24, 2019 1:56
    July 24, 2019 1:52

All replies