Title: [Discussion] How to analyze a dmp file when the stack is corrupted
Author: 小喂

How to analyze a dmp file when the stack is corrupted


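Analyzing the dmp file generated when a program crashes is the main job of post-mortem debugging. The first step is usually to run WinDbg's built-in !analyze -v command for an initial pass, and once it has reported the faulting address and the faulting call stack, to dig into the details.

This post describes a way to recover the faulting address and the faulting call stack when !analyze -v does not work.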

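    #include <stdio.h>

    /* Build as 32-bit x86 with MSVC: __asm is MSVC inline assembly. */

    int sum(int x, int y)
    {
        /* Deliberately destroy the frame pointer to simulate a corrupted stack. */
        __asm mov ebp, 0

        /* x and y are addressed relative to ebp, so this read faults. */
        return (x + y);
    }

    int sumstub(int x, int y)
    {
        int tmp = 0;

        printf("enter fun() ...\n");

        tmp = sum(x, y);

        printf("leave fun() ...\n");

        return tmp;
    }

    int main(int argc, char* argv[])
    {
        printf("enter main() ...\n");

        printf("sum = %d\n", sumstub(0x1234, 0x5678));

        printf("leave main() ...\n");

        return 0;
    }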



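The sample program is simple: sum zeroes ebp, so the read of x or y that follows, which is addressed relative to ebp, faults.

Open the resulting dmp file in WinDbg and run !analyze -v first. The result is: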
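To make the failure concrete, here is a minimal sketch of the frame layout that sum depends on. It assumes a conventional 32-bit ebp-based frame (saved ebp at [ebp], return address at [ebp+4], first argument at [ebp+8]); the helper name param_address is made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: address of the index-th 4-byte argument in a
 * conventional 32-bit ebp-based frame.
 *   [ebp+0] saved ebp, [ebp+4] return address, [ebp+8] first argument */
uint32_t param_address(uint32_t ebp, unsigned index)
{
    assert(index < 16);            /* sanity bound for this sketch only */
    return ebp + 8u + 4u * index;
}
```

With ebp forced to 0, param_address(0, 0) is 0x8, which is the address the faulting instruction tries to read in the register dump further down. With the frame base 0x0012ff5c it gives 0x0012ff64, the stack slot that holds 00001234.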

    0:000> !analyze -v
    *******************************************************************************
    *                                                                             *
    *                        Exception Analysis                                   *
    *                                                                             *
    *******************************************************************************

    *** WARNING: Unable to verify checksum for Dump01.exe
    *** ERROR: Symbol file could not be found.  Defaulted to export symbols for lpk.dll - 
    *** ERROR: Symbol file could not be found.  Defaulted to export symbols for Sysfer.dll - 
    *** ERROR: Symbol file could not be found.  Defaulted to export symbols for usp10.dll - 
    *** ERROR: Symbol file could not be found.  Defaulted to export symbols for imm32.dll - 
    *** ERROR: Symbol file could not be found.  Defaulted to export symbols for apphelp.dll - 
    *** ERROR: Symbol file could not be found.  Defaulted to export symbols for version.dll - 
    *** ERROR: Symbol file could not be found.  Defaulted to export symbols for advapi32.dll - 
    *** ERROR: Symbol file could not be found.  Defaulted to export symbols for shlwapi.dll - 

    FAULTING_IP: 
    +0
    00000000 ??              ???

    EXCEPTION_RECORD:  ffffffff -- (.exr 0xffffffffffffffff)
    ExceptionAddress: 00000000
       ExceptionCode: 80000007 (Wake debugger)
      ExceptionFlags: 00000000
    NumberParameters: 0

    BUGCHECK_STR:  80000007

    PROCESS_NAME:  Dump01.exe

    ERROR_CODE: (NTSTATUS) 0x80000007 - {

    NTGLOBALFLAG:  0

    APPLICATION_VERIFIER_FLAGS:  0

    DERIVED_WAIT_CHAIN:  

    Dl Eid Cid     WaitType
    -- --- ------- --------------------------
       0   62c.928 Unknown                

    WAIT_CHAIN_COMMAND:  ~0s;k;;

    BLOCKING_THREAD:  00000928

    DEFAULT_BUCKET_ID:  APPLICATION_HANG_HungIn_ExceptionHandler

    PRIMARY_PROBLEM_CLASS:  APPLICATION_HANG_HungIn_ExceptionHandler

    LAST_CONTROL_TRANSFER:  from 7c92e9ab to 7c92eb94

    FAULTING_THREAD:  00000928

    STACK_TEXT:  
    0012f3b8 7c92e9ab 7c86372c 00000002 0012f53c ntdll!KiFastSystemCallRet
    0012f3bc 7c86372c 00000002 0012f53c 00000001 ntdll!ZwWaitForMultipleObjects+0xc
    0012fb38 00401dda 0012fb74 0012ffb0 0012ffc0 kernel32!UnhandledExceptionFilter+0x8e4
    0012fb48 00401198 c0000005 0012fb74 0040261b Dump01!_XcptFilter+0x13e
    0012ffc0 7c816fd7 011dd65c 011dd664 7ffd6000 Dump01!mainCRTStartup+0xd1
    0012fff0 00000000 004010c7 00000000 00000000 kernel32!BaseProcessStart+0x23


    FOLLOWUP_IP: 
    Dump01!_XcptFilter+13e
    00401dda 5b              pop     ebx

    SYMBOL_STACK_INDEX:  3

    SYMBOL_NAME:  Dump01!_XcptFilter+13e

    FOLLOWUP_NAME:  MachineOwner

    MODULE_NAME: Dump01

    IMAGE_NAME:  Dump01.exe

    DEBUG_FLR_IMAGE_TIMESTAMP:  46de4ed1

    STACK_COMMAND:  ~0s ; kb

    FAILURE_BUCKET_ID:  80000007_Dump01!_XcptFilter+13e

    BUCKET_ID:  80000007_Dump01!_XcptFilter+13e

    Followup: MachineOwner
    ---------



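The analysis reports a faulting address of 0, and the stack shows only system frames (the unhandled-exception path through kernel32 and ntdll) instead of the code that actually faulted. Clearly !analyze -v has gone wrong this time, and the information we want has to be dug out by hand.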

    0:000> ~*kv

    .  0  Id: 62c.928 Suspend: 1 Teb: 7ffdf000 Unfrozen
    ChildEBP RetAddr  Args to Child              
    0012f3b8 7c92e9ab 7c86372c 00000002 0012f53c ntdll!KiFastSystemCallRet (FPO: [0,0,0])
    0012f3bc 7c86372c 00000002 0012f53c 00000001 ntdll!ZwWaitForMultipleObjects+0xc (FPO: [5,0,0])
    0012fb38 00401dda 0012fb74 0012ffb0 0012ffc0 kernel32!UnhandledExceptionFilter+0x8e4 (FPO: [Non-Fpo])
    0012fb48 00401198 c0000005 0012fb74 0040261b Dump01!_XcptFilter+0x13e
    0012ffc0 7c816fd7 011dd65c 011dd664 7ffd6000 Dump01!mainCRTStartup+0xd1
    0012fff0 00000000 004010c7 00000000 00000000 kernel32!BaseProcessStart+0x23 (FPO: [Non-Fpo])

    0:000> !teb
    TEB at 7ffdf000
        ExceptionList:        0012fb28
        StackBase:            00130000
        StackLimit:           0012a000
        SubSystemTib:         00000000
        FiberData:            00001e00
        ArbitraryUserPointer: 00000000
        Self:                 7ffdf000
        EnvironmentPointer:   00000000
        ClientId:             0000062c . 00000928
        RpcHandle:            00000000
        Tls Storage:          00000000
        PEB Address:          7ffd6000
        LastErrorValue:       0
        LastStatusValue:      103
        Count Owned Locks:    0
        HardErrorMode:        0



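First list the stacks of all threads and pick out the one that looks like it went wrong. This example has only one thread, so it has to be the faulting one. Then display the TEB of that thread.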

    0:000> dps 0x0012a000 0x00130000



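Using the stack location and size from the TEB (StackLimit 0012a000 up to StackBase 00130000), dump the entire contents of the stack.

From the Windows exception handling flow we know that every exception not handled by the debugger eventually goes to ntdll!KiUserExceptionDispatcher, which looks for an SEH handler to process it. So search the dumped stack for the string ntdll!KiUserExceptionDispatcher.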

    0012fc50  00000000
    0012fc54  7c92eafa ntdll!KiUserExceptionDispatcher+0xe
    0012fc58  00000000
    0012fc5c  0012fc84
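The search itself is just a scan of the stack slots for a value that falls inside the target function. A rough sketch, assuming the stack has been copied into an array of 32-bit words: find_slot_in_range is a hypothetical helper, the lower bound 0x7c92eaec comes from 7c92eafa minus the +0xe offset shown above, and the upper bound is a guess for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: scan a snapshot of 32-bit stack slots and return the
 * address of the first slot whose value lies in [lo, hi), or 0 if there is
 * none. base is the address of words[0]. */
uint32_t find_slot_in_range(const uint32_t *words, size_t count,
                            uint32_t base, uint32_t lo, uint32_t hi)
{
    assert(words != NULL && lo < hi);
    for (size_t i = 0; i < count; ++i) {
        if (words[i] >= lo && words[i] < hi)
            return base + (uint32_t)(i * 4u);
    }
    return 0;
}
```

Fed the four slots shown above with base 0x0012fc50 and the range 0x7c92eaec to 0x7c92eb00, it reports the slot at 0x0012fc54. In practice dps has already resolved the symbol, so searching its text output works just as well.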



Then use the prototype of KiUserExceptionDispatcher to locate the CONTEXT structure that was saved when this exception occurred.

    ; VOID
    ; KiUserExceptionDispatcher (
    ;    IN PEXCEPTION_RECORD ExceptionRecord,
    ;    IN PCONTEXT ContextRecord
    ;    )



The second parameter points to the CONTEXT structure, here 0x0012fc84. Use WinDbg's .cxr command to display that CONTEXT and switch to it.

    0:000> .cxr 0x0012fc84
    eax=00005678 ebx=7ffd6000 ecx=00001234 edx=7c92eb94 esi=011dd664 edi=011dd65c
    eip=0040100b esp=0012ff50 ebp=00000000 iopl=0         nv up ei pl nz na pe nc
    cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00010206
    Dump01!sum+0xb:
    0040100b 8b4508          mov     eax,dword ptr [ebp+8] ss:0023:00000008=????????

    0:000> kv
      *** Stack trace for last set context - .thread/.cxr resets it
    ChildEBP RetAddr  Args to Child              
    00000000 00000000 00000000 00000000 00000000 Dump01!sum+0xb (CONV: cdecl) [E:\Works\Dump01\Dump01.cpp @ 10]



We now have the faulting address, 0x0040100b. Next, recover the correct faulting call stack.

    0:000> ?? sizeof(ntdll!_CONTEXT)
    unsigned int 0x2cc

    0:000> ? 0x0012fc84 + 0x2cc
    Evaluate expression: 1245008 = 0012ff50



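The calculation shows that the CONTEXT record ends at 0x0012ff50, so that is where the stack pointer was just before the fault. It agrees with the esp value in the .cxr output.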
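The same arithmetic as a small sketch. It assumes, as in this dump, that the CONTEXT record was built directly below the faulting thread's stack pointer; esp_at_fault is a hypothetical name, and 0x2cc is the size WinDbg reported above.

```c
#include <assert.h>
#include <stdint.h>

/* Size of the 32-bit CONTEXT structure as reported by
 * "?? sizeof(ntdll!_CONTEXT)" in this dump. */
#define CONTEXT_SIZE_X86 0x2ccu

/* Hypothetical helper: first address past a CONTEXT record that starts at
 * context_addr. Under the assumption stated above, this is the value esp
 * had when the exception was raised. */
uint32_t esp_at_fault(uint32_t context_addr)
{
    assert(context_addr <= UINT32_MAX - CONTEXT_SIZE_X86);  /* no wraparound */
    return context_addr + CONTEXT_SIZE_X86;
}
```

For the CONTEXT at 0x0012fc84 this yields 0x0012ff50, the same result as the ? expression above.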

    0:000> ub 0x0040100b L 6
    Dump01!sum [E:\Works\Dump01\Dump01.cpp @ 7]:
    00401000 55              push    ebp
    00401001 8bec            mov     ebp,esp
    00401003 53              push    ebx
    00401004 56              push    esi
    00401005 57              push    edi
    00401006 bd00000000      mov     ebp,0

    0:000> dps 0x0012ff50 L 0x10
    0012ff50  011dd65c
    0012ff54  011dd664
    0012ff58  7ffd6000
    0012ff5c  0012ff70
    0012ff60  0040103b Dump01!sumstub+0x25 [E:\Works\Dump01\Dump01.cpp @ 19]
    0012ff64  00001234
    0012ff68  00005678
    0012ff6c  00000000
    0012ff70  0012ff80
    0012ff74  00401074 Dump01!main+0x1f [E:\Works\Dump01\Dump01.cpp @ 30]
    0012ff78  00001234
    0012ff7c  00005678
    0012ff80  0012ffc0
    0012ff84  0040117b Dump01!mainCRTStartup+0xb4
    0012ff88  00000001
    0012ff8c  00520eb0

    0:000> r
    Last set context:
    eax=00005678 ebx=7ffd6000 ecx=00001234 edx=7c92eb94 esi=011dd664 edi=011dd65c
    eip=0040100b esp=0012ff50 ebp=00000000 iopl=0         nv up ei pl nz na pe nc
    cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00010206
    Dump01!sum+0xb:
    0040100b 8b4508          mov     eax,dword ptr [ebp+8] ss:0023:00000008=????????



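Disassembling the few instructions before the faulting address shows the cause: the instruction at 0x00401006 sets ebp to zero, so the next instruction, which reads a parameter through ebp, faults. The prologue also shows that ebx, esi and edi were pushed onto the stack before the fault. Comparing that with the stack contents at 0x0012ff50 (three saved register values, then the saved ebp, then the return address into sumstub) confirms that 0x0012ff50 really is the stack pointer at the moment of the fault. It also gives us the ebp value saved on the stack: the slot at 0x0012ff5c holds the caller's ebp, so 0x0012ff5c is the frame base sum should have had, and from it the correct call stack can be rebuilt.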
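The frame-base calculation in this step is simple enough to write down. A minimal sketch, assuming the prologue is push ebp / mov ebp,esp followed by a known number of 4-byte register pushes and nothing else; frame_base_after_pushes is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: given esp at the fault and the number of 4-byte
 * registers pushed after "mov ebp,esp", return the address of the
 * saved-ebp slot, i.e. the value ebp should have had in this frame. */
uint32_t frame_base_after_pushes(uint32_t esp, unsigned pushed_regs)
{
    assert(pushed_regs <= 8);      /* sanity bound for this sketch only */
    return esp + 4u * pushed_regs;
}
```

With esp = 0x0012ff50 and the three pushes (ebx, esi, edi) it returns 0x0012ff5c. If the function had also reserved local variables with sub esp, that size would have to be added as well.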

    0:000> kv L = 0x0012ff5c
    ChildEBP RetAddr  Args to Child              
    0012ff5c 0040103b 00001234 00005678 00000000 Dump01!sum+0xb (CONV: cdecl)
    0012ff70 00401074 00001234 00005678 0012ffc0 Dump01!sumstub+0x25 (CONV: cdecl)
    0012ff80 0040117b 00000001 00520eb0 00520e20 Dump01!main+0x1f (CONV: cdecl)
    0012ffc0 7c816fd7 011dd65c 011dd664 7ffd6000 Dump01!mainCRTStartup+0xb4
    0012fff0 00000000 004010c7 00000000 00000000 kernel32!BaseProcessStart+0x23 (FPO: [Non-Fpo])


Judging from this stack, it starts at kernel32!BaseProcessStart and ends exactly at the faulting address, so it should be the correct faulting call stack.
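As a cross-check, the chain that kv follows can be walked by hand over the 16 slots dumped from 0x0012ff50. The sketch below is only an illustration of the saved-ebp chain, not how WinDbg implements its unwinder; walk_ebp_chain is a hypothetical helper and assumes every frame keeps the caller's ebp at [ebp] and the return address at [ebp+4].

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: follow the saved-ebp chain through a snapshot of
 * 32-bit stack slots. base is the address of words[0]. Writes up to
 * max_frames return addresses into ret_out and returns how many it found. */
size_t walk_ebp_chain(const uint32_t *words, size_t count, uint32_t base,
                      uint32_t ebp, uint32_t *ret_out, size_t max_frames)
{
    size_t n = 0;
    assert(words != NULL && ret_out != NULL);
    while (n < max_frames) {
        /* Stop if ebp is below the snapshot or not 4-byte aligned to it. */
        if (ebp < base || (ebp - base) % 4u != 0)
            break;
        size_t idx = (ebp - base) / 4u;
        /* Both the saved-ebp slot and the return slot must be in range. */
        if (idx + 1 >= count)
            break;
        ret_out[n++] = words[idx + 1];
        /* Caller frames live at higher addresses; anything else is bogus. */
        if (words[idx] <= ebp)
            break;
        ebp = words[idx];
    }
    return n;
}
```

Started at 0x0012ff5c on those 16 slots, it finds three return addresses, 0040103b, 00401074 and 0040117b, and then stops because the next frame (0012ffc0) lies outside the snapshot. These are the same RetAddr values as the first three lines of the kv output.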