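Title: [Discussion] How to analyze a dmp file when the stack is broken
Author: 小喂
How to analyze a dmp file when the stack is broken
Analyzing the dmp file produced when a program crashes is the core of post-mortem debugging. The first step is usually to run WinDbg's built-in !analyze -v command for a preliminary analysis, and then to dig deeper once it has reported the faulting address and the faulting call stack.
This post describes how to recover the faulting address and the faulting call stack when !analyze -v does not give a usable answer.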
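#include <stdio.h>

// Built as 32-bit x86 with MSVC; __asm is MSVC inline assembly.
int sum(int x, int y)
{
    __asm mov ebp, 0    // deliberately destroy the frame pointer
    return (x + y);     // the arguments are read relative to ebp, so this faults
}

int sumstub(int x, int y)
{
    int tmp = 0;
    printf("enter fun() ...\n");
    tmp = sum(x, y);
    printf("leave fun() ...\n");
    return tmp;
}

int main(int argc, char* argv[])
{
    printf("enter main() ...\n");
    printf("sum = %d\n", sumstub(0x1234, 0x5678));
    printf("leave main() ...\n");
    return 0;
}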
The sample program is simple: sum zeroes ebp, so the next instruction that reads x or y through ebp faults.
Open the resulting dmp file in WinDbg and start with !analyze -v. The result is:
0:000> !analyze -v
*******************************************************************************
* *
* Exception Analysis *
* *
*******************************************************************************
*** WARNING: Unable to verify checksum for Dump01.exe
*** ERROR: Symbol file could not be found. Defaulted to export symbols for lpk.dll -
*** ERROR: Symbol file could not be found. Defaulted to export symbols for Sysfer.dll -
*** ERROR: Symbol file could not be found. Defaulted to export symbols for usp10.dll -
*** ERROR: Symbol file could not be found. Defaulted to export symbols for imm32.dll -
*** ERROR: Symbol file could not be found. Defaulted to export symbols for apphelp.dll -
*** ERROR: Symbol file could not be found. Defaulted to export symbols for version.dll -
*** ERROR: Symbol file could not be found. Defaulted to export symbols for advapi32.dll -
*** ERROR: Symbol file could not be found. Defaulted to export symbols for shlwapi.dll -
FAULTING_IP:
+0
00000000 ?? ???
EXCEPTION_RECORD: ffffffff -- (.exr 0xffffffffffffffff)
ExceptionAddress: 00000000
ExceptionCode: 80000007 (Wake debugger)
ExceptionFlags: 00000000
NumberParameters: 0
BUGCHECK_STR: 80000007
PROCESS_NAME: Dump01.exe
ERROR_CODE: (NTSTATUS) 0x80000007 - {
NTGLOBALFLAG: 0
APPLICATION_VERIFIER_FLAGS: 0
DERIVED_WAIT_CHAIN:
Dl Eid Cid WaitType
-- --- ------- --------------------------
0 62c.928 Unknown
WAIT_CHAIN_COMMAND: ~0s;k;;
BLOCKING_THREAD: 00000928
DEFAULT_BUCKET_ID: APPLICATION_HANG_HungIn_ExceptionHandler
PRIMARY_PROBLEM_CLASS: APPLICATION_HANG_HungIn_ExceptionHandler
LAST_CONTROL_TRANSFER: from 7c92e9ab to 7c92eb94
FAULTING_THREAD: 00000928
STACK_TEXT:
0012f3b8 7c92e9ab 7c86372c 00000002 0012f53c ntdll!KiFastSystemCallRet
0012f3bc 7c86372c 00000002 0012f53c 00000001 ntdll!ZwWaitForMultipleObjects+0xc
0012fb38 00401dda 0012fb74 0012ffb0 0012ffc0 kernel32!UnhandledExceptionFilter+0x8e4
0012fb48 00401198 c0000005 0012fb74 0040261b Dump01!_XcptFilter+0x13e
0012ffc0 7c816fd7 011dd65c 011dd664 7ffd6000 Dump01!mainCRTStartup+0xd1
0012fff0 00000000 004010c7 00000000 00000000 kernel32!BaseProcessStart+0x23
FOLLOWUP_IP:
Dump01!_XcptFilter+13e
00401dda 5b pop ebx
SYMBOL_STACK_INDEX: 3
SYMBOL_NAME: Dump01!_XcptFilter+13e
FOLLOWUP_NAME: MachineOwner
MODULE_NAME: Dump01
IMAGE_NAME: Dump01.exe
DEBUG_FLR_IMAGE_TIMESTAMP: 46de4ed1
STACK_COMMAND: ~0s ; kb
FAILURE_BUCKET_ID: 80000007_Dump01!_XcptFilter+13e
BUCKET_ID: 80000007_Dump01!_XcptFilter+13e
Followup: MachineOwner
---------
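The reported faulting address is 0, and the stack contains only system frames (ntdll and kernel32 on the unhandled-exception path) rather than the code that actually faulted. Clearly !analyze -v has failed here, so the information has to be recovered by hand.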
0:000> ~*kv
. 0 Id: 62c.928 Suspend: 1 Teb: 7ffdf000 Unfrozen
ChildEBP RetAddr Args to Child
0012f3b8 7c92e9ab 7c86372c 00000002 0012f53c ntdll!KiFastSystemCallRet (FPO: [0,0,0])
0012f3bc 7c86372c 00000002 0012f53c 00000001 ntdll!ZwWaitForMultipleObjects+0xc (FPO: [5,0,0])
0012fb38 00401dda 0012fb74 0012ffb0 0012ffc0 kernel32!UnhandledExceptionFilter+0x8e4 (FPO: [Non-Fpo])
0012fb48 00401198 c0000005 0012fb74 0040261b Dump01!_XcptFilter+0x13e
0012ffc0 7c816fd7 011dd65c 011dd664 7ffd6000 Dump01!mainCRTStartup+0xd1
0012fff0 00000000 004010c7 00000000 00000000 kernel32!BaseProcessStart+0x23 (FPO: [Non-Fpo])
0:000> !teb
TEB at 7ffdf000
ExceptionList: 0012fb28
StackBase: 00130000
StackLimit: 0012a000
SubSystemTib: 00000000
FiberData: 00001e00
ArbitraryUserPointer: 00000000
Self: 7ffdf000
EnvironmentPointer: 00000000
ClientId: 0000062c . 00000928
RpcHandle: 00000000
Tls Storage: 00000000
PEB Address: 7ffd6000
LastErrorValue: 0
LastStatusValue: 103
Count Owned Locks: 0
HardErrorMode: 0
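First dump the stacks of all threads and pick out the one that looks like it went wrong. This sample has only one thread, so that thread must be the faulting one. Then display the TEB of the faulting thread.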
0:000> dps 0x0012a000 0x00130000
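Using the location and size of the stack (StackLimit and StackBase from the TEB), dump the entire contents of the stack.
From the way Windows dispatches exceptions, every exception that the debugger does not handle ends up in ntdll!KiUserExceptionDispatcher, which looks for an SEH handler to process it. So search the dumped stack for the string ntdll!KiUserExceptionDispatcher.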
0012fc50 00000000
0012fc54 7c92eafa ntdll!KiUserExceptionDispatcher+0xe
0012fc58 00000000
0012fc5c 0012fc84
Then use the prototype of KiUserExceptionDispatcher to locate the CONTEXT structure that was saved when the exception occurred.
; VOID
; KiUserExceptionDispatcher (
; IN PEXCEPTION_RECORD ExceptionRecord,
; IN PCONTEXT ContextRecord
; )
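The search-and-pick step can be sketched in C over the four stack slots quoted above. This is only an illustration, not something WinDbg runs: the StackSlot type, the helper names, and the 0x20-byte function size used in the usage note are hypothetical, and the two-slot distance between the return address and the ContextRecord pointer is simply what this sample listing shows.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One row of `dps` output: a stack slot address and the DWORD stored there. */
typedef struct {
    uint32_t addr;
    uint32_t value;
} StackSlot;

/* Assumption taken from the sample listing: the ContextRecord pointer
   sits two slots above the return address. */
#define CONTEXT_SLOT_OFFSET 2u

/* The four slots quoted from the sample dump. */
static const StackSlot kSample[] = {
    { 0x0012fc50u, 0x00000000u },
    { 0x0012fc54u, 0x7c92eafau },  /* ntdll!KiUserExceptionDispatcher+0xe */
    { 0x0012fc58u, 0x00000000u },
    { 0x0012fc5cu, 0x0012fc84u },
};

/* Index of the first slot whose value lies in
   [func_start, func_start + func_size), or -1 if there is none. */
static int find_return_into(const StackSlot *slots, size_t count,
                            uint32_t func_start, uint32_t func_size)
{
    assert(slots != NULL || count == 0);
    for (size_t i = 0; i < count; ++i) {
        /* Unsigned subtraction wraps for values below func_start,
           so a single comparison checks both bounds. */
        if ((uint32_t)(slots[i].value - func_start) < func_size)
            return (int)i;
    }
    return -1;
}

/* Read the candidate ContextRecord pointer relative to the return-address
   slot. Returns 1 on success, 0 if the slot is outside the listing. */
static int context_pointer(const StackSlot *slots, size_t count,
                           int ret_index, uint32_t *out)
{
    size_t i;
    if (ret_index < 0)
        return 0;
    i = (size_t)ret_index + CONTEXT_SLOT_OFFSET;
    if (i >= count)
        return 0;
    *out = slots[i].value;
    return 1;
}
```

Calling find_return_into(kSample, 4, 0x7c92eaec, 0x20) selects the slot at 0x0012fc54 (0x7c92eaec is 0x7c92eafa minus the +0xe offset shown by the symbol), and context_pointer then yields 0x0012fc84.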
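The second parameter points to the CONTEXT structure (0x0012fc84 in the stack listing above). Use WinDbg's .cxr command to display and switch to that CONTEXT.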
0:000> .cxr 0x0012fc84
eax=00005678 ebx=7ffd6000 ecx=00001234 edx=7c92eb94 esi=011dd664 edi=011dd65c
eip=0040100b esp=0012ff50 ebp=00000000 iopl=0 nv up ei pl nz na pe nc
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00010206
Dump01!sum+0xb:
0040100b 8b4508 mov eax,dword ptr [ebp+8] ss:0023:00000008=????????
0:000> kv
*** Stack trace for last set context - .thread/.cxr resets it
ChildEBP RetAddr Args to Child
00000000 00000000 00000000 00000000 00000000 Dump01!sum+0xb (CONV: cdecl) [E:\Works\Dump01\Dump01.cpp @ 10]
The faulting address is now known to be 0x0040100b. Next, recover the correct call stack at the fault.
0:000> ?? sizeof(ntdll!_CONTEXT)
unsigned int 0x2cc
0:000> ? 0x0012fc84 + 0x2cc
Evaluate expression: 1245008 = 0012ff50
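The calculation shows that the stack pointer just before the fault was 0x0012ff50, which agrees with the esp value in the CONTEXT displayed by .cxr.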
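The same cross-check can be written as a few lines of C. This is a minimal sketch: the 0x2cc size is the value WinDbg reported for ntdll!_CONTEXT in this dump, and the macro and function names are made up for illustration.

```c
#include <stdint.h>

/* Size of the 32-bit x86 CONTEXT, as printed by `?? sizeof(ntdll!_CONTEXT)`. */
#define X86_CONTEXT_SIZE 0x2ccu

/* Address of the first byte past a CONTEXT record starting at context_addr. */
static uint32_t context_end(uint32_t context_addr)
{
    return context_addr + X86_CONTEXT_SIZE;
}

/* Non-zero when the CONTEXT record ends exactly at the saved stack pointer,
   which is the situation observed in the sample dump. */
static int context_abuts_esp(uint32_t context_addr, uint32_t saved_esp)
{
    return context_end(context_addr) == saved_esp;
}
```

With the values from this dump, context_end(0x0012fc84) is 0x0012ff50, the same number that appears as esp in the .cxr output.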
0:000> ub 0x0040100b L 6
Dump01!sum [E:\Works\Dump01\Dump01.cpp @ 7]:
00401000 55 push ebp
00401001 8bec mov ebp,esp
00401003 53 push ebx
00401004 56 push esi
00401005 57 push edi
00401006 bd00000000 mov ebp,0
0:000> dps 0x0012ff50 L 0x10
0012ff50 011dd65c
0012ff54 011dd664
0012ff58 7ffd6000
0012ff5c 0012ff70
0012ff60 0040103b Dump01!sumstub+0x25 [E:\Works\Dump01\Dump01.cpp @ 19]
0012ff64 00001234
0012ff68 00005678
0012ff6c 00000000
0012ff70 0012ff80
0012ff74 00401074 Dump01!main+0x1f [E:\Works\Dump01\Dump01.cpp @ 30]
0012ff78 00001234
0012ff7c 00005678
0012ff80 0012ffc0
0012ff84 0040117b Dump01!mainCRTStartup+0xb4
0012ff88 00000001
0012ff8c 00520eb0
0:000> r
Last set context:
eax=00005678 ebx=7ffd6000 ecx=00001234 edx=7c92eb94 esi=011dd664 edi=011dd65c
eip=0040100b esp=0012ff50 ebp=00000000 iopl=0 nv up ei pl nz na pe nc
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00010206
Dump01!sum+0xb:
0040100b 8b4508 mov eax,dword ptr [ebp+8] ss:0023:00000008=????????
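Disassembling the instructions just before the faulting address shows the cause: the instruction at 0x00401006 sets ebp to zero, so the following instruction, which reads a parameter through ebp, faults. The disassembly also shows that ebx, esi and edi were pushed onto the stack before the fault. Comparing that with the stack contents at 0x0012ff50, the three values stored there match edi, esi and ebx in the register dump, which confirms that 0x0012ff50 is the stack pointer just before the fault. The next slot, 0x0012ff5c, holds the ebp saved by push ebp, so 0x0012ff5c is the frame pointer sum should have had, and the correct call stack can be rebuilt from it.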
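The frame reconstruction described above can be sketched in C by replaying the 16 DWORDs from the dps output. This is an illustration only: the Frame type and the helper names are hypothetical, and the walk assumes every function keeps a standard ebp frame (saved ebp at [ebp], return address at [ebp+4]), as the functions in this sample do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SNAPSHOT_BASE  0x0012ff50u  /* address of the first DWORD below */
#define SNAPSHOT_COUNT 16u

/* The 16 DWORDs shown by `dps 0x0012ff50 L 0x10` in the sample dump. */
static const uint32_t kSnapshot[SNAPSHOT_COUNT] = {
    0x011dd65cu, 0x011dd664u, 0x7ffd6000u, 0x0012ff70u,
    0x0040103bu, 0x00001234u, 0x00005678u, 0x00000000u,
    0x0012ff80u, 0x00401074u, 0x00001234u, 0x00005678u,
    0x0012ffc0u, 0x0040117bu, 0x00000001u, 0x00520eb0u,
};

typedef struct {
    uint32_t frame;  /* value ebp would hold in this frame */
    uint32_t ret;    /* return address stored at frame + 4 */
} Frame;

/* Read the DWORD at addr from the snapshot; returns 0 if out of range. */
static int read_dword(uint32_t addr, uint32_t *out)
{
    uint32_t off = addr - SNAPSHOT_BASE;  /* wraps to a huge value if addr < base */
    if (off % 4u != 0 || off / 4u >= SNAPSHOT_COUNT)
        return 0;
    *out = kSnapshot[off / 4u];
    return 1;
}

/* Frame pointer the function should have had: esp plus one DWORD for each
   register pushed after `mov ebp,esp`. */
static uint32_t rebuild_ebp(uint32_t esp, unsigned pushed_regs)
{
    return esp + 4u * pushed_regs;
}

/* Follow the saved-ebp links, storing at most max frames in out.
   Stops when a frame falls outside the snapshot. Returns the frame count. */
static size_t walk_frames(uint32_t ebp, Frame *out, size_t max)
{
    size_t n = 0;
    uint32_t saved_ebp, ret;
    assert(out != NULL || max == 0);
    while (n < max && read_dword(ebp, &saved_ebp) && read_dword(ebp + 4u, &ret)) {
        out[n].frame = ebp;
        out[n].ret = ret;
        ++n;
        if (saved_ebp <= ebp)  /* links must move toward higher addresses */
            break;
        ebp = saved_ebp;
    }
    return n;
}
```

Starting from rebuild_ebp(0x0012ff50, 3), the walk visits the frames at 0x0012ff5c, 0x0012ff70 and 0x0012ff80 and then stops, because the next frame (0x0012ffc0) lies beyond the 16 DWORDs that were dumped.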
0:000> kv L = 0x0012ff5c
ChildEBP RetAddr Args to Child
0012ff5c 0040103b 00001234 00005678 00000000 Dump01!sum+0xb (CONV: cdecl)
0012ff70 00401074 00001234 00005678 0012ffc0 Dump01!sumstub+0x25 (CONV: cdecl)
0012ff80 0040117b 00000001 00520eb0 00520e20 Dump01!main+0x1f (CONV: cdecl)
0012ffc0 7c816fd7 011dd65c 011dd664 7ffd6000 Dump01!mainCRTStartup+0xb4
0012fff0 00000000 004010c7 00000000 00000000 kernel32!BaseProcessStart+0x23 (FPO: [Non-Fpo])
This stack starts at kernel32!BaseProcessStart and ends exactly at the faulting address, so it should be the correct call stack for the fault.