Android系统稳定性简述
稳定性问题从表现来看有: 死机重启, 自动关机, 无法开机,冻屏,黑屏以及闪退, 无响应等情况;
从技术层面来划分无外乎两大类: 长时间无法执行完成(Timeout) 以及异常崩溃(crash). 主要分类如下:
1 kernel panic
kernel panic相关错误信息代码处理:
arch/arm64/mm/fault.c
arch/arm64/kernel/traps.c
kernel/panic.c
1.1 常见错误打印
常见异常原因有hungtask,panic,oom,watchdog,softlockup
1 hungtask
内核某进程/线程长期处于D状态,无法唤醒
INFO: task xxx:xxx blocked for more than 120 seconds.
2 panic
内核崩溃,错误原因很多,比如页表异常、空指针访问等。
"Data abort info:\n"
"Mem abort info:\n"
"Unable to handle kernel %s at virtual address %016lx\n"
"Kernel panic - not syncing: %s\n"
Internal error: Oops: 96000006 [#1] PREEMPT SMP
Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
[2022/2/27 21:26:18] [254094.838383] !(0)[675:tosMsgAdvTask] list_add corruption. prev->next should be next (ffffff802e905648), but was ffffff8016ca8540. (prev=ffffff8016ca81c0).
[2022/2/27 21:26:18] [254094.851569] -(0)[675:tosMsgAdvTask] ------------[ cut here ]------------
[2022/2/27 21:26:18] [254094.854755] -(0)[675:tosMsgAdvTask] kernel BUG at lib/list_debug.c:28!
[2022/2/27 21:26:18] [254094.861344] -(0)[675:tosMsgAdvTask] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
[251:irq/63-vdec-1] Call trace:
[251:irq/63-vdec-1] [ffffffc020253d20+ 96][<ffffffc00a1401b8>] vh264_isr_thread_fn+0x22d8/0x2640 [amvdec_mh264_v4l]
[251:irq/63-vdec-1] [ffffffc020253d80+ 48][<ffffffc009f10864>] vdec_thread_isr+0x8c/0xe8 [decoder_common]
[251:irq/63-vdec-1] [ffffffc020253db0+ 80][<ffffffc01016ef80>] irq_thread_fn+0x30/0x78
[251:irq/63-vdec-1] [ffffffc020253e00+ 96][<ffffffc01016ee30>] irq_thread+0x1e8/0x260
[251:irq/63-vdec-1] [ffffffc020253e60+ 0][<ffffffc01011de70>] kthread+0x14c/0x15c
[251:irq/63-vdec-1] [0000000000000000+ 0][<ffffffc010084070>] ret_from_fork+0x10/0x18
[251:irq/63-vdec-1] Code: 9400183c f85f03a8 52801009 f9539908 (39138109)
[251:irq/63-vdec-1] ---[ end trace c5cc2e4cf0888320 ]---
[251:irq/63-vdec-1] dyn_fsync: panic_flush_device_cache
[251:irq/63-vdec-1] dyn_fsync: panic_flush_device_cache: flush device cache
3 memory分配异常
[9:ksoftirqd/0] warn_alloc: 203 callbacks suppressed
[9:ksoftirqd/0] ksoftirqd/0: page allocation failure: order:0, mode:0xa20(GFP_ATOMIC), nodemask=(null),cpuset=/,mems_allowed=0
[9:ksoftirqd/0] CPU: 0 PID: 9 Comm: ksoftirqd/0 Kdump: loaded Tainted: P WC O 5.4.125-android11-2-gdc4d0f1d4a41-dirty #1
[9:ksoftirqd/0] Hardware name: Amlogic (DT)
[9:ksoftirqd/0] Call trace:
[9:ksoftirqd/0] [ffffffc020023860+ 64][<ffffffc0100c61f4>] dump_backtrace+0x0/0x18c
[9:ksoftirqd/0] [ffffffc0200238a0+ 32][<ffffffc0100c63a4>] show_stack+0x24/0x34
[9:ksoftirqd/0] [ffffffc0200238c0+ 160][<ffffffc010de41e0>] dump_stack+0xbc/0x108
[9:ksoftirqd/0] [ffffffc020023960+ 240][<ffffffc0102d640c>] warn_alloc+0xd8/0x128
[9:ksoftirqd/0] [ffffffc020023a50+ 144][<ffffffc0102d7744>] __alloc_pages_slowpath+0xb4c/0xb78
[9:ksoftirqd/0] [ffffffc020023ae0+ 80][<ffffffc0102d67b8>] __alloc_pages_nodemask+0x2ec/0x374
[9:ksoftirqd/0] [ffffffc020023b30+ 48][<ffffffc0102d7a28>] page_frag_alloc+0x100/0x1a0
[9:ksoftirqd/0] [ffffffc020023b60+ 256][<ffffffc010b760c0>] __napi_alloc_skb+0x74/0x198
[9:ksoftirqd/0] [ffffffc020023c60+ 96][<ffffffc0099083d0>] stmmac_napi_poll_rx+0x3d4/0xacc [stmmac]
[9:ksoftirqd/0] [ffffffc020023cc0+ 96][<ffffffc010b9df34>] napi_poll+0x70/0x224
[9:ksoftirqd/0] [ffffffc020023d20+ 96][<ffffffc010b9d5c4>] net_rx_action+0xc0/0x238
[9:ksoftirqd/0] [ffffffc020023d80+ 96][<ffffffc010081384>] __do_softirq+0x184/0x394
[9:ksoftirqd/0] [ffffffc020023de0+ 32][<ffffffc0100f6858>] run_ksoftirqd+0x30/0x7c
[9:ksoftirqd/0] [ffffffc020023e00+ 96][<ffffffc010122954>] smpboot_thread_fn+0x1c4/0x2c0
[9:ksoftirqd/0] [ffffffc020023e60+ 0][<ffffffc01011d5e4>] kthread+0x14c/0x15c
[9:ksoftirqd/0] [0000000000000000+ 0][<ffffffc010084070>] ret_from_fork+0x10/0x18
[9:ksoftirqd/0] Mem-Info:
4 oom
Linux内核有个机制叫OOM killer(Out-Of-Memory killer),该机制会监控那些占用内存过大,尤其是瞬间很快消耗大量内存的进程,为了防止内存耗尽,内核会按优先级等相关来杀掉一些进程,直到free内存在一定范围内
Out of memory: Killed process 225 (busybox) total-vm:1180kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:20kB oom_score_adj:0
oom_reaper: reaped process 225 (busybox), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB Killed
busybox invoked oom-killer: gfp_mask=0x2100cca(GFP_HIGHUSER_MOVABLE|0x2000000), order=0, oom_score_adj=0
CPU: 0 PID: 228 Comm: busybox Tainted: G W 5.4.125-android11-2-g58da01d48f87-dirty #1
Hardware name: Amlogic (DT)
Call trace:
[ffffffc0201239e0+ 48][<ffffffc0100be19c>] dump_backtrace+0x0/0x110
[ffffffc020123a10+ 32][<ffffffc0100be378>] show_stack+0x24/0x34
[ffffffc020123a30+ 64][<ffffffc010ae2338>] dump_stack+0xbc/0x108
[ffffffc020123a70+ 48][<ffffffc010245d30>] dump_header+0x50/0x1d4
[ffffffc020123aa0+ 48][<ffffffc010245bd4>] oom_kill_process+0x9c/0x1a8
[ffffffc020123ad0+ 208][<ffffffc010245670>] out_of_memory+0x1b4/0x27c
[ffffffc020123ba0+ 144][<ffffffc01028fba0>] __alloc_pages_slowpath+0x740/0x9cc
[ffffffc020123c30+ 80][<ffffffc01028f0e8>] __alloc_pages_nodemask+0x2b8/0x30c
[ffffffc020123c80+ 80][<ffffffc01027a368>] wp_page_copy+0x98/0x4d0
[ffffffc020123cd0+ 48][<ffffffc010278cc8>] do_wp_page+0x304/0x45c
[ffffffc020123d00+ 224][<ffffffc01027b054>] handle_pte_fault+0xe0/0xac4
[ffffffc020123de0+ 64][<ffffffc01027944c>] handle_mm_fault+0x1c8/0x3c8
[ffffffc020123e20+ 96][<ffffffc010aff754>] do_page_fault+0x2a4/0x420
[ffffffc020123e80+ 368][<ffffffc010080a2c>] do_mem_abort+0x50/0xbc
[ffffffc020123ff0+ 0][<ffffffc010083848>] el0_da+0x1c/0x20
Mem-Info:
5 soft lockup
soft lockup是指内核中有BUG导致在内核模式下一直循环的时间超过n s(n为配置参数),而其他进程得不到运行的机会;实现方式:内核对于每一个cpu都有一个监控进程watchdog/x 每秒钟会对比进程时间戳,对比时间戳就可以知道运行情况了,进程长时间没有运行,时间戳没有更新,超过一定时间就报错。hard lockup的发生是由于禁止了CPU的所有中断超过一定时间(几秒)这种情况下,外部设备发生的中断无法处理,内核认为此时发生了所谓的lockup。
下图log中发生lockup的线程为aml-ditest-0
BUG: workqueue lockup - pool cpus=3 node=0 flags=0x0 nice=0 stuck for 79s) not found.
[8865:cae_front] dyn_fsync: panic_flush_device_cache
[8865:cae_front] dyn_fsync: panic_flush_device_cache: flush device cache
[0:swapper/2] RT throttling on cpu:3 rt_time:991ms, curr:aml-ditest-0/165 prio:0 sum_runtime:248810ms
[0:swapper/2] RT throttling on cpu:3 rt_time:991ms, curr:aml-ditest-0/165 prio:0 sum_runtime:249802ms
[8865:cae_front] dyn_fsync: panic_flush_device_cache: remaining timeout = 0
[8865:cae_front] Kernel panic - not syncing: workqueue lockup !!!
1.2 案列分析
Unable to handle kernel NULL pointer dereference at virtual address 00000000000002c0
Mem abort info:
ESR = 0x96000006
EC = 0x25: DABT (current EL), IL = 32 bits
Data abort info:
ISV = 0, ISS = 0x00000006
user pgtable: 4k pages, 39-bit VAs, pgdp=0000000065733000
pgd=0000000062020003, pud=0000000062020003, pmd=0000000000000000
Internal error: Oops: 96000006 [#1] PREEMPT SMP
从Unable to handle kernel NULL pointer dereference at virtual address 00000000000002c0和Mem abort info两个地方看出panic的原因为空指针引起,ESR = 0x96000006, ESR为ARM异常寄存器错误信息,96000006 [#1] 后面的数字1表示发生的次数
Modules linked in: adc_keypad(O) ad82584f(O) snd_soc_aml_t9015(O) snd_soc(O) snd_soc_dummy_codec(O) aml_media(O) aml_dvb_extern(O) aml_spicc snd_soc_notify(O) meson_clk_measure aml_gki_tool(O) aml_mailbox(O)
PC : ffffffc009c329f0, PFN:8d4a4 V
SP : ffffff809612ad70, PFN:9612a L
FAR : 00000000000002c0 U
R0 : ffffffc009c38a5f, PFN:8d4aa V
R1 : ffffffc009c38000, PFN:8d4aa V
R3 : ffffffc01649b000, PFN:***** V
R9 : ffffffc0113a5cb0, PFN: 23a5 V
R15 : ffffffc0110e56d7, PFN: 20e5 V
R16 : ffffffc010debb00, PFN:***** V
R18 : ffffff809610b840, PFN:9610b L
R19 : ffffffc009c53000, PFN:8d4c5 V
R29 : ffffff809612ada0, PFN:9612a L
R30 : ffffffc009c329d0, PFN:8d4a4 V
CPU: 1 PID: 2286 Comm: log_out Tainted: P C O 5.4.125-android11-2-ge3776401faa0-dirty-ab97 #1
Hardware name: Amlogic (DT)
pstate: 20400089 (nzCv daIf +PAN -UAO)
pc : _0d503df0a3108831179ecc8bd8eab8d7+0xfd0/0x1294 [pm_kara_release]
lr : _0d503df0a3108831179ecc8bd8eab8d7+0xfb0/0x1294 [pm_kara_release]
[ffffffc02162be20+ 32][<ffffffc01030b8bc>] __arm64_sys_write+0x24/0x34
[ffffffc02162be40+ 96][<ffffffc0100d0f40>] el0_svc_common+0xc8/0x258
[ffffffc02162bea0+ 336][<ffffffc0100d10fc>] el0_svc_compat_handler+0x2c/0x3c
[ffffffc02162bff0+ 0][<ffffffc010083524>] el0_svc_compat+0x8/0x24
[ffffff809612ada0+ 96][] _0d503df0a3108831179ecc8bd8eab8d7+0xfd0/0x1294 [pm_kara_release]
Modules linked in 后面是动态加载的ko模块,如果panic是模块中问题,会在堆栈后面打印对应的模块名,从下面这个log中发现出错的地方为pm_kara_release模块 ,同时会把PC等相关的寄存器信息打印出来
Call trace:
[ffffff809612ada0+ 96][<ffffffc009c329f0>] _0d503df0a3108831179ecc8bd8eab8d7+0xfd0/0x1294 [pm_kara_release]
[ffffff809612ae00+ 32][<ffffffc009c31650>] _d0571ed82a857ce697fd4a1931c7f2f9+0xad8/0xb18 [pm_kara_release]
[ffffffc02162bd60+ 48][<ffffffc01030b440>] __vfs_write+0x160/0x184
[ffffffc02162bd90+ 80][<ffffffc01030b608>] vfs_write+0xe8/0x1ac
[ffffffc02162bde0+ 64][<ffffffc01030b844>] ksys_write+0x78/0xcc
[ffffffc02162be20+ 32][<ffffffc01030b8bc>] __arm64_sys_write+0x24/0x34
[ffffffc02162be40+ 96][<ffffffc0100d0f40>] el0_svc_common+0xc8/0x258
[ffffffc02162bea0+ 336][<ffffffc0100d10fc>] el0_svc_compat_handler+0x2c/0x3c
[ffffffc02162bff0+ 0][<ffffffc010083524>] el0_svc_compat+0x8/0x24
Code: f9408908 d0000020 d0000021 91297c00 (b942c104)
---[ end trace 589d34c66e567ecd ]---
Kernel panic - not syncing: Fatal exception in interrupt
Call trace将panic的堆栈信息打印出来,panic出现的地方大概有两种情况,一种是出现在进程上下文和中断上下文中,进程上下文中发生的panic为用户空间通过系统调用驱动程序引起,中断上下文中发生的panic为处理中断程序时发生。下面的el0_svc_compat->el0_svc_compat_handler->el0_svc_common->这个过程发现引发的panic为用户空间调用write系统调用时触发。
从上面这个信息可以清楚知道发生的模块为pm_kara_release,不能用add2line工具解析[ffffff809612ada0+ 96]这个地址,这个地址为vmlinux中的地址,如果直接转换可能会是?号或者行数不对应等。
- 通过addr2line工具解析pm_kara_release模块中0d503df0a3108831179ecc8bd8eab8d7这个地址,解析出对应的函数,大致发生的地方在哪个位置
- 然后通过objdump反汇编出来,在0d503df0a3108831179ecc8bd8eab8d7地址的基础上+0xfd0为对应的行,这个地方为最终出问题的地方也可以通过nm工具来搜索0d503df0a3108831179ecc8bd8eab8d7这个地址
解析ko地址时,首先要确认ko上基于32位编译的还是64位编译的,否则会报如下错误,ko文件需要使用带有符号的ko或者lib
arm-linux-androideabi-addr2line: out/media.ko: File format not recognized
1.3 常用工具
// objdump -d <file(s)>: 将代码段反汇编
// objdump -S <file(s)>: 将代码段反汇编 arm-linux-objdump -d -S file
arm-linux-objdump
arm-linux-nm
arm-linux-add2line
aarch64-linux-gnu-addr2line
aarch64-none-elf
2 tombstoned
tombstoned主要信息如下
1 死亡进程相关信息
pid : tombstoned的进程ID
tid :线程ID
name: 线程名
pid: 438, tid: 562, name: HwBinder:438_1 >>> /vendor/tvos/bin/sitatvservice <<<
2异常退出原因
signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x1 Cause: null pointer dereference
3 backtrace
3 寄存器相关信息
4 虚拟内存映射关系
5 各个线程的堆栈信息
6 logcat相关信息
2.1 案列分析
Build fingerprint: 'TCL/t982_ar31a8/t982_ar31a8:11/RQ3A.210705.001/15:userdebug/release-keys'
Revision: '0'
ABI: 'arm'
Timestamp: 2022-03-02 09:27:08+0800
pid: 438, tid: 562, name: HwBinder:438_1 >>> /vendor/tvos/bin/sitatvservice <<<
uid: 0
signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x1
Cause: null pointer dereference
r0 eae36ee8 r1 eae36ee0 r2 00000000 r3 00000200
r4 00000006 r5 eae36ee0 r6 eae36edc r7 00000001
r8 eae36ed8 r9 eadef2d8 r10 eadf0ced r11 eadf0cda
ip eca8da30 sp e8a5fd40 lr eca10a85 pc eae1495e
backtrace:
NOTE: Function names and BuildId information is missing for some frames due
NOTE: to unreadable libraries. For unwinds of apps, only shared libraries
NOTE: found under the lib/ directory are readable.
NOTE: On this device, run setenforce 0 to make the libraries readable.
#00 pc 0004895e /vendor/lib/libamlhalcore.so (removeCommandValue+146) (BuildId: 5cf25626a221d171013afd9d6a81d5c5)
#01 pc 000491a9 /vendor/lib/libamlhalcore.so (Aml_Hal_SetMasterVol+68) (BuildId: 5cf25626a221d171013afd9d6a81d5c5)
#02 pc 0005191f /vendor/lib/libamlhalcore.so (AML_HAL_CEC_setDeviceConnectionState+678) (BuildId: 5cf25626a221d171013afd9d6a81d5c5)
#03 pc 000107a7 /vendor/lib/libsystemcontrolservice.so
进程退出原因为空指针异常,backtrace为出错时的对应线程的堆栈打印信息:在libamlhalcore.so中的removeCommandValue+146处挂掉了,通过addr2line工具定位到具体出错是哪个源文件里的哪个函数,确保debug的libamlhalcore.so和对应测试的版本一致,要不然可能会解析错误或者?等
arm-linux-add2line -e libamlhalcore.so -Cfs 0004895e
EX_TRANSACTION_FAILED
Build fingerprint: 'TCL/t982_ar31a8/t982_ar31a8:11/RQ3A.210705.001/AR01:userdebug/release-keys'
Revision: '0'
ABI: 'arm'
Timestamp: 2022-02-10 17:01:29+0800
pid: 449, tid: 497, name: binder_server >>> /vendor/tvos/bin/sitatvservice <<<
uid: 0
signal 6 (SIGABRT), code -1 (SI_QUEUE), fault addr --------
Abort message: 'Failed HIDL return status not checked. Usually this happens because of a transport error (error parceling, binder driver, or from unparceling). If you see this in code calling into "Bn" classes in for a HAL server process, then it is likely that the code there is returning transport errors there (as opposed to errors defined within its protocol). Error is: Status(EX_TRANSACTION_FAILED): 'DEAD_OBJECT: ''
r0 00000000 r1 000001f1 r2 00000006 r3 f1484bd0
r4 f1484be4 r5 f1484bc8 r6 000001c1 r7 0000016b
r8 f1484bd0 r9 f1484be0 r10 f1484c00 r11 f1484bf0
ip 000001f1 sp f1484ba0 lr f19eeb0d pc f19eeb20
出现Error is: Status(EX_TRANSACTION_FAILED): 'DEAD_OBJECT:的情况为是hidl binder客户端向服务端发送请求时,binder服务端进程挂掉了,没有添加错误检测,导致引发SIGABRT,添加Return<>和linktoDeatch机制
2.2 debuggerd
debuggerd -b [pid],可输出进程的Native调用栈
[2022-08-12 11:10:02] ----- pid 9346 at 2022-08-12 11:10:03 -----
[2022-08-12 11:10:02] Cmd line: system_server
[2022-08-12 11:10:02] ABI: 'arm'
[2022-08-12 11:10:02]
[2022-08-12 11:10:02] "system_server" sysTid=9346
[2022-08-12 11:10:02] #00 pc 00053ebc /system/lib/libc.so (__ioctl+8)
[2022-08-12 11:10:02] #01 pc 00021bb9 /system/lib/libc.so (ioctl+36)
[2022-08-12 11:10:02] #02 pc 001d768c /system/lib/libsitatv.so (sita_android::IPCThreadState::talkWithDriver(bool)+164)
[2022-08-12 11:10:02] #03 pc 001d7e74 /system/lib/libsitatv.so (sita_android::IPCThreadState::waitForResponse(sita_android::Parcel*, int*)+48)
[2022-08-12 11:10:02] #04 pc 001d87a0 /system/lib/libsitatv.so (sita_android::IPCThreadState::transact(int, unsigned int, sita_android::Parcel const&, sita_android::Parcel*, unsigned int)+160)
[2022-08-12 11:10:02] #05 pc 001d6054 /system/lib/libsitatv.so (sita_android::BpBinder::transact(unsigned int, sita_android::Parcel const&, sita_android::Parcel*, unsigned int)+68)
[2022-08-12 11:10:02] #06 pc 000889ec /system/lib/libsitatv.so (tos_factory_get_project_id+164)
[2022-08-12 11:10:02] #07 pc 00017b15 /system/lib/libcom_tcl_tv_jni.so
[2022-08-12 11:10:02] #08 pc 0000808f /system/framework/arm/boot-com.tcl.tvmanager.oat (offset 0x8000) (com.tcl.tosapi.atv.TFactoryApi.native_addCallBack [DEDUPED]+78)
[2022-08-12 11:10:02] #09 pc 00417f75 /system/lib/libart.so (art_quick_invoke_stub_internal+68)
[2022-08-12 11:10:02] #10 pc 003f150f /system/lib/libart.so (art_quick_invoke_stub+226)
[2022-08-12 11:10:02] #11 pc 000a1031 /system/lib/libart.so (art::ArtMethod::Invoke(art::Thread*, unsigned int*, unsigned int, art::JValue*, char const*)+136)
[2022-08-12 11:10:02] #12 pc 001e8a25 /system/lib/libart.so (art::interpreter::ArtInterpreterToCompiledCodeBridge(art::Thread*, art::ArtMethod*, art::ShadowFrame*, unsigned short, art::JValue*)+232)
[2022-08-12 11:10:02] #13 pc 001e3701 /system/lib/libart.so (bool art::interpreter::DoCall<false, false>(art::ArtMethod*, art::Thread*, art::ShadowFrame&, art::Instruction const*, unsigned short, art::JValue*)+776)
2.3 常见错误码
出现问题时,一帮是signal 11 (SIGSEGV) 和 signal 6 (SIGABRT)
#define SIGHUP 1
#define SIGINT 2
#define SIGQUIT 3
#define SIGILL 4
#define SIGTRAP 5
#define SIGABRT 6
#define SIGIOT 6
#define SIGBUS 7
#define SIGFPE 8
#define SIGKILL 9
#define SIGUSR1 10
#define SIGSEGV 11
#define SIGUSR2 12
#define SIGPIPE 13
#define SIGALRM 14
#define SIGTERM 15
#define SIGSTKFLT 16
#define SIGCHLD 17
#define SIGCONT 18
#define SIGSTOP 19
#define SIGTSTP 20
#define SIGTTIN 21
#define SIGTTOU 22
#define SIGURG 23
#define SIGXCPU 24
#define SIGXFSZ 25
#define SIGVTALRM 26
#define SIGPROF 27
#define SIGWINCH 28
#define SIGIO 29
#define SIGPOLL SIGIO
3 system_server 异常
// 异常log位置
/data/anr/*
/data/tombstones/*
/data/system/dropbox/*
/sys/fs/pstore/*
3.1 watchdog
Android中Watchdog用来看护system_server进程,system_server进程运行着系统最终要的服务,譬如AMS、PKMS、WMS等, 当这些服务不能正常运转时,Watchdog可能会杀掉system_server,让系统重启。Watchdog的实现利用了锁和消息队列机制。当system_server发生死锁或消息队列一直处于忙碌状态时,则认为系统已经没有响应了(System Not Responding)。在分析Watchdog问题的时候,首先要有详尽的日志,其次要能定位出导致Watchdog超时的直接原因,最重要的是能还原出问题发生的场景。watchdog不是引发重启的原因,其它模块获取锁等相关导致系统卡住,才导致引发watchdog.
dropboxTag 含义
dropboxTag | 含义 |
system_server_anr | system进程无响应 |
system_server_watchdog | system进程发生watchdog |
system_server_crash | system进程崩溃 |
system_server_native_crash | system进程native出现崩溃 |
system_server_wtf | system进程发生严重错误 |
system_server_lowmem | system进程内存不足 |
1 watchdog异常打印
常见watchdog产生的场景有死锁,或者binder等待
下面log中看到Blocked在android.fg线程中, android.fg线程等待thread 130释放锁,需要根据thread 130 线程堆栈情况查看在等什么,层层分析,直到找到最后block的原因,有held by thread字眼打印基本上就可以确定有死锁发生
Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
Process: system_server
Subject: Blocked in monitor com.android.server.am.TclActivityManagerService on foreground thread (android.fg), Blocked in handler on main thread (main), Blocked in handler on display thread (android.display)
"android.fg" prio=5 tid=12 Blocked
| group="main" sCount=1 dsCount=0 flags=1 obj=0x14e00568 self=0xe955fe00
| sysTid=718 nice=0 cgrp=default sched=0/0 handle=0xbe4651c0
| state=S schedstat=( 1528549224 5198733872 10016 ) utm=96 stm=56 core=2 HZ=100
| stack=0xbe362000-0xbe364000 stackSize=1040KB
| held mutexes=
at com.android.server.am.ActivityManagerService.monitor(ActivityManagerService.java:19133)
- waiting to lock <0x06f44807> (a com.android.server.am.TclActivityManagerService) held by thread 130
at com.android.server.Watchdog$HandlerChecker.run(Watchdog.java:251)
at android.os.Handler.handleCallback(Handler.java:967)
at android.os.Handler.dispatchMessage(Handler.java:104)
at android.os.Looper.loop(Looper.java:250)
at android.os.HandlerThread.run(HandlerThread.java:67)
at com.android.server.ServiceThread.run(ServiceThread.java:44)
system_server卡在IPCThreadState::talkWithDriver处,卡在binder处等待服务端返回
Process: system_server
Subject: Blocked in handler on animation thread (android.anim)
Build: TCL/t982_ar31a8/t982_ar31a8:11/unknown/AR01:userdebug/release-keys
"android.anim" prio=5 tid=16 Native
| group="main" sCount=1 dsCount=0 flags=1 obj=0x13280800 self=0xdcf4fc10
| sysTid=19110 nice=-10 cgrp=default sched=0/0 handle=0xb7aec1c0
| state=S schedstat=( 61934857681 8470932389 87042 ) utm=3119 stm=3073 core=0 HZ=100
| stack=0xb79e9000-0xb79eb000 stackSize=1040KB
| held mutexes=
native: #00 pc 00070654 /apex/com.android.runtime/lib/bionic/libc.so (__ioctl+12)
native: #01 pc 0003ebef /apex/com.android.runtime/lib/bionic/libc.so (ioctl+26)
native: #02 pc 00039a43 /system/lib/libbinder.so (android::IPCThreadState::talkWithDriver(bool)+238)
native: #03 pc 0003a69b /system/lib/libbinder.so (android::IPCThreadState::waitForResponse(android::Parcel*, int*)+86)
native: #04 pc 0003a43f /system/lib/libbinder.so (android::IPCThreadState::transact(int, unsigned int, android::Parcel const&, android::Parcel*, unsigned int)+122)
native: #05 pc 00035297 /system/lib/libbinder.so (android::BpBinder::transact(unsigned int, android::Parcel const&, android::Parcel*, unsigned int)+98)
native: #06 pc 0006cf15 /system/lib/libgui.so (android::BpGraphicBufferProducer::dequeueBuffer(int*, android::sp<android::Fence>*, unsigned int, unsigned int, int, unsigned long long, unsigned long long*, android::FrameEventHistoryDelta*)+192)
native: #07 pc 0008320d /system/lib/libgui.so (android::Surface::dequeueBuffer(ANativeWindowBuffer**, int*)+316)
native: #08 pc 00085ef5 /system/lib/libgui.so (android::Surface::lock(ANativeWindow_Buffer*, ARect*)+92)
native: #09 pc 000a4a21 /system/lib/libandroid_runtime.so (android::nativeLockCanvas(_JNIEnv*, _jclass*, long long, _jobject*, _jobject*)+352)
at android.view.Surface.nativeLockCanvas(Native method)
at android.view.Surface.lockCanvas(Surface.java:405)
- locked <0x0d37119c> (a java.lang.Object)
at android.view.ViewRootImpl.drawSoftware(ViewRootImpl.java:4327)
at android.view.ViewRootImpl.draw(ViewRootImpl.java:4288)
at android.view.ViewRootImpl.performDraw(ViewRootImpl.java:3979)
at android.view.ViewRootImpl.performTraversals(ViewRootImpl.java:3231)
at android.view.ViewRootImpl.doTraversal(ViewRootImpl.java:2079)
at android.view.ViewRootImpl$TraversalRunnable.run(ViewRootImpl.java:8376)
at android.view.Choreographer$CallbackRecord.run(Choreographer.java:1091)
at android.view.Choreographer.doCallbacks(Choreographer.java:909)
at android.view.Choreographer.doFrame(Choreographer.java:844)
at android.view.Choreographer$FrameDisplayEventReceiver.run(Choreographer.java:1076)
at android.os.Handler.handleCallback(Handler.java:967)
at android.os.Handler.dispatchMessage(Handler.java:104)
at android.os.Looper.loop(Looper.java:250)
at android.os.HandlerThread.run(HandlerThread.java:67)
at com.android.server.ServiceThread.run(ServiceThread.java:44)
3.2 app异常
E AndroidRuntime: FATAL EXCEPTION: main
E AndroidRuntime: Process: com.tcl.tv, PID: 1195
E AndroidRuntime: java.lang.IllegalStateException: SharedPreferences in credential encrypted storage are not available until after user is unlocked
E AndroidRuntime: at android.app.ContextImpl.getSharedPreferences(ContextImpl.java:494)
E AndroidRuntime: at android.app.ContextImpl.getSharedPreferences(ContextImpl.java:479)
E AndroidRuntime: at android.content.ContextWrapper.getSharedPreferences(ContextWrapper.java:188)
E AndroidRuntime: at android.preference.PreferenceManager.getDefaultSharedPreferences(PreferenceManager.java:545)
E AndroidRuntime: at com.tcl.tv.SharedData.<init>(SharedData.java:12)
E AndroidRuntime: at com.tcl.tv.SharedData.getInst(SharedData.java:17)
E AndroidRuntime: at com.tcl.tv.recommendchannel.RecommendChannelManager.<init>(RecommendChannelManager.java:65)
E AndroidRuntime: at com.tcl.tv.recommendchannel.RecommendChannelManager.getInstance(RecommendChannelManager.java:75)
E AndroidRuntime: at com.tcl.tv.TVRootApp$CommonHandler.handleMessage(TVRootApp.java:565)
E AndroidRuntime: at android.os.Handler.dispatchMessage(Handler.java:111)
E AndroidRuntime: at android.os.Looper.loop(Looper.java:250)
E AndroidRuntime: at android.app.ActivityThread.main(ActivityThread.java:7848)
E AndroidRuntime: at java.lang.reflect.Method.invoke(Native Method)
E AndroidRuntime: at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:609)
E AndroidRuntime: at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:1078)
java.lang.OutOfMemoryError: Failed to allocate a 1583242928 byte allocation with 1556480 free bytes and 255MB until OOM, target footprint 2169528, growth limit 268435456
at java.util.Arrays.copyOf(Arrays.java:3257)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
at java.lang.StringBuilder.append(StringBuilder.java:137)
at com.tcl.tvservice.proxy.TvChannelProxy.updateChannelLists(Unknown Source:12)
at com.tcl.tvservice.proxy.TvChannelProxy.updateChannelInfoInner(Unknown Source:3)
at com.tcl.tvservice.proxy.TvChannelProxy.access$100(Unknown Source:0)
at com.tcl.tvservice.proxy.TvChannelProxy$ChannelProxyHander.handleMessage(Unknown Source:23)
at android.os.Handler.dispatchMessage(Handler.java:111)
at android.os.Looper.loop(Looper.java:250)
at android.os.HandlerThread.run(HandlerThread.java:67)
3.3 Bugreport
bugreport也是收集android系统异常信息的小工具,常见用法如下:
bugreport bugreportz
adb bugreport > Bugreport.txt
bugreport主要收集系统如下信息:
1 系统build以及运行时长等相关信息;
2 内存/CPU/进程等信息;
3 kernel log;
4 lsof、map及Wait-Channels;
5 system log;
6 event log;
7 radio log;
8 vm traces:
9 network相关信息;
10 last kernel log;
11 last system log;
12 ip相关信息;
13 中断向量表
14 property以及fs等信息
15 last radio log;
16 Binder相关信息;
17 dumpsys all:
18 dumpsys checkin相关:
19 dumpsys batterystats电池统计;
20 dumpsys meminfo内存
21 dumpsys netstats网络统计;
22 dumpsys procstats进程统计;
23 dumpsys usagestats使用情况;
24 dumpsys package.
25 dumpsys app相关
3.4 ChkBugReport
ChkBugReport是索尼开发一个jar包用来解析bugreport的工具
https://github.com/sonyxperiadev/ChkBugReport
git clone git://github.com/sonyxperiadev/ChkBugReport.git
3.5 sysrq-trigger
// 导出内存分配信息
echo m > /proc/sysrq-trigger
// 导出当前任务状态信息
echo t > /proc/sysrq-trigger
// 导出当前CPU寄存器和标志位信息
echo p > /proc/sysrq-trigger
3.6 内存debug命令
1 free
2 cat /proc/meminfo
3 procrank
4 dumpsys meminfo
5 cat /proc/slabinfo
6 cat /proc/pagetrace
7 echo m > /proc/sysrq-trigger
8 cat /proc/buddyinfo
9 cat /proc/zoneinfo
4 kernel 内存异常检测工具
4.1 Kasan
CONFIG_KASAN=y
CONFIG_KASAN_STACK_ENABLE=y
4.2 slub_debug
CONFIG_SLUB=y
CONFIG_SLUB_DEBUG=y
CONFIG_SLUB_DEBUG_ON=y
CONFIG_SLUB_STATS=y
slabinfo.c 位于tools\vm路径下,通过aarch64-linux-gnu-gcc -o slabinfo slabinfo.c
cat /proc/slabinfo
echo 1 > /sys/kernel/slab/<leaking_slab>/trace
slabinfo -v
4.3 kmemleak
CONFIG_DEBUG_KMEMLEAK_DEFAULT_OFF=y则不需要在命令行添加kmemleak=on
CONFIG_HAVE_DEBUG_KMEMLEAK=y
CONFIG_DEBUG_KMEMLEAK=y
CONFIG_DEBUG_KMEMLEAK_EARLY_LOG_SIZE=4096
CONFIG_DEBUG_KMEMLEAK_DEFAULT_OFF=y
打开kmemleak扫描功能:
echo scan > sys/kernel/debug/kmemleak
cat /sys/kernel/debug/kmemleak