1. Watchdog 简介
Android 为了保证系统的高可用性,设计了Watchdog用以监视系统的一些关键服务的运行状况,如果关键服务出现了死锁,将重启SystemServer;另外,接收系统内部reboot请求,重启系统。
总结一下:Watchdog就如下两个主要功能:
- 接收系统内部reboot请求,重启系统;
- 监控系统关键服务,如果关键服务出现了死锁,将重启SystemServer。
被监控的关键服务,这些服务必须实现Watchdog.Monitor接口:
ActivityManagerService
InputManagerService
MountService
NativeDaemonConnector
NetworkManagementService
PowerManagerService
WindowManagerService
MediaRouterService
MediaProjectionManagerService
2. Watchdog 详解
一张图理解 Watchdog
![一张图理解 Watchdog]()
Watchdog 是在SystemServer启动的时候 调用 startOtherServices 启动的。 Watchdog 初始化了一个单例的对象并且继承自 Thread,因此,Watchdog实际是跑在 SystemServer 进程中的。
启动之后,watchdog的run进程会每30s检查一次监控服务是否发生死锁。检查死锁通过hc.scheduleCheckLocked(),然后调用各个被监控对象的monitor()来验证。下面我们以 ActivityManagerService 为例。
/** In this method we try to acquire our lock to make sure that we have not deadlocked */
public void monitor() {
synchronized (this) { }
}
由于我们关键部分都用了synchronized (this) 这个锁来进行锁定,如果我们在monitor()的时候两次每隔30s的检查都未能获取到相应的锁,就表示这个进程死锁,如果死锁将杀死SystemServer进程(Watchdog跑在SystemServer进程中,因此Process.killProcess(Process.myPid()) 这里的myPid()就是SystemServer对应的PID)。
SystemServer 进程被杀死之后, Zygote 也会死掉(com_android_internal_os_Zygote.cpp 中通过 signal 机制 收到 SIGCHLD 就杀掉Zygote进程),最后init进程(init.rc中配置了onrestart,则就会有SVC_RESTARTING标签,init.cpp执行到restart_processes())检测到zygote死掉()会重新启动Zygote 和 SystemServer。
下面,我们结合代码来详细看下这个流程:
@Override
public void run() {
boolean waitedHalf = false;
while (true) {
final ArrayList<HandlerChecker> blockedCheckers;
final String subject;
final boolean allowRestart;
int debuggerWasConnected = 0;
synchronized (this) {
long timeout = CHECK_INTERVAL;
// Make sure we (re)spin the checkers that have become idle within
// this wait-and-check interval
for (int i=0; i<mHandlerCheckers.size(); i++) {
HandlerChecker hc = mHandlerCheckers.get(i);
// 1. 对每个关注的服务进行监控
hc.scheduleCheckLocked();
}
if (debuggerWasConnected > 0) {
debuggerWasConnected--;
}
// NOTE: We use uptimeMillis() here because we do not want to increment the time we
// wait while asleep. If the device is asleep then the thing that we are waiting
// to timeout on is asleep as well and won't have a chance to run, causing a false
// positive on when to kill things.
long start = SystemClock.uptimeMillis();
while (timeout > 0) {
if (Debug.isDebuggerConnected()) {
debuggerWasConnected = 2;
}
try {
// 2. 等待timeout时间,默认30s
wait(timeout);
} catch (InterruptedException e) {
Log.wtf(TAG, e);
}
if (Debug.isDebuggerConnected()) {
debuggerWasConnected = 2;
}
timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
}
// 3. 获取监控之后的waitState状态,如果状态为COMPLETED、WAITING、WAITED_HALF,就结束本次循环,继续执行后面的循环;如果是OVERDUE状态,则执行OVERDUE相关逻辑,打印log、结束进程。
final int waitState = evaluateCheckerCompletionLocked();
if (waitState == COMPLETED) {
// The monitors have returned; reset
waitedHalf = false;
continue;
} else if (waitState == WAITING) {
// still waiting but within their configured intervals; back off and recheck
continue;
} else if (waitState == WAITED_HALF) {
if (!waitedHalf) {
// We've waited half the deadlock-detection interval. Pull a stack
// trace and wait another half.
ArrayList<Integer> pids = new ArrayList<Integer>();
pids.add(Process.myPid());
ActivityManagerService.dumpStackTraces(true, pids, null, null,
NATIVE_STACKS_OF_INTEREST);
waitedHalf = true;
}
continue;
}
// 4. OVERDUE状态,则执行OVERDUE相关逻辑,打印log、结束进程。
// something is overdue!
blockedCheckers = getBlockedCheckersLocked();
subject = describeCheckersLocked(blockedCheckers);
allowRestart = mAllowRestart;
}
// If we got here, that means that the system is most likely hung.
// First collect stack traces from all threads of the system process.
// Then kill this process so that the system will restart.
EventLog.writeEvent(EventLogTags.WATCHDOG, subject);
ArrayList<Integer> pids = new ArrayList<Integer>();
pids.add(Process.myPid());
if (mPhonePid > 0) pids.add(mPhonePid);
// 5. dump AMS 堆栈信息
// Pass !waitedHalf so that just in case we somehow wind up here without having
// dumped the halfway stacks, we properly re-initialize the trace file.
final File stack = ActivityManagerService.dumpStackTraces(
!waitedHalf, pids, null, null, NATIVE_STACKS_OF_INTEREST);
// Give some extra time to make sure the stack traces get written.
// The system's been hanging for a minute, another second or two won't hurt much.
SystemClock.sleep(2000);
// 6. dump kernel 堆栈信息
// Pull our own kernel thread stacks as well if we're configured for that
if (RECORD_KERNEL_THREADS) {
dumpKernelStackTraces();
}
// 7. 触发 kernel dump 所有阻塞的线程信息 和 所有CPU的backtraces放到 kernel 的 log 中
// Trigger the kernel to dump all blocked threads, and backtraces on all CPUs to the kernel log
doSysRq('w');
doSysRq('l');
// 8. 尝试把错误信息放大dropbox里面,这个假设AMS还活着,如果AMS死锁了,那watchdog也死锁了
// Try to add the error to the dropbox, but assuming that the ActivityManager
// itself may be deadlocked. (which has happened, causing this statement to
// deadlock and the watchdog as a whole to be ineffective)
Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
public void run() {
mActivity.addErrorToDropBox(
"watchdog", null, "system_server", null, null,
subject, null, stack, null);
}
};
dropboxThread.start();
try {
dropboxThread.join(2000); // wait up to 2 seconds for it to return.
} catch (InterruptedException ignored) {}
// 9. ActivityController 检查 systemNotResponding(subject) 的处理方式,1 = keep waiting, -1 = kill system
IActivityController controller;
synchronized (this) {
controller = mController;
}
if (controller != null) {
Slog.i(TAG, "Reporting stuck state to activity controller");
try {
Binder.setDumpDisabled("Service dumps disabled due to hung system process.");
// 1 = keep waiting, -1 = kill system
int res = controller.systemNotResponding(subject);
if (res >= 0) {
Slog.i(TAG, "Activity controller requested to coninue to wait");
waitedHalf = false;
continue;
}
} catch (RemoteException e) {
}
}
// Only kill the process if the debugger is not attached.
if (Debug.isDebuggerConnected()) {
debuggerWasConnected = 2;
}
if (debuggerWasConnected >= 2) {
Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
} else if (debuggerWasConnected > 0) {
Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
} else if (!allowRestart) {
Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
} else {
// 10. 打印堆栈信息
Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
for (int i=0; i<blockedCheckers.size(); i++) {
Slog.w(TAG, blockedCheckers.get(i).getName() + " stack trace:");
StackTraceElement[] stackTrace
= blockedCheckers.get(i).getThread().getStackTrace();
for (StackTraceElement element: stackTrace) {
Slog.w(TAG, " at " + element);
}
}
Slog.w(TAG, "*** GOODBYE!");
// 11. 杀死进程
Process.killProcess(Process.myPid());
System.exit(10);
}
waitedHalf = false;
}
}
Leslie Guan