APP ANR Research

Scenarios

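  • ANR information captured during development or on a local test device

    • trace files
    • bugreport files
  • ANR information collected from users of the released app

    • Integrate a framework such as Matrix, BlockCanary, ANR-WatchDog, or xCrash, and have it report to a collection server
    • Implement it yourself with FileObserver plus a watchdog mechanism. This is constrained by Android's SELinux policy: since 5.0, low-privilege apps can essentially no longer observe the trace files (privilege order: system app > platform-signed app > third-party app)
      Several popular implementations were compared; for their pros and cons see the table (taken from the Tencent Cloud+ Community)
      [Image: 性能监控框架.jpg (comparison table of performance monitoring frameworks)]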

https://github.com/Tencent/Matrix
https://github.com/markzhai/AndroidPerformanceMonitor
https://github.com/SalomonBrys/ANR-WatchDog
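
The watchdog idea behind tools of this kind can be sketched outside Android. The following is a minimal illustration, not the implementation of any of the frameworks listed above: `MainLoop`, `simulate_slow_work`, and `main_loop_blocked` are hypothetical names, and a plain Python thread with a queue stands in for the app's main thread and its message loop. The watchdog posts a small "tick" task and treats the loop as blocked if the tick has not run within the timeout.

```python
import queue
import threading
import time


class MainLoop:
    """Hypothetical stand-in for an app's main-thread message loop."""

    def __init__(self):
        self._tasks = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def post(self, task):
        """Queue a task; tasks run one at a time in FIFO order."""
        self._tasks.put(task)

    def _run(self):
        while True:
            task = self._tasks.get()
            task()  # a slow task here delays every task queued behind it


def simulate_slow_work(seconds):
    """Hypothetical blocking work, e.g. synchronous file I/O on the main thread."""
    time.sleep(seconds)


def main_loop_blocked(loop, timeout_s):
    """Post a tick to the loop; return True if it did not run within timeout_s."""
    ticked = threading.Event()
    loop.post(ticked.set)
    return not ticked.wait(timeout_s)
```

Calling `main_loop_blocked(loop, 0.2)` right after `loop.post(lambda: simulate_slow_work(1.0))` reports the loop as blocked, because the tick is stuck behind the slow task. A real monitor would run this check periodically from its own thread and dump the main thread's stack when it fires.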

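  • DropBox
    • Overview:
      To keep historical logcat logs, Android introduced DropBox, which persists them to disk (/data/system/dropbox). The logcat buffer is limited in size and is reused cyclically, so older log entries get overwritten. In some automated-testing scenarios, for example long Monkey runs, the full log history needs to be kept.
      In code, DropBoxManager is used to record error logs that need relatively persistent storage, so in theory this is also a viable approach. I have not used it; to be verified.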

traces files

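  • Overview
    When an ANR occurs, the system writes the relevant stack information to a traces file. On releases that use a single traces.txt, only the traces from the most recent ANR are kept.
    Older Android releases record it in /data/anr/traces.txt.
    Newer Android releases record it in per-ANR files such as /data/anr/anr_* (this change is generally dated to around Android 8.0 rather than 5.0).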

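  • How to obtain

    • Via adb: adb pull /data/anr (requires root)
    • Without root, run adb bugreport; the resulting bugreport archive also contains the ANR trace files and the tombstone files (Developer options must be enabled)
  • Reading the log

    • If the trace file is valid:
      It provides a lot of useful information. For example, the call stack of the ANR is printed directly. If that call stack is not enough to locate the problem, use the process, thread, and time range of the ANR as leads and check the log from that period to verify the cause.
    • If the trace file is invalid (for example, overwritten by a newer ANR):
      No useful information can be obtained from it, because the dumped trace no longer reflects the ANR scene. The only options are to analyze the log instead, or to wait for the issue to reproduce and capture a valid trace file.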
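
Whether a trace file is still the one for the ANR under investigation can be checked mechanically from its header. This is a minimal sketch, assuming the header looks like `----- pid <pid> at <date> <time> -----` followed by a `Cmd line:` line, as in the dumps in this article. The function names and the 30-second tolerance are illustrative choices, not part of any Android tool.

```python
import re
from datetime import datetime, timedelta

HEADER_RE = re.compile(r"^----- pid (\d+) at (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")


def parse_trace_header(text):
    """Return (pid, dump_time, process_name) for the first dump in a traces file."""
    pid = when = None
    for line in text.splitlines():
        m = HEADER_RE.match(line)
        if m:
            pid = int(m.group(1))
            when = datetime.strptime(m.group(2), "%Y-%m-%d %H:%M:%S")
        elif pid is not None and line.startswith("Cmd line:"):
            return pid, when, line.split(":", 1)[1].strip()
    return None


def trace_matches_anr(text, anr_pid, anr_time, tolerance_s=30):
    """True if the dump belongs to anr_pid and was taken close to anr_time."""
    header = parse_trace_header(text)
    if header is None:
        return False
    pid, when, _ = header
    return pid == anr_pid and abs(when - anr_time) <= timedelta(seconds=tolerance_s)
```

The pid and time to compare against come from the `ANR in ...` and `PID:` lines that ActivityManager writes to logcat. If the check fails, the trace has probably been overwritten and should not be trusted.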

Example:

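----- pid 16271 at 2020-07-22 14:39:25 -----   // pid and time of the ANR process; they must match the log file, otherwise this is not the target trace stack and it may have been overwritten
Cmd line: com.example.myexoplayervideodemo     // name of the process in which the ANR occurred
Build fingerprint: 'Xiaomi/cepheus/cepheus:10/QKQ1.190825.002/V11.0.9.0.QFACNXM:user/release-keys'   // what follows is system version, memory, and other state information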
ABI: 'arm64'
Build type: optimized
Zygote loaded classes=8182 post zygote classes=562
Dumping registered class loaders
#0 dalvik.system.PathClassLoader: [], parent #1
#1 java.lang.BootClassLoader: [], no parent
#2 dalvik.system.PathClassLoader: [/system/framework/tcmclient.jar], parent #0
#3 dalvik.system.PathClassLoader: [], parent #0
#4 dalvik.system.PathClassLoader: [/data/app/com.example.myexoplayervideodemo-qBeUzwMwbXCPAFlijKmhvA==/base.apk:/data/app/com.example.myexoplayervideodemo-qBeUzwMwbXCPAFlijKmhvA==/base.apk!classes2.dex], parent #1
#5 dalvik.system.PathClassLoader: [/system/app/MiuiContentCatcher/MiuiContentCatcher.apk], parent #1
#6 dalvik.system.PathClassLoader: [/system/app/CatcherPatch/CatcherPatch.apk], parent #1
...

suspend all histogram:  Sum: 50.298ms 99% C.I. 1.311us-1228.373us Avg: 44.868us Max: 14539us
DALVIK THREADS (14):
"Signal Catcher" daemon prio=5 tid=7 Runnable
  | group="system" sCount=0 dsCount=0 flags=0 obj=0x13900c38 self=0x7434399400
  | sysTid=16308 nice=0 cgrp=default sched=0/0 handle=0x7440d1dd50
  | state=R schedstat=( 5896410 661457 17 ) utm=0 stm=0 core=6 HZ=100
  | stack=0x7440c27000-0x7440c29000 stackSize=991KB
  | held mutexes= "mutator lock"(shared held)
  native: #00 pc 000000000041098c  /apex/com.android.runtime/lib64/libart.so (art::DumpNativeStack(std::__1::basic_ostream<char, std::__1::char_traits<char>>&, int, BacktraceMap*, char const*, art::ArtMethod*, void*, bool)+140)
  native: #01 pc 00000000004f80e4  /apex/com.android.runtime/lib64/libart.so (art::Thread::DumpStack(std::__1::basic_ostream<char, std::__1::char_traits<char>>&, bool, BacktraceMap*, bool) const+512)
  native: #02 pc 0000000000512a20  /apex/com.android.runtime/lib64/libart.so (art::DumpCheckpoint::Run(art::Thread*)+828)
  native: #03 pc 000000000050b844  /apex/com.android.runtime/lib64/libart.so (art::ThreadList::RunCheckpoint(art::Closure*, art::Closure*)+456)
  native: #04 pc 000000000050ad28  /apex/com.android.runtime/lib64/libart.so (art::ThreadList::Dump(std::__1::basic_ostream<char, std::__1::char_traits<char>>&, bool)+1964)
  native: #05 pc 000000000050a408  /apex/com.android.runtime/lib64/libart.so (art::ThreadList::DumpForSigQuit(std::__1::basic_ostream<char, std::__1::char_traits<char>>&)+844)
  native: #06 pc 00000000004c581c  /apex/com.android.runtime/lib64/libart.so (art::Runtime::DumpForSigQuit(std::__1::basic_ostream<char, std::__1::char_traits<char>>&)+200)
  native: #07 pc 00000000004d9c54  /apex/com.android.runtime/lib64/libart.so (art::SignalCatcher::HandleSigQuit()+1352)
  native: #08 pc 00000000004d8d00  /apex/com.android.runtime/lib64/libart.so (art::SignalCatcher::Run(void*)+252)
  native: #09 pc 00000000000d6cb0  /apex/com.android.runtime/lib64/bionic/libc.so (__pthread_start(void*)+36)
  native: #10 pc 0000000000074eac  /apex/com.android.runtime/lib64/bionic/libc.so (__start_thread+64)
  (no managed stack frames)

"main" prio=5 tid=1 Native   // thread name, priority, thread id, thread state
  | group="main" sCount=1 dsCount=0 flags=1 obj=0x75b073c8 self=0x744c869c00  // thread group name, suspend count, address of the thread object, etc.
  | sysTid=16271 nice=-10 cgrp=default sched=0/0 handle=0x74d2a44ed0     // sysTid is the Linux kernel thread id
  | state=S schedstat=( 3322210770 125918324 2742 ) utm=89 stm=243 core=4 HZ=100
  | stack=0x7ffbb28000-0x7ffbb2a000 stackSize=8192KB
  | held mutexes=
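  kernel: (couldn't read /proc/self/task/16271/stack)    // everything below is the call stack; the first few lines are native frames, which can be resolved to source lines using the method described in the tombstone article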
  native: #00 pc 000000000007067c  /apex/com.android.runtime/lib64/bionic/libc.so (syscall+28)
  native: #01 pc 000000000014b1f4  /apex/com.android.runtime/lib64/libart.so (art::ConditionVariable::WaitHoldingLocks(art::Thread*)+148)
  native: #02 pc 000000000058f958  /apex/com.android.runtime/lib64/libart.so (_ZN3artL12GoToRunnableEPNS_6ThreadE.llvm.1098154004112041116+440)
  native: #03 pc 000000000058f75c  /apex/com.android.runtime/lib64/libart.so (art::JniMethodEnd(unsigned int, art::Thread*)+28)
  at libcore.io.Linux.fdatasync(Native method)
  at libcore.io.ForwardingOs.fdatasync(ForwardingOs.java:107)
  at libcore.io.BlockGuardOs.fdatasync(BlockGuardOs.java:167)
  at libcore.io.ForwardingOs.fdatasync(ForwardingOs.java:107)
  at android.system.Os.fdatasync(Os.java:168)
  at java.io.RandomAccessFile.maybeSync(RandomAccessFile.java:306)
  at java.io.RandomAccessFile.writeBytes(RandomAccessFile.java:547)
  at java.io.RandomAccessFile.write(RandomAccessFile.java:559)
  at com.example.myexoplayervideodemo.anrdemo.AnrTestActivity.writeTxtToFile(AnrTestActivity.java:99)
  at com.example.myexoplayervideodemo.anrdemo.AnrTestActivity.access$100(AnrTestActivity.java:24)
  at com.example.myexoplayervideodemo.anrdemo.AnrTestActivity$3.onNext(AnrTestActivity.java:66)
  at com.example.myexoplayervideodemo.anrdemo.AnrTestActivity$3.onNext(AnrTestActivity.java:50)
  at io.reactivex.internal.operators.observable.ObservableFlatMap$MergeObserver.tryEmitScalar(ObservableFlatMap.java:234)
  at io.reactivex.internal.operators.observable.ObservableFlatMap$MergeObserver.subscribeInner(ObservableFlatMap.java:146)
  at io.reactivex.internal.operators.observable.ObservableFlatMap$MergeObserver.onNext(ObservableFlatMap.java:139)
  at io.reactivex.internal.operators.observable.ObservableBuffer$BufferExactObserver.onNext(ObservableBuffer.java:113)
  at io.reactivex.internal.operators.observable.ObservableFlatMap$MergeObserver.tryEmit(ObservableFlatMap.java:262)
  at io.reactivex.internal.operators.observable.ObservableFlatMap$InnerObserver.onNext(ObservableFlatMap.java:554)
  at io.reactivex.observers.SerializedObserver.onNext(SerializedObserver.java:111)
  at io.reactivex.internal.operators.observable.ObservableConcatMap$SourceObserver$InnerObserver.onNext(ObservableConcatMap.java:249)
  at io.reactivex.internal.operators.observable.ObservableScalarXMap$ScalarDisposable.run(ObservableScalarXMap.java:246)
  at io.reactivex.internal.operators.observable.ObservableJust.subscribeActual(ObservableJust.java:35)
  at io.reactivex.Observable.subscribe(Observable.java:10685)
  at io.reactivex.internal.operators.observable.ObservableConcatMap$SourceObserver.drain(ObservableConcatMap.java:223)
  at io.reactivex.internal.operators.observable.ObservableConcatMap$SourceObserver.onSubscribe(ObservableConcatMap.java:103)
  at io.reactivex.internal.operators.observable.ObservableFromIterable.subscribeActual(ObservableFromIterable.java:54)
  at io.reactivex.Observable.subscribe(Observable.java:10685)
  at io.reactivex.internal.operators.observable.ObservableConcatMap.subscribeActual(ObservableConcatMap.java:52)
  at io.reactivex.Observable.subscribe(Observable.java:10685)
  at io.reactivex.internal.operators.observable.ObservableFlatMap$MergeObserver.subscribeInner(ObservableFlatMap.java:162)
  at io.reactivex.internal.operators.observable.ObservableFlatMap$MergeObserver.onNext(ObservableFlatMap.java:139)
  at io.reactivex.internal.operators.observable.ObservableFlatMap$MergeObserver.tryEmitScalar(ObservableFlatMap.java:234)
  at io.reactivex.internal.operators.observable.ObservableFlatMap$MergeObserver.subscribeInner(ObservableFlatMap.java:146)
  at io.reactivex.internal.operators.observable.ObservableFlatMap$MergeObserver.onNext(ObservableFlatMap.java:139)
  at io.reactivex.internal.operators.observable.ObservableFromArray$FromArrayDisposable.run(ObservableFromArray.java:105)
  at io.reactivex.internal.operators.observable.ObservableFromArray.subscribeActual(ObservableFromArray.java:35)
  at io.reactivex.Observable.subscribe(Observable.java:10685)
  at io.reactivex.internal.operators.observable.ObservableFlatMap.subscribeActual(ObservableFlatMap.java:55)
  at io.reactivex.Observable.subscribe(Observable.java:10685)
  at io.reactivex.internal.operators.observable.ObservableFlatMap.subscribeActual(ObservableFlatMap.java:55)
  at io.reactivex.Observable.subscribe(Observable.java:10685)
  at io.reactivex.internal.operators.observable.ObservableBuffer.subscribeActual(ObservableBuffer.java:44)
  at io.reactivex.Observable.subscribe(Observable.java:10685)
  at io.reactivex.internal.operators.observable.ObservableFlatMap.subscribeActual(ObservableFlatMap.java:55)
  at io.reactivex.Observable.subscribe(Observable.java:10685)
  at com.example.myexoplayervideodemo.anrdemo.AnrTestActivity.createIoAnr(AnrTestActivity.java:50)
  at com.example.myexoplayervideodemo.anrdemo.AnrTestActivity.access$000(AnrTestActivity.java:24)
  at com.example.myexoplayervideodemo.anrdemo.AnrTestActivity$1.onClick(AnrTestActivity.java:34)
  at android.view.View.performClick(View.java:7163)
  at android.view.View.performClickInternal(View.java:7140)
  at android.view.View.access$3500(View.java:813)
  at android.view.View$PerformClick.run(View.java:27428)
  at android.os.Handler.handleCallback(Handler.java:883)
  at android.os.Handler.dispatchMessage(Handler.java:100)
  at android.os.Looper.loop(Looper.java:224)
  at android.app.ActivityThread.main(ActivityThread.java:7520)
  at java.lang.reflect.Method.invoke(Native method)
  at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:539)
  at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:950)

bugreport

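  • Overview
    bugreport is the log collection mechanism built into Android. It carries a very large amount of information covering almost every layer of the system, which makes it a powerful tool for analyzing bugs.

  • How to obtain
    adb bugreport
    This produces a compressed archive containing many files. The most important one is bugreport.txt, which holds a huge amount of log data. It is fairly comprehensive, so it is also worth a look when an investigation has no leads.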

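  • Reading the log

    • anr log:
      When an ANR occurs, export and open bugreport.txt and search for the keyword "VM TRACES AT LAST ANR". This usually lands at a position like the one below. Check that the process name and time match your process; if they do, the trace of this ANR follows directly underneath.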
------ VM TRACES AT LAST ANR (/data/anr/anr_2020-07-22-14-39-25-303: 2020-07-22 14:39:36) ------

----- pid 16271 at 2020-07-22 14:39:25 -----
Cmd line: com.example.myexoplayervideodemo
Build fingerprint: 'Xiaomi/cepheus/cepheus:10/QKQ1.190825.002/V11.0.9.0.QFACNXM:user/release-keys'
ABI: 'arm64'
Build type: optimized
Zygote loaded classes=8182 post zygote classes=562
Dumping registered class loaders
#0 dalvik.system.PathClassLoader: [], parent #1
#1 java.lang.BootClassLoader: [], no parent
#2 dalvik.system.PathClassLoader: [/system/framework/tcmclient.jar], parent #0

...

"main" prio=5 tid=1 Native
  | group="main" sCount=1 dsCount=0 flags=1 obj=0x75b073c8 self=0x744c869c00
  | sysTid=16271 nice=-10 cgrp=default sched=0/0 handle=0x74d2a44ed0
  | state=S schedstat=( 3322210770 125918324 2742 ) utm=89 stm=243 core=4 HZ=100
  | stack=0x7ffbb28000-0x7ffbb2a000 stackSize=8192KB
  | held mutexes=
  kernel: (couldn't read /proc/self/task/16271/stack)
  native: #00 pc 000000000007067c  /apex/com.android.runtime/lib64/bionic/libc.so (syscall+28)
  native: #01 pc 000000000014b1f4  /apex/com.android.runtime/lib64/libart.so (art::ConditionVariable::WaitHoldingLocks(art::Thread*)+148)
  native: #02 pc 000000000058f958  /apex/com.android.runtime/lib64/libart.so (_ZN3artL12GoToRunnableEPNS_6ThreadE.llvm.1098154004112041116+440)
  native: #03 pc 000000000058f75c  /apex/com.android.runtime/lib64/libart.so (art::JniMethodEnd(unsigned int, art::Thread*)+28)
  at libcore.io.Linux.fdatasync(Native method)
  at libcore.io.ForwardingOs.fdatasync(ForwardingOs.java:107)
  at libcore.io.BlockGuardOs.fdatasync(BlockGuardOs.java:167)
  at libcore.io.ForwardingOs.fdatasync(ForwardingOs.java:107)
  at android.system.Os.fdatasync(Os.java:168)
  at java.io.RandomAccessFile.maybeSync(RandomAccessFile.java:306)
  at java.io.RandomAccessFile.writeBytes(RandomAccessFile.java:547)
  at java.io.RandomAccessFile.write(RandomAccessFile.java:559)
  at com.example.myexoplayervideodemo.anrdemo.AnrTestActivity.writeTxtToFile(AnrTestActivity.java:99)
  at com.example.myexoplayervideodemo.anrdemo.AnrTestActivity.access$100(AnrTestActivity.java:24)
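
Searching bugreport.txt by hand can be replaced with a small script. The sketch below assumes that bugreport sections start with a line of the form `------ TITLE (...) ------` (six dashes), as in the excerpt above, and that a section ends at the next line starting with six dashes. `last_anr_section` and `dump_pids` are hypothetical helper names.

```python
import re


def last_anr_section(bugreport_text, title="VM TRACES AT LAST ANR"):
    """Return (header_line, body_lines) of the named section, or None if absent."""
    header = None
    body = []
    for line in bugreport_text.splitlines():
        if header is None:
            if line.startswith("------ " + title):
                header = line
        elif line.startswith("------ "):
            # Six dashes mark the next section; per-process dump headers
            # ("----- pid ...") use only five and stay inside the body.
            break
        else:
            body.append(line)
    return (header, body) if header else None


def dump_pids(body_lines):
    """List the pids of all process dumps found in a section body."""
    matches = (re.match(r"----- pid (\d+) at ", line) for line in body_lines)
    return [int(m.group(1)) for m in matches if m]
```

If the pid you are investigating is not in `dump_pids(body)`, the section describes a different ANR and the standalone trace files are the next place to look.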

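This ANR is clear and easy to read: it is an I/O problem. The next step is to look at CPU usage and I/O around the time of the ANR.
Check CPU usage around 14:39:25. If there is no CPU log, filter the log around the ANR time for leads.
(This I/O stall was simulated in a demo, so the call stack already points at the offending line of code; it is used here only to illustrate how to investigate this kind of problem.)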

  • cpu log:
    The code in AMS that writes the CPU-related log:
EventLog.writeEvent(EventLogTags.CPU,
    ((user+system+iowait+irq+softIrq) * 100) / total,
    (user * 100) / total,
    (system * 100) / total,
    (iowait * 100) / total,
    (irq * 100) / total,
    (softIrq * 100) / total);

5.4% TOTAL: 2.6% user + 2.1% kernel + 0.1% iowait + 0.3% irq + 0.1% softirq
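
The arithmetic in that EventLog call can be reproduced in a few lines. This is a sketch, assuming that `total` is the sum of the five busy counters plus idle time over the sampling window (the snippet above does not show how `total` is computed) and that all inputs are non-negative integer time deltas. `cpu_percentages` is a hypothetical name.

```python
def cpu_percentages(user, system, iowait, irq, soft_irq, idle):
    """Return (total, user, system, iowait, irq, softirq) as whole percents.

    Uses integer division, as the integer expressions in the Java call do.
    Assumption: the denominator is busy time plus idle time.
    """
    total = user + system + iowait + irq + soft_irq + idle
    if total <= 0:
        return None  # nothing was sampled, so no percentage can be computed
    busy = user + system + iowait + irq + soft_irq
    return tuple(
        (value * 100) // total
        for value in (busy, user, system, iowait, irq, soft_irq)
    )
```

With the made-up deltas `cpu_percentages(260, 210, 10, 30, 10, 9480)` the result is `(5, 2, 2, 0, 0, 0)`: small shares such as iowait truncate to 0 in whole-percent values.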

07-22 14:39:36.217  1000  1607  1770 I system_server: libdebuggerd_client: done dumping process 29355
07-22 14:39:36.223  1000  1607  1770 I ActivityManager: Done dumping
07-22 14:39:36.257  1000  1607  1770 E ActivityManager: ANR in com.example.myexoplayervideodemo (com.example.myexoplayervideodemo/.anrdemo.AnrTestActivity)
07-22 14:39:36.257  1000  1607  1770 E ActivityManager: PID: 16271
07-22 14:39:36.257  1000  1607  1770 E ActivityManager: Reason: Input dispatching timed out (com.example.myexoplayervideodemo/com.example.myexoplayervideodemo.anrdemo.AnrTestActivity, Waiting to send non-key event because the touched window has not finished processing certain input events that were delivered to it over 500.0ms ago.  Wait queue length: 19.  Wait queue head age: 9042.8ms.)
07-22 14:39:36.257  1000  1607  1770 E ActivityManager: Parent: com.example.myexoplayervideodemo/.anrdemo.AnrTestActivity
07-22 14:39:36.257  1000  1607  1770 E ActivityManager: Load: 0.0 / 0.0 / 0.0
07-22 14:39:36.257  1000  1607  1770 E ActivityManager: CPU usage from 0ms to 11562ms later (2020-07-22 14:39:24.661 to 2020-07-22 14:39:36.223):
07-22 14:39:36.257  1000  1607  1770 E ActivityManager:   1.6% 1198/media.codec: 1.1% user + 0.5% kernel / faults: 82688 minor 74 major
07-22 14:39:36.257  1000  1607  1770 E ActivityManager:   28% 16271/com.example.myexoplayervideodemo: 2.4% user + 25% kernel / faults: 3490 minor 22 major
07-22 14:39:36.257  1000  1607  1770 E ActivityManager:   9.1% 1607/system_server: 4.1% user + 5% kernel / faults: 7695 minor 45 major
07-22 14:39:36.257  1000  1607  1770 E ActivityManager:   7.7% 873/surfaceflinger: 0.4% user + 7.3% kernel / faults: 480 minor
07-22 14:39:36.257  1000  1607  1770 E ActivityManager:   7.6% 29355/com.netease.cloudmusic: 4.3% user + 3.3% kernel / faults: 3761 minor 30 major
07-22 14:39:36.257  1000  1607  1770 E ActivityManager:   5.8% 26832/com.xiaomi.smarthome: 3.3% user + 2.5% kernel / faults: 2441 minor 66 major
07-22 14:39:36.257  1000  1607  1770 E ActivityManager:   5.2% 637/jbd2/sda31-8: 0% user + 5.2% kernel
07-22 14:39:36.257  1000  1607  1770 E ActivityManager:   3% 147/kswapd0: 0% user + 3% kernel
07-22 14:39:36.257  1000  1607  1770 E ActivityManager:   0.1% 10301/kworker/u16:2: 0% user + 0.1% kernel
07-22 14:39:36.257  1000  1607  1770 E ActivityManager:   2.9% 9564/kworker/u16:4: 0% user + 2.9% kernel
07-22 14:39:36.257  1000  1607  1770 E ActivityManager:   0% 769/android.hardware.camera.provider@2.4-service_64: 0% user + 0% kernel / faults: 53 minor 18 major
07-22 14:39:36.257  1000  1607  1770 E ActivityManager:   2.8% 9364/kworker/u16:6: 0% user + 2.8% kernel
07-22 14:39:36.257  1000  1607  1770 E ActivityManager:   0.1% 1138/media.extractor: 0% user + 0% kernel / faults: 10157 minor 11 major
07-22 14:39:36.257  1000  1607  1770 E ActivityManager:   2.4% 11519/kworker/u16:10: 0% user + 2.4% kernel
07-22 14:39:36.257  1000  1607  1770 E ActivityManager:   2.4% 12915/com.xunmeng.pinduoduo:titan: 1.8% user + 0.6% kernel / faults: 905 minor
07-22 14:39:36.257  1000  1607  1770 E ActivityManager:   1.9% 2770/com.android.phone: 1.1% user + 0.7% kernel / faults: 3830 minor 35 major
07-22 14:39:36.257  1000  1607  1770 E ActivityManager:   1.9% 13728/kworker/u16:0: 0% user + 1.9% kernel
...

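The log prints the basic information of the ANR, and CPU usage lets us infer what the device was doing when it occurred. If CPU usage is very high, close to 100%, the process may be doing heavy computation or, more likely, is stuck in an infinite loop. If CPU usage is low, the main thread is probably blocked; if iowait is also high, the main thread may be waiting for an I/O operation to complete.
CPU usage here is not high, so the high-CPU case can be ruled out and a blocked main thread is the suspect. Stalled I/O is one cause of a blocked main thread, which agrees with the ANR call stack seen earlier.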
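
Reading the per-process lines by eye gets tedious in a long report. The sketch below assumes lines of the form `<total>% <pid>/<name>: <u>% user + <k>% kernel` as in the ActivityManager output above and ranks processes by total CPU. `top_cpu_processes` is a hypothetical helper, not an Android API.

```python
import re

NUM = r"(\d+(?:\.\d+)?)"
CPU_LINE_RE = re.compile(NUM + r"% (\d+)/(\S+): " + NUM + r"% user \+ " + NUM + r"% kernel")


def top_cpu_processes(log_text, limit=3):
    """Return the `limit` busiest processes from an ANR CPU usage block."""
    rows = []
    for line in log_text.splitlines():
        m = CPU_LINE_RE.search(line)
        if m:
            rows.append({
                "total": float(m.group(1)),
                "pid": int(m.group(2)),
                "name": m.group(3),
                "user": float(m.group(4)),
                "kernel": float(m.group(5)),
            })
    return sorted(rows, key=lambda row: row["total"], reverse=True)[:limit]
```

For the log above, the demo app tops the list at 28%, of which 25% is kernel time. That split would be consistent with a process spending its time in system calls such as `fdatasync` rather than in its own computation.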

The Reason field of the ANR is also very helpful for analysis. Here:
Reason: Input dispatching timed out (com.example.myexoplayervideodemo/com.example.myexoplayervideodemo.anrdemo.AnrTestActivity, Waiting to send non-key event because the touched window has not finished processing certain input events that were delivered to it over 500.0ms ago. Wait queue length: 19. Wait queue head age: 9042.8ms.)
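An input event that went unanswered caused the ANR, which again confirms that the main thread was blocked. Combined with the call stack or the CPU-related log, the cause of the ANR becomes reasonably clear.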
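
When many ANR reports have to be triaged, the Reason line can be reduced to a few fields. This sketch assumes the wording shown above for input-dispatch timeouts; other ANR types use different text, in which case only the type is returned. `parse_anr_reason` is a hypothetical name.

```python
import re


def parse_anr_reason(line):
    """Extract the ANR type and, if present, the input wait-queue statistics."""
    m = re.search(r"Reason: (.+?)(?:\s*\(|$)", line)
    if not m:
        return None
    info = {"type": m.group(1)}
    q = re.search(
        r"Wait queue length: (\d+)\.\s+Wait queue head age: (\d+(?:\.\d+)?)ms", line
    )
    if q:
        info["queue_length"] = int(q.group(1))
        info["head_age_ms"] = float(q.group(2))
    return info
```

For the line above this yields the type `Input dispatching timed out`, a queue length of 19, and a head age of 9042.8 ms, meaning the oldest undelivered input event had been waiting for about nine seconds.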

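Log analysis alone rarely pinpoints an ANR; the stack in the trace file has to be analyzed as well, and the two combined.
If the call stack you need is not in bugreport.txt, look for it in the trace file, and vice versa.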

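  • ANR problems come in every shape and have to be analyzed case by case; for lessons learned, see the article "ANR&Watchdog分析总结" (ANR & Watchdog analysis summary).
  • There are also visual tools for reading a bugreport:
    • Battery Historian
      bugreport.txt contains so much data that it is hard to read, so Google developed an analysis tool called Battery Historian for Android 5.0 (API 21) and later. It parses this text file and presents it as web-based charts, which is far more readable.
      Usage:
      https://github.com/google/battery-historian

    • ChkBugReport
      ChkBugReport, open-sourced by Sony
      https://github.com/sonyxperiadev/ChkBugReport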

PS: a brief introduction to Watchdog

Watchdog

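  • Overview
    The watchdog in Android monitors the running system and makes sure important system services keep working. When a fault occurs (such as a deadlock), it usually restarts the Android system. This mechanism helps keep the Android system running stably.
    Watchdog can be understood as "System Not Responding" (similar to ANR, but at the system level).

  • How to obtain
    Same as for ANR: collect the trace files and related logs
    bugreport, tombstone, logcat, trace, etc.

  • Interpretation
    Locating a Watchdog problem takes a large amount of log data for verification, so the demands on the logs are high.

Example:

  • When a Watchdog fires, the log contains entries like these: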
05-23 07:34:45.038  1000  1655  1856 W Watchdog: *** WATCHDOG KILLING SYSTEM PROCESS: Blocked in handler on foreground thread (android.fg), Blocked in handler on main thread (main), Blocked in handler on display thread (android.display), Blocked in handler on ActivityManager (ActivityManager)
05-23 07:34:45.040  1000  1655  1856 W Watchdog: android.fg annotated stack trace:
05-23 07:34:45.040  1000  1655  1856 W Watchdog:     at com.android.server.am.ActivityManagerService.broadcastIntent(ActivityManagerService.java:16218)
05-23 07:34:45.041  1000  1655  1856 W Watchdog:     - waiting to lock <0x00604b97> (a com.android.server.am.ActivityManagerService)
05-23 07:34:45.041  1000  1655  1856 W Watchdog:     at android.app.ContextImpl.sendOrderedBroadcastAsUser(ContextImpl.java:1317)
05-23 07:34:45.041  1000  1655  1856 W Watchdog:     at android.app.ContextImpl.sendOrderedBroadcastAsUser(ContextImpl.java:1279)
05-23 07:34:45.041  1000  1655  1856 W Watchdog:     at com.android.server.ConnectivityService.sendDataActivityBroadcast(ConnectivityService.java:2372)
05-23 07:34:45.041  1000  1655  1856 W Watchdog:     at com.android.server.ConnectivityService.access$600(ConnectivityService.java:245)
05-23 07:34:45.041  1000  1655  1856 W Watchdog:     at com.android.server.ConnectivityService$4.interfaceClassDataActivityChanged(ConnectivityService.java:1714)
05-23 07:34:45.041  1000  1655  1856 W Watchdog:     at com.android.server.NetworkManagementService.lambda$notifyInterfaceClassActivity$5(NetworkManagementService.java:458)
05-23 07:34:45.044  1000  1655  1856 W Watchdog:     at com.android.server.-$$Lambda$NetworkManagementService$D43p3Tqq7B3qaMs9AGb_3j0KZd0.sendCallback(Unknown Source:6)
05-23 07:34:45.044  1000  1655  1856 W Watchdog:     at com.android.server.NetworkManagementService.invokeForAllObservers(NetworkManagementService.java:358)
05-23 07:34:45.044  1000  1655  1856 W Watchdog:     - locked <0x037929aa> (a android.os.RemoteCallbackList)
05-23 07:34:45.045  1000  1655  1856 W Watchdog:     at com.android.server.NetworkManagementService.notifyInterfaceClassActivity(NetworkManagementService.java:458)
05-23 07:34:45.045  1000  1655  1856 W Watchdog:     at com.android.server.NetworkManagementService.access$1200(NetworkManagementService.java:122)
05-23 07:34:45.045  1000  1655  1856 W Watchdog:     at com.android.server.NetworkManagementService$NetdUnsolicitedEventListener.lambda$onInterfaceClassActivityChanged$0$NetworkManagementService$NetdUnsolicitedEventListener(NetworkManagementService.java:705)
05-23 07:34:45.045  1000  1655  1856 W Watchdog:     at com.android.server.-$$Lambda$NetworkManagementService$NetdUnsolicitedEventListener$0xWa9DGxTnoGVHppsM-nng2PygE.run(Unknown Source:10)
05-23 07:34:45.045  1000  1655  1856 W Watchdog:     at android.os.Handler.handleCallback(Handler.java:883)
05-23 07:34:45.045  1000  1655  1856 W Watchdog:     at android.os.Handler.dispatchMessage(Handler.java:100)
05-23 07:34:45.045  1000  1655  1856 W Watchdog:     at android.os.Looper.loop(Looper.java:241)
05-23 07:34:45.045  1000  1655  1856 W Watchdog:     at android.os.HandlerThread.run(HandlerThread.java:67)
05-23 07:34:45.046  1000  1655  1856 W Watchdog:     at com.android.server.ServiceThread.run(ServiceThread.java:44)
05-23 07:34:45.048  1000  1655  1856 W Watchdog: main annotated stack trace:
05-23 07:34:45.048  1000  1655  1856 W Watchdog:     at com.android.server.am.ActivityManagerService.isAppBad(ActivityManagerService.java:6181)
05-23 07:34:45.048  1000  1655  1856 W Watchdog:     - waiting to lock <0x00604b97> (a com.android.server.am.ActivityManagerService)
05-23 07:34:45.048  1000  1655  1856 W Watchdog:     at com.android.server.am.ActivityManagerService.access$2000(ActivityManagerService.java:426)
05-23 07:34:45.048  1000  1655  1856 W Watchdog:     at com.android.server.am.ActivityManagerService$LocalService.isAppBad(ActivityManagerService.java:19344)
05-23 07:34:45.049  1000  1655  1856 W Watchdog:     at com.android.server.job.JobSchedulerService.isComponentUsable(JobSchedulerService.java:2353)
05-23 07:34:45.049  1000  1655  1856 W Watchdog:     at com.android.server.job.JobSchedulerService.areComponentsInPlaceLocked(JobSchedulerService.java:2400)
05-23 07:34:45.049  1000  1655  1856 W Watchdog:     at com.android.server.job.controllers.StateController.wouldBeReadyWithConstraintLocked(StateController.java:130)
05-23 07:34:45.049  1000  1655  1856 W Watchdog:     at com.android.server.job.controllers.TimeController.checkExpiredDelaysAndResetAlarm(TimeController.java:325)
05-23 07:34:45.049  1000  1655  1856 W Watchdog:     - locked <0x0a8d85de> (a java.lang.Object)
05-23 07:34:45.049  1000  1655  1856 W Watchdog:     at com.android.server.job.controllers.TimeController$2.onAlarm(TimeController.java:451)
05-23 07:34:45.049  1000  1655  1856 W Watchdog:     at android.app.AlarmManager$ListenerWrapper.run(AlarmManager.java:252)
05-23 07:34:45.049  1000  1655  1856 W Watchdog:     at android.os.Handler.handleCallback(Handler.java:883)
05-23 07:34:45.049  1000  1655  1856 W Watchdog:     at android.os.Handler.dispatchMessage(Handler.java:100)
05-23 07:34:45.049  1000  1655  1856 W Watchdog:     at android.os.Looper.loop(Looper.java:241)
05-23 07:34:45.049  1000  1655  1856 W Watchdog:     at com.android.server.SystemServer.run(SystemServer.java:585)
05-23 07:34:45.049  1000  1655  1856 W Watchdog:     at com.android.server.SystemServer.main(SystemServer.java:381)
05-23 07:34:45.049  1000  1655  1856 W Watchdog:     at java.lang.reflect.Method.invoke(Native Method)
05-23 07:34:45.049  1000  1655  1856 W Watchdog:     at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:492)
05-23 07:34:45.050  1000  1655  1856 W Watchdog:     at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:919)
05-23 07:34:45.050  1000  1655  1856 W Watchdog: android.display annotated stack trace:
05-23 07:34:45.050  1000  1655  1856 W Watchdog:     at com.android.server.am.ActivityManagerService$LocalService.updateCpuStats(ActivityManagerService.java:18954)
05-23 07:34:45.050  1000  1655  1856 W Watchdog:     - waiting to lock <0x00604b97> (a com.android.server.am.ActivityManagerService)
05-23 07:34:45.050  1000  1655  1856 W Watchdog:     at com.android.server.wm.-$$Lambda$LYW1ECaEajjYgarzgKdTZ4O1fi0.run(Unknown Source:2)
05-23 07:34:45.050  1000  1655  1856 W Watchdog:     at android.os.Handler.handleCallback(Handler.java:883)
05-23 07:34:45.050  1000  1655  1856 W Watchdog:     at android.os.Handler.dispatchMessage(Handler.java:100)
05-23 07:34:45.050  1000  1655  1856 W Watchdog:     at android.os.Looper.loop(Looper.java:241)
05-23 07:34:45.050  1000  1655  1856 W Watchdog:     at android.os.HandlerThread.run(HandlerThread.java:67)
05-23 07:34:45.050  1000  1655  1856 W Watchdog:     at com.android.server.ServiceThread.run(ServiceThread.java:44)
05-23 07:34:45.053  1000  1655  1856 W Watchdog: ActivityManager annotated stack trace:
05-23 07:34:45.053  1000  1655  1856 W Watchdog:     at com.android.server.am.ActiveServices.serviceTimeout(ActiveServices.java:3877)
05-23 07:34:45.053  1000  1655  1856 W Watchdog:     - waiting to lock <0x00604b97> (a com.android.server.am.ActivityManagerService)
05-23 07:34:45.053  1000  1655  1856 W Watchdog:     at com.android.server.am.ActivityManagerService$MainHandler.handleMessage(ActivityManagerService.java:1944)
05-23 07:34:45.053  1000  1655  1856 W Watchdog:     at android.os.Handler.dispatchMessage(Handler.java:107)
05-23 07:34:45.053  1000  1655  1856 W Watchdog:     at android.os.Looper.loop(Looper.java:241)
05-23 07:34:45.053  1000  1655  1856 W Watchdog:     at android.os.HandlerThread.run(HandlerThread.java:67)
05-23 07:34:45.053  1000  1655  1856 W Watchdog:     at com.android.server.ServiceThread.run(ServiceThread.java:44)
05-23 07:34:45.053  1000  1655  1856 W Watchdog: *** GOODBYE!
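
A Watchdog log like this one lists several blocked threads, and the first question is whether they are all waiting for the same lock. The sketch below assumes the `<thread> annotated stack trace:` and `- waiting to lock <...>` line formats shown above; `blocked_threads` is a hypothetical helper.

```python
import re

SECTION_RE = re.compile(r"Watchdog: (\S+) annotated stack trace:")
WAITING_RE = re.compile(r"- waiting to lock <(0x[0-9a-f]+)> \(a ([\w.$]+)\)")


def blocked_threads(log_text):
    """Map each thread in the Watchdog log to the (lock id, class) it waits for."""
    waiting = {}
    current = None
    for line in log_text.splitlines():
        section = SECTION_RE.search(line)
        if section:
            current = section.group(1)
            continue
        lock = WAITING_RE.search(line)
        # Keep only the first wait per thread: the top of its stack.
        if lock and current and current not in waiting:
            waiting[current] = (lock.group(1), lock.group(2))
    return waiting
```

If every entry maps to the same lock id, as with `0x00604b97` in this log, the investigation reduces to finding the thread that holds that one lock.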

The log above shows that the Watchdog was triggered by threads waiting for lock 0x00604b97. Let's trace what is happening with this lock.
In the trace file, find the following entry:

"main" prio=5 tid=1 Blocked
  | group="main" sCount=1 dsCount=0 flags=1 obj=0x73039fc8 self=0x79b89e8c00
  | sysTid=1655 nice=-2 cgrp=default sched=0/0 handle=0x79b9f4ded0
  | state=S schedstat=( 339526112412 184124053201 1109981 ) utm=21067 stm=12885 core=4 HZ=100
  | stack=0x7ff8d71000-0x7ff8d73000 stackSize=8192KB
  | held mutexes=
  at com.android.server.am.ActivityManagerService.isAppBad(ActivityManagerService.java:6181)
  - waiting to lock <0x00604b97> (a com.android.server.am.ActivityManagerService) held by thread 138
  at com.android.server.am.ActivityManagerService.access$2000(ActivityManagerService.java:426)
  at com.android.server.am.ActivityManagerService$LocalService.isAppBad(ActivityManagerService.java:19344)
  at com.android.server.job.JobSchedulerService.isComponentUsable(JobSchedulerService.java:2353)
  at com.android.server.job.JobSchedulerService.areComponentsInPlaceLocked(JobSchedulerService.java:2400)
  at com.android.server.job.controllers.StateController.wouldBeReadyWithConstraintLocked(StateController.java:130)
  at com.android.server.job.controllers.TimeController.checkExpiredDelaysAndResetAlarm(TimeController.java:325)
  - locked <0x0a8d85de> (a java.lang.Object)
  at com.android.server.job.controllers.TimeController$2.onAlarm(TimeController.java:451)
  at android.app.AlarmManager$ListenerWrapper.run(AlarmManager.java:252)
  at android.os.Handler.handleCallback(Handler.java:883)
  at android.os.Handler.dispatchMessage(Handler.java:100)
  at android.os.Looper.loop(Looper.java:241)
  at com.android.server.SystemServer.run(SystemServer.java:585)
  at com.android.server.SystemServer.main(SystemServer.java:381)
  at java.lang.reflect.Method.invoke(Native method)
  at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:492)
  at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:919)

Lock 0x00604b97 is held by thread 138. Next, look at thread 138:

"Binder:1655_1D" prio=5 tid=138 Blocked
  | group="main" sCount=1 dsCount=0 flags=1 obj=0x19d9bb88 self=0x785c6f3800
  | sysTid=12593 nice=-2 cgrp=default sched=0/0 handle=0x7867541d50
  | state=S schedstat=( 62957020280 28452871663 279796 ) utm=4350 stm=1945 core=5 HZ=100
  | stack=0x786744b000-0x786744d000 stackSize=991KB
  | held mutexes=
  at com.android.server.wm.ActivityTaskManagerService$LocalService.getTopProcessState(ActivityTaskManagerService.java:6595)
  - waiting to lock <0x0a36c171> (a com.android.server.wm.WindowManagerGlobalLock) held by thread 19
  at com.android.server.am.OomAdjuster.computeOomAdjLocked(OomAdjuster.java:893)
  at com.android.server.am.OomAdjuster.updateOomAdjLocked(OomAdjuster.java:365)
  at com.android.server.am.ActivityManagerService.updateOomAdjLocked(ActivityManagerService.java:17818)
  at com.android.server.am.ActiveServices.realStartServiceLocked(ActiveServices.java:2703)
  at com.android.server.am.ActiveServices.bringUpServiceLocked(ActiveServices.java:2590)
  at com.android.server.am.ActiveServices.bindServiceLocked(ActiveServices.java:1805)
  at com.android.server.am.ActivityManagerService.bindIsolatedService(ActivityManagerService.java:14717)
  - locked <0x00604b97> (a com.android.server.am.ActivityManagerService)
  at android.app.IActivityManager$Stub.onTransact(IActivityManager.java:2377)
  at com.android.server.am.ActivityManagerService.onTransact(ActivityManagerService.java:3101)
  at android.os.Binder.execTransactInternal(Binder.java:1036)
  at android.os.Binder.execTransact(Binder.java:1003)

Lock 0x0a36c171 in turn is held by thread 19. Now find thread 19:

"android.anim" prio=5 tid=19 Native
  | group="main" sCount=1 dsCount=0 flags=1 obj=0x19d80998 self=0x78ca476400
  | sysTid=1851 nice=-10 cgrp=default sched=0/0 handle=0x78c601bd50
  | state=S schedstat=( 131446952824 47039277674 269369 ) utm=9781 stm=3362 core=1 HZ=100
  | stack=0x78c5f19000-0x78c5f1b000 stackSize=1039KB
  | held mutexes=
  kernel: (couldn't read /proc/self/task/1851/stack)
  native: #00 pc 00000000000cd7a4  /apex/com.android.runtime/lib64/bionic/libc.so (__ioctl+4)
  native: #01 pc 000000000008b39c  /apex/com.android.runtime/lib64/bionic/libc.so (ioctl+132)
  native: #02 pc 000000000005829c  /system/lib64/libbinder.so (android::IPCThreadState::talkWithDriver(bool)+256)
  native: #03 pc 000000000005915c  /system/lib64/libbinder.so (android::IPCThreadState::waitForResponse(android::Parcel*, int*)+60)
  native: #04 pc 0000000000058ef8  /system/lib64/libbinder.so (android::IPCThreadState::transact(int, unsigned int, android::Parcel const&, android::Parcel*, unsigned int)+180)
  native: #05 pc 000000000004df38  /system/lib64/libbinder.so (android::BpBinder::transact(unsigned int, android::Parcel const&, android::Parcel*, unsigned int)+72)
  native: #06 pc 00000000000a3374  /system/lib64/libgui.so (android::BpSurfaceComposer::setTransactionState(android::Vector<android::ComposerState> const&, android::Vector<android::DisplayState> const&, unsigned int, android::sp<android::IBinder> const&, android::InputWindowCommands const&, long, android::client_cache_t const&, std::__1::vector<android::ListenerCallbacks, std::__1::allocator<android::ListenerCallbacks>> const&)+520)
  native: #07 pc 00000000000bf550  /system/lib64/libgui.so (android::SurfaceComposerClient::Transaction::apply(bool)+896)
  at android.view.SurfaceControl.nativeApplyTransaction(Native method)
  at android.view.SurfaceControl.access$400(SurfaceControl.java:77)
  at android.view.SurfaceControl$Transaction.apply(SurfaceControl.java:2273)
  at android.view.SurfaceControl$Transaction.apply(SurfaceControl.java:2255)
  at android.view.SurfaceControl.closeTransaction(SurfaceControl.java:1038)
  - locked <0x0f99c07c> (a java.lang.Class<android.view.SurfaceControl>)
  at com.android.server.wm.WindowManagerService.closeSurfaceTransaction(WindowManagerService.java:929)
  - locked <0x0a36c171> (a com.android.server.wm.WindowManagerGlobalLock)
  at com.android.server.wm.WindowAnimator.animate(WindowAnimator.java:215)
  - locked <0x0a36c171> (a com.android.server.wm.WindowManagerGlobalLock)
  at com.android.server.wm.WindowAnimator.lambda$new$1$WindowAnimator(WindowAnimator.java:99)
  at com.android.server.wm.-$$Lambda$WindowAnimator$ddXU8gK8rmDqri0OZVMNa3Y4GHk.doFrame(lambda:-1)
  at android.view.Choreographer$CallbackRecord.run(Choreographer.java:965)
  at android.view.Choreographer.doCallbacks(Choreographer.java:791)
  at android.view.Choreographer.doFrame(Choreographer.java:722)
  at android.view.Choreographer$FrameDisplayEventReceiver.run(Choreographer.java:952)
  at android.os.Handler.handleCallback(Handler.java:883)
  at android.os.Handler.dispatchMessage(Handler.java:100)
  at android.os.Looper.loop(Looper.java:241)
  at android.os.HandlerThread.run(HandlerThread.java:67)
  at com.android.server.ServiceThread.run(ServiceThread.java:44)

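The culprit is now fairly evident: android::BpSurfaceComposer::setTransactionState
This is as far as the easily traceable watchdog information goes, but it already shows that the problem is related to SurfaceFlinger. From here, combine the code with other logs, and try to interpret the native part of the log further (reading native call stacks is covered in the Tombstone article), to finally locate the root cause.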

That concludes a simple Watchdog analysis walkthrough.
