第一:问题是camera APP ANR
第二:log分析
- 从event log中确定ANR的时间点和原因
07-31 05:21:14.132 1000 1410 24564 I am_anr : [0,1700,com.xx.xx.camera,551042629,Input dispatching timed out (ActivityRecord{32c3587 u0 com.xx.xx.camera/.Camera t237} does not have a focused window)]
原因是no focus window类型
从event中接着看anr之前进程的大概的生命周期流程
07-31 05:20:34.818 1000 1410 1426 I wm_add_to_stopping: [0,53228935,com.xx.xx.camera/.Camera,makeInvisible]
07-31 05:20:35.246 1000 1410 1428 I wm_stop_activity: [0,53228935,com.xx.xx.camera/.Camera]
//这里是系统进程打印的,表示系统进程已经触发了stop动作,但是APP那边还没有调用呢
07-31 05:20:40.883 1000 1410 6462 I wm_set_resumed_activity: [0,com.xx.xx.camera/.Camera,bringingFoundTaskToFront]
07-31 05:20:40.892 1000 1410 2077 I wm_set_resumed_activity: [0,com.xx.xx.camera/.Camera,resumeTopActivityInnerLocked]
07-31 05:20:40.894 1000 1410 2077 I wm_resume_activity: [0,53228935,237,com.xx.xx.camera/.Camera]
07-31 05:21:14.132 1000 1410 24564 I am_anr : [0,1700,com.xx.xx.camera,551042629,Input dispatching timed out (ActivityRecord{32c3587 u0 com.xx.xx.camera/.Camera t237} does not have a focused window)]
07-31 05:14:54.275 10079 1700 1700 I wm_on_stop_called: [105910940,com.xx.xx.camera.Camera,LIFECYCLER_STOP_ACTIVITY]
//这种才是真正的执行到了APP的进程,通过前面的进程号也可以看出来。
从上面的event log来看,APP应该是想要执行stop动作,并resume,但是stop并没有调用到,APP的resume方法也没有执行到,那下面就看下AMS log吧
-
AMS log + APP进程log
07-31 05:20:33.787 10079 1700 1700 D WCamera/BaseModuleUI_Capture: onKeyUp keyCode:4//APP这边接到back按键 ... 07-31 05:20:46.246 1000 1410 1428 W ActivityTaskManager: Activity stop timeout for ActivityRecord{32c3587 u0 com.xx.xx.camera/.Camera t237}//从用户按下back键以后到stop timeout,中间没有任何主线程的输出,这是不是很诡异?APP的生命周期竟然没有走,上面的event log也显示出来了APP进程的生命周期没有走。 07-31 05:20:46.246 1000 1410 1428 I ActivityTaskManager: Activity reported stop, but no longer stopping: ActivityRecord{32c3587 u0 com.xx.xx.camera/.Camera t237} 07-31 05:20:50.888 1000 1410 1428 W ActivityTaskManager: Launch timeout has expired, giving up wake lock!
通过AMS查看到在ANR之前camera这边显示stop timeout了,紧接着后面就launch timeout,后面接着就会ANR。
-
trace文件
"main" prio=5 tid=1 Native | group="main" sCount=1 dsCount=0 flags=1 obj=0x743ec290 self=0xb40000779bf45c00 | sysTid=1700 nice=-10 cgrp=default sched=0/0 handle=0x779d694500 | state=S schedstat=( 170541878830 13965398125 451399 ) utm=12381 stm=4672 core=5 HZ=100 | stack=0x7ff130d000-0x7ff130f000 stackSize=8192KB | held mutexes= ..... at android.os.MessageQueue.nativePollOnce(Native method) at android.os.MessageQueue.next(MessageQueue.java:335) at android.os.Looper.loop(Looper.java:206) at android.app.ActivityThread.main(ActivityThread.java:8595) at java.lang.reflect.Method.invoke(Native method) at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:602) at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:1130)
trace文件竟然显示不卡,这有点奇怪了,因为如果是APP耗时导致anr,trace文件中应该能打到APP的信息才对。但是这里并没有,而且多份log都是类似的现象。那什么原因会导致生命周期不走,并且主线程还不卡,CPU内存等信息也都正常,应用就像停滞不前一样?
通过上面的分析,我们重新看了下消息队列的源码,Message有三种类型,异步消息,屏障消息,同步消息,个人的理解是:屏障消息就像一个过滤门,当队列中有这个屏障时,只有异步消息可以通过,同步消息无法通过。也就是说当屏障消息在队列中时,只有异步消息可以执行,同步消息只能在哪里傻傻的等待。
第三:监控消息队列
Handler.java
public final void dump(@NonNull Printer pw, @NonNull String prefix) {
pw.println(prefix + this + " @ " + SystemClock.uptimeMillis());
if (mLooper == null) {
pw.println(prefix + "looper uninitialized");
} else {
mLooper.dump(pw, prefix + " ");
}
}
Looper.java
public void dump(@NonNull Printer pw, @NonNull String prefix, Handler handler) {
pw.println(prefix + toString());
mQueue.dump(pw, prefix + " ", handler);
}
MessageQueue.java
void dump(Printer pw, String prefix, Handler h) {
synchronized (this) {
long now = SystemClock.uptimeMillis();
int n = 0;
for (Message msg = mMessages; msg != null; msg = msg.next) {
if (h == null || h == msg.target) {
pw.println(prefix + "Message " + n + ": " + msg.toString(now));
}
n++;
}
pw.println(prefix + "(Total messages: " + n + ", polling=" + isPollingLocked()
+ ", quitting=" + mQuitting + ")");
}
}
通过上面的代码可以看到,handler的dump动作会将消息队列中的所有message打印出来,所以我们就新启动了一个线程每隔几秒钟去dump下队列,这种方式可能比较笨拙,有兴趣的同事自行去查看三方APM监控系统的原理,三方的回更详尽
mMainHandler = new Handler(getMainLooper());
startTrace();
private void startTrace() {
Thread thread = new Thread(new Runnable() {
@Override
public void run() {
while (true) {
if (mMainHandler != null) {
mMainHandler.dump(new Printer() {
@Override
public void println(String x) {
Log.d("Main_Message", "dump: " +x);
}
}, "<-->");
}
checkBarrier();
try {
Thread.sleep(3000);
} catch (InterruptedException e) {
e.printStackTrace();
}
}
}
}, "MessageTrace");
thread.start();
}
最终我们根据跑出来的log,可以断定就是因为屏障消息没有移除导致应用假死状态,log如下
行 17455: 07-27 03:52:25.604709 21248 21271 I Main_Message: dump:<--> Message 0: { when=-3s757ms barrier=25079 }
行 52249: 07-27 03:52:28.608549 21248 21271 I Main_Message: dump:<--> Message 0: { when=-6s761ms barrier=25079 }
行 87576: 07-27 03:52:31.610268 21248 21271 I Main_Message: dump:<--> Message 0: { when=-9s762ms barrier=25079 }
行 122103: 07-27 03:52:34.612396 21248 21271 I Main_Message: dump:<--> Message 0: { when=-12s764ms barrier=25079 }
行 160840: 07-27 03:52:37.615343 21248 21271 I Main_Message: dump:<--> Message 0: { when=-15s767ms barrier=25079 }
第四:屏障消息在何时添加何时移除?
下面看下源码
ViewRootImpl.java
@UnsupportedAppUsage
void scheduleTraversals() {
if (!mTraversalScheduled) {
mTraversalScheduled = true;
mTraversalBarrier = mHandler.getLooper().getQueue().postSyncBarrier();//向消息队列中添加屏障消息
mChoreographer.postCallback(
Choreographer.CALLBACK_TRAVERSAL, mTraversalRunnable, null);//这里会post一个mTraversalRunnable,等vsync过来后就会执行这个callback,然后执行doTraversal()方法移除上面的屏障消息,
notifyRendererOfFramePending();
pokeDrawLockIfNeeded();
}
}
void unscheduleTraversals() {
if (mTraversalScheduled) {
mTraversalScheduled = false;
mHandler.getLooper().getQueue().removeSyncBarrier(mTraversalBarrier);//移除屏障消息
mChoreographer.removeCallbacks(
Choreographer.CALLBACK_TRAVERSAL, mTraversalRunnable, null);
}
}
void doTraversal() {
if (mTraversalScheduled) {
mTraversalScheduled = false;
mHandler.getLooper().getQueue().removeSyncBarrier(mTraversalBarrier);//移除屏障消息
if (mProfile) {
Debug.startMethodTracing("ViewAncestor");
}
performTraversals();
if (mProfile) {
Debug.stopMethodTracing();
mProfile = false;
}
}
}
final class TraversalRunnable implements Runnable {
@Override
public void run() {
doTraversal();
}
}
在ViewRootImpl.java中可以看到调用scheduleTraversals()方法去添加屏障消息的地方很多,有requestChildFocus,clearChildFocus,recomputeViewAttributes,setLayoutParams,handleAppVisibility,handleGetNewSurface,requestFitSystemWindows,notifyInsetsChanged,requestLayout,invalidate,invalidateRectOnScreen,setWindowStopped,handleDispatchSystemUiVisibilityChanged等等。由于我们没有办法调试framework代码,添加屏障的地方有多处,移除的地方应该就一处(这里没算unscheduleTraversals)。从代码来看这里仅仅用一个变量mTraversalScheduled来控制逻辑,没有任何同步代码,所以如果触发并发场景应该很容易导致问题。
MTK在这几个函数里面加了log,我们开log后:
07-27 03:52:21.773030 21248 21284 V ViewRootImpl[Camera]: scheduleTraversals: mTraversalBarrier = 25075,this = android.view.ViewRootImpl@89e32cc
07-27 03:52:21.773030 21248 21284 V ViewRootImpl[Camera]: java.lang.Throwable: scheduleTraversals
07-27 03:52:21.773030 21248 21284 V ViewRootImpl[Camera]: at android.view.ViewRootImpl.scheduleTraversals(ViewRootImpl.java:1984)
07-27 03:52:21.773030 21248 21284 V ViewRootImpl[Camera]: at android.view.ViewRootImpl.invalidate(ViewRootImpl.java:1677)
07-27 03:52:21.773030 21248 21284 V ViewRootImpl[Camera]: at android.view.ViewRootImpl.onDescendantInvalidated(ViewRootImpl.java:1670)
07-27 03:52:21.773030 21248 21284 V ViewRootImpl[Camera]: at android.view.ViewGroup.onDescendantInvalidated(ViewGroup.java:6085)
07-27 03:52:21.773030 21248 21284 V ViewRootImpl[Camera]: at android.view.ViewGroup.onDescendantInvalidated(ViewGroup.java:6085)
07-27 03:52:21.773030 21248 21284 V ViewRootImpl[Camera]: at android.view.ViewGroup.onDescendantInvalidated(ViewGroup.java:6085)
07-27 03:52:21.773030 21248 21284 V ViewRootImpl[Camera]: at android.view.ViewGroup.onDescendantInvalidated(ViewGroup.java:6085)
07-27 03:52:21.773030 21248 21284 V ViewRootImpl[Camera]: at android.view.ViewGroup.onDescendantInvalidated(ViewGroup.java:6085)
07-27 03:52:21.773030 21248 21284 V ViewRootImpl[Camera]: at android.view.ViewGroup.invalidateChild(ViewGroup.java:6103)
07-27 03:52:21.773030 21248 21284 V ViewRootImpl[Camera]: at android.view.View.invalidateInternal(View.java:18587)
07-27 03:52:21.773030 21248 21284 V ViewRootImpl[Camera]: at android.view.View.invalidate(View.java:18544)
07-27 03:52:21.773030 21248 21284 V ViewRootImpl[Camera]: at android.view.View.invalidate(View.java:18526)
07-27 03:52:21.773030 21248 21284 V ViewRootImpl[Camera]: at com.xx.xx.camera.ui.overlay.RenderOverlay.update(RenderOverlay.java:90)
07-27 03:52:21.773030 21248 21284 V ViewRootImpl[Camera]: at com.xx.xx.camera.ui.overlay.OverlayRenderer.update(OverlayRenderer.java:85)
07-27 03:52:21.773030 21248 21284 V ViewRootImpl[Camera]: at com.xx.xx.camera.ui.overlay.ToolTipRenderer.hide(ToolTipRenderer.java:392)
07-27 03:52:21.773030 21248 21284 V ViewRootImpl[Camera]: at com.xx.xx.camera.manager.UIManager.hideToolTipRenderer(UIManager.java:5325)
07-27 03:52:21.773030 21248 21284 V ViewRootImpl[Camera]: at com.xx.xx.camera.moduleui.LiveFocusUI.hideLiveTipsToast(LiveFocusUI.java:638)
////这里就是出问题的点,线程号是21284,是子线程调用的,最终通过invalidate方法调用到ViewRootImpl中的添加屏障
07-27 03:52:21.773030 21248 21284 V ViewRootImpl[Camera]: at com.xx.xx.camera.module.LiveFocusModule.onPreviewCaptureCompleted(LiveFocusModule.java:644)
07-27 03:52:21.773030 21248 21284 V ViewRootImpl[Camera]: at com.xx.xx.camera.one.OneCameraImpl$1.onCaptureCompleted(OneCameraImpl.java:1123)
07-27 03:52:21.773030 21248 21284 V ViewRootImpl[Camera]: at android.hardware.camera2.impl.CameraCaptureSessionImpl$1.lambda$onCaptureCompleted$3$CameraCaptureSessionImpl$1(CameraCaptureSessionImpl.java:692)
07-27 03:52:21.773030 21248 21284 V ViewRootImpl[Camera]: at android.hardware.camera2.impl.-$$Lambda$CameraCaptureSessionImpl$1$OA1Yz_YgzMO8qcV8esRjyt7ykp4.run(Unknown Source:8)
07-27 03:52:21.773030 21248 21284 V ViewRootImpl[Camera]: at android.os.Handler.handleCallback(Handler.java:938)
07-27 03:52:21.773030 21248 21284 V ViewRootImpl[Camera]: at android.os.Handler.dispatchMessage(Handler.java:99)
07-27 03:52:21.773030 21248 21284 V ViewRootImpl[Camera]: at android.os.Looper.loop(Looper.java:223)
07-27 03:52:21.773030 21248 21284 V ViewRootImpl[Camera]: at android.os.HandlerThread.run(HandlerThread.java:67)
虽然在子线程可以调用invalidate,但是很容易出现风险。我们自己写demo启动两个子线程去调用控件的invalidate方法,几秒钟后就会出现卡屏障消息的情况了。重新启动APP就会ANR,而ANR的trace与用户的trace完全一致。
第五:问题出现的原因与解决方案
- 问题原因:子线程调用invalidate方法,导致屏障消息被插入多次,会存在没有移除屏障消息的情况
- 解决方案:在主线程中调用invalidate方法,或者在子线程中调用postInvalidate方法。这两种方式最终都是在主线程中调用scheduleTraversals的,所以不会存在并发的情况
- 说明:像setEnable,setAlpha等方法也会间接调用invalidate方法,所以这些方法尽量也不要在子线程中调用。插入屏障还有好多方法,那些方法尽量也放到主线程中去调用。