1.背景
最近为了方便监控系统的一个调用链路情况,所以在测试环境引入了公司的skywalking组件。
引入方式也比较简单,就是在jvm参数里面增加:
-javaagent:${HOME}/apm-agent/skywalking-agent.jar
随后我基本上将所有的服务都加好了,但是其中有一个服务每次加了以后,过几秒钟就内存溢出,打印出了dump堆栈。一还原就好了。
2.排查
1.拿到了堆栈信息后,我首先就是用MAT看一下堆的信息
看了一下堆信息,这个是很正常的,因为我的老年代的设置的是2G,远远没有达到。所以继续观察他的gc日志情况。
2021-11-11T16:33:58.524+0800: 323.029: [Full GC (Metadata GC Threshold) 2021-11-11T16:33:58.524+0800: 323.029: [CMS2021-11-11T16:33:59.179+0800: 323.684: [CMS-concurrent-mark: 0.655/0.673 secs] [Times: user=1.39 sys=0.01, real=0.67 secs]
(concurrent mode failure): 198752K->198769K(2975360K), 1.2596357 secs] 207347K->198769K(3128704K), [Metaspace: 125871K->125871K(1163264K)], 1.2605925 secs] [Times: user=1.95 sys=0.01, real=1.26 secs]
2021-11-11T16:33:59.784+0800: 324.289: [Full GC (Last ditch collection) 2021-11-11T16:33:59.784+0800: 324.289: [CMS: 198769K->198758K(2975360K), 0.5616954 secs] 198769K->198758K(3128704K), [Metaspace: 125871K->125871K(1163264K)], 0.5626714 secs] [Times: user=0.56 sys=0.01, real=0.56 secs]
2021-11-11T16:34:00.348+0800: 324.853: [Full GC (Metadata GC Threshold) 2021-11-11T16:34:00.348+0800: 324.853: [CMS: 198758K->198758K(2975360K), 0.5353440 secs] 199081K->198758K(3128704K), [Metaspace: 125871K->125871K(1163264K)], 0.5362917 secs] [Times: user=0.53 sys=0.00, real=0.54 secs]
2021-11-11T16:34:00.884+0800: 325.389: [Full GC (Last ditch collection) 2021-11-11T16:34:00.884+0800: 325.389: [CMS: 198758K->198758K(2975360K), 0.5356921 secs] 198758K->198758K(3128704K), [Metaspace: 125871K->125871K(1163264K)], 0.5365914 secs] [Times: user=0.54 sys=0.00, real=0.53 secs]
2021-11-11T16:34:01.422+0800: 325.927: [GC (GCLocker Initiated GC) 2021-11-11T16:34:01.422+0800: 325.927: [ParNew: 744K->17K(153344K), 0.0121477 secs] 199503K->198775K(3128704K), 0.0122225 secs] [Times: user=0.02 sys=0.00, real=0.01 secs]
2021-11-11T16:34:01.435+0800: 325.940: [GC (CMS Initial Mark) [1 CMS-initial-mark: 198758K(2975360K)] 199351K(3128704K), 0.0031080 secs] [Times: user=0.01 sys=0.00, real=0.00 secs]
2021-11-11T16:34:01.438+0800: 325.943: [CMS-concurrent-mark-start]
2021-11-11T16:34:01.441+0800: 325.946: [Full GC (Metadata GC Threshold) 2021-11-11T16:34:01.441+0800: 325.946: [CMS2021-11-11T16:34:01.981+0800: 326.485: [CMS-concurrent-mark: 0.541/0.543 secs] [Times: user=1.16 sys=0.01, real=0.54 secs]
(concurrent mode failure): 198758K->198813K(2975360K), 1.1262069 secs] 203361K->198813K(3128704K), [Metaspace: 125877K->125877K(1163264K)], 1.1271317 secs] [Times: user=1.73 sys=0.02, real=1.13 secs]
2021-11-11T16:34:02.568+0800: 327.073: [Full GC (Last ditch collection) 2021-11-11T16:34:02.568+0800: 327.073: [CMS: 198813K->198793K(2975360K), 0.5388929 secs] 198813K->198793K(3128704K), [Metaspace: 125877K->125877K(1163264K)], 0.5398068 secs] [Times: user=0.53 sys=0.00, real=0.54 secs]
2021-11-11T16:34:03.108+0800: 327.613: [Full GC (Metadata GC Threshold) 2021-11-11T16:34:03.108+0800: 327.613: [CMS: 198793K->198793K(2975360K), 0.5337559 secs] 199288K->198793K(3128704K), [Metaspace: 125877K->125877K(1163264K)], 0.5347214 secs] [Times: user=0.52 sys=0.01, real=0.54 secs]
2021-11-11T16:34:03.643+0800: 328.148: [Full GC (Last ditch collection) 2021-11-11T16:34:03.643+0800: 328.148: [CMS: 198793K->198793K(2975360K), 0.5559178 secs] 198793K->198793K(3128704K), [Metaspace: 125877K->125877K(1163264K)], 0.5568365 secs] [Times: user=0.54 sys=0.02, real=0.55 secs]
2021-11-11T16:34:04.201+0800: 328.706: [Full GC (Metadata GC Threshold) 2021-11-11T16:34:04.201+0800: 328.706: [CMS: 198793K->198794K(2975360K), 0.5220216 secs] 198811K->198794K(3128704K), [Metaspace: 125877K->125877K(1163264K)], 0.5229798 secs] [Times: user=0.52 sys=0.00, real=0.53 secs]
2021-11-11T16:34:04.724+0800: 329.229: [Full GC (Last ditch collection) 2021-11-11T16:34:04.724+0800: 329.229: [CMS: 198794K->198794K(2975360K), 0.5336318 secs] 198794K->198794K(3128704K), [Metaspace: 125877K->125877K(1163264K)], 0.5345657 secs] [Times: user=0.54 sys=0.00, real=0.53 secs]
2021-11-11T16:34:05.259+0800: 329.764: [Full GC (Metadata GC Threshold) 2021-11-11T16:34:05.259+0800: 329.764: [CMS
这里就发现了大量的Full GC,并且是Metadata相关的,那么Metadata是属于存放类的加载、常量等的信息的地方。
3.原因
应该是skywalking底层是使用的字节码技术创建了类的,而且刚好这个服务在启动的时候,有一个定时任务大量的触发skywalking去创建类。然后这些临时的类信息放在MetaSpace里,由于还没来得及回收,就塞满了,导致触发gc,但这个时候都还在被引用着。所以最终导致MetaSpace OOM
4.解决办法
我们设置的128M,然后我调大改成了512M,问题就解决了。同时对于这种创建通过反射去处理的逻辑平时也需要考虑元空间的溢出