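Background
Every node in our ES cluster has process monitoring: if the process is missing, the monitor automatically starts it again, and normally the restart goes through without trouble. Today one node kept raising a "no process" alert, so the only option was to log in to the node and read the logs.
Finding the problem
The logs show that the ES node fails on restart with the following error: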
4) Error injecting constructor, ElasticsearchException[java.io.IOException: failed to read [id:2, legacy:false, file:/data2/search/data/nodes/0/indices/QNpDowX_TwiIiqZlB9e92g/_state/state-2.st]]; nested: IOException[failed to read [id:2, legacy:false, file:/data2/search/data/nodes/0/indices/QNpDowX_TwiIiqZlB9e92g/_state/state-2.st]]; nested: IllegalStateException[class org.apache.lucene.store.BufferedChecksumIndexInput cannot seek backwards (pos=-16 getFilePointer()=0)];
at org.elasticsearch.gateway.GatewayMetaState.<init>(Unknown Source)
while locating org.elasticsearch.gateway.GatewayMetaState
Caused by: ElasticsearchException[java.io.IOException: failed to read [id:2, legacy:false, file:/data2/search/data/nodes/0/indices/QNpDowX_TwiIiqZlB9e92g/_state/state-2.st]]; nested: IOException[failed to read [id:2, legacy:false, file:/data2/search/data/nodes/0/indices/QNpDowX_TwiIiqZlB9e92g/_state/state-2.st]]; nested: IllegalStateException[class org.apache.lucene.store.BufferedChecksumIndexInput cannot seek backwards (pos=-16 getFilePointer()=0)];
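The error makes it clear that the ES node hit an exception on restart while reading /data2/search/data/nodes/0/indices/QNpDowX_TwiIiqZlB9e92g/_state/state-2.st. A quick check shows the file is empty (0 bytes). The pos=-16 in the message is what you would expect if the reader tries to seek to a 16-byte checksum footer measured from the end of a 0-byte file, so the read fails and the process exits during startup.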
ll /data2/search/data/nodes/0/indices/QNpDowX_TwiIiqZlB9e92g/_state/state-2.st
-rw-rw-r-- 1 search search 0 Oct 1 22:53 /data2/search/data/nodes/0/indices/QNpDowX_TwiIiqZlB9e92g/_state/state-2.st
Fixing the problem
Find all the empty state files, delete them, and restart the ES node. After that the node came up fine.
find /data*/search/data/nodes/0/indices/ | grep state | grep "\.st" | xargs ls -l | awk '{if($5==0)print $0}'
-rw-rw-r-- 1 search search 0 Oct 1 22:53 /data2/search/data/nodes/0/indices/QNpDowX_TwiIiqZlB9e92g/_state/state-2.st
-rw-rw-r-- 1 search search 0 Oct 1 22:53 /data3/search/data/nodes/0/indices/QNpDowX_TwiIiqZlB9e92g/_state/state-2.st
-rw-rw-r-- 1 search search 0 Oct 1 22:53 /data4/search/data/nodes/0/indices/QNpDowX_TwiIiqZlB9e92g/_state/state-1.st
-rw-rw-r-- 1 search search 0 Oct 1 22:53 /data4/search/data/nodes/0/indices/QNpDowX_TwiIiqZlB9e92g/_state/state-2.st
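Deleting files by hand is easy to get wrong, so here is a minimal sketch of the same cleanup as a small script. It assumes the ES node is stopped, and instead of deleting the empty files it moves them into a backup directory so they can still be inspected later. The function names and the backup directory are hypothetical and were not part of the original fix.

```shell
#!/bin/sh
# Sketch only: the function names and backup directory are illustrative assumptions.

# Print every zero-byte *.st file that lives in a _state directory
# under the given root directories.
find_empty_state_files() {
    find "$@" -type f -path '*/_state/*' -name '*.st' -size 0
}

# Move each empty state file into a backup directory instead of deleting it.
# Usage: quarantine_empty_state_files BACKUP_DIR ROOT...
quarantine_empty_state_files() {
    backup_dir=$1
    shift
    mkdir -p "$backup_dir"
    find_empty_state_files "$@" | while IFS= read -r f; do
        # Flatten the full path into the file name so that files with the
        # same name from different data disks do not overwrite each other.
        mv -- "$f" "$backup_dir/$(printf '%s' "$f" | tr '/' '_')"
    done
}
```

With the paths from this incident, `quarantine_empty_state_files /tmp/empty-st-backup /data*/search/data/nodes/0/indices/` should move the four files listed above out of the way, after which the node can be restarted.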
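Root cause
The logs show no errors on the ES node before the restart. It later turned out to be a machine failure that left the .st files empty, which in turn caused the node to fail on restart.
There is no good fix for this kind of machine failure. The only real protection is to keep multiple copies of the ES data, so that losing a machine does not mean losing data.
If you no longer trust the machine, the only option is to remove it from the ES cluster.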