read timeout:
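Reference: http://bert82503.iteye.com/blog/2184225
Problem
I have been writing a crawler recently and using Redis to deduplicate URLs, and problems kept cropping up. First, a couple of exceptions: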
redis.clients.jedis.exceptions.JedisDataException: ERR Protocol error: invalid multibulk length
at redis.clients.jedis.Protocol.processError(Protocol.java:127)
at redis.clients.jedis.Protocol.process(Protocol.java:161)
at redis.clients.jedis.Protocol.read(Protocol.java:215)
at redis.clients.jedis.Connection.readProtocolWithCheckingBroken(Connection.java:340)
at redis.clients.jedis.Connection.getIntegerReply(Connection.java:265)
at redis.clients.jedis.Jedis.zadd(Jedis.java:1385)
at com.shu2man.service.RedisServiceImpl.zadd(RedisServiceImpl.java:73)
at com.shu2man.crawler.Crawler.doCrawl(Crawler.java:68)
at com.shu2man.crawler.CrawlManager$WorkTask.run(CrawlManager.java:138)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
And this one:
redis.clients.jedis.exceptions.JedisConnectionException: java.net.SocketException: Connection reset
at redis.clients.util.RedisInputStream.ensureFill(RedisInputStream.java:202)
at redis.clients.util.RedisInputStream.readByte(RedisInputStream.java:40)
at redis.clients.jedis.Protocol.process(Protocol.java:151)
at redis.clients.jedis.Protocol.read(Protocol.java:215)
at redis.clients.jedis.Connection.readProtocolWithCheckingBroken(Connection.java:340)
at redis.clients.jedis.Connection.getIntegerReply(Connection.java:265)
at redis.clients.jedis.Jedis.zadd(Jedis.java:1385)
at com.shu2man.service.RedisServiceImpl.zadd(RedisServiceImpl.java:73)
at com.shu2man.crawler.Crawler.doCrawl(Crawler.java:68)
at com.shu2man.crawler.CrawlManager$WorkTask.run(CrawlManager.java:138)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
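The Redis connection itself is fine, and any single operation works without a problem, but the errors appear as soon as the crawler starts running.
Analysis (a guess)
Based on the article referenced at the top, the likely cause is that the Redis client (Jedis) uses its own custom input stream (the RedisInputStream in the stack trace above). The crawler runs many threads, and when several of them send commands over the same connection at the same time, the commands get mixed together on the wire, which leads to the exceptions.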
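To make the guess above more concrete, here is a minimal sketch of what interleaved commands could look like on the wire. It does not use Jedis; `RespInterleaveDemo`, `encode`, and `interleave` are hypothetical names for illustration, and the key and URLs are made up. It encodes two ZADD commands in the Redis serialization protocol (an array header `*N` followed by `$len` bulk strings) and then splices one into the other, as could happen if two threads wrote to one socket without coordination.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; not part of Jedis.
class RespInterleaveDemo {

    // Encode a command as a RESP array of bulk strings,
    // the wire format Redis clients use to send commands.
    static String encode(String... args) {
        StringBuilder sb = new StringBuilder();
        sb.append('*').append(args.length).append("\r\n");
        for (String arg : args) {
            int byteLen = arg.getBytes(StandardCharsets.UTF_8).length;
            sb.append('$').append(byteLen).append("\r\n").append(arg).append("\r\n");
        }
        return sb.toString();
    }

    // Simulate two writers sharing one stream: the first `split` characters
    // of `a`, then all of `b`, then the remainder of `a`.
    static String interleave(String a, String b, int split) {
        return a.substring(0, split) + b + a.substring(split);
    }

    public static void main(String[] args) {
        String first = encode("ZADD", "urls", "1", "http://a.example/");
        String second = encode("ZADD", "urls", "2", "http://b.example/");
        String mixed = interleave(first, second, 1);
        // Print with visible line terminators so the broken framing is easy to see.
        System.out.println(mixed.replace("\r\n", "\\r\\n"));
    }
}
```

In this contrived split the stream begins with `**4`, which is no longer a valid array header, so a server reading it would have reason to reject the request with a protocol error. Whether this is what actually happened in the crawler remains an assumption.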
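Solution
Because I access Redis through my own wrapper class, marking the wrapper methods as synchronized is enough to avoid the problem described above.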
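The fix can be sketched roughly as follows. This is not the original RedisServiceImpl: `SortedSetConnection` is a hypothetical stand-in for the single shared connection, and the in-memory fake exists only so the example runs without a Redis server. The point is simply that every method touching the shared connection is synchronized, so only one thread uses it at a time.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a single shared, non-thread-safe Redis connection.
interface SortedSetConnection {
    long zadd(String key, double score, String member);
}

// Sketch of a wrapper: each call on the shared connection holds the
// wrapper's lock, so commands from different threads cannot overlap.
class SynchronizedRedisService {
    private final SortedSetConnection connection;

    SynchronizedRedisService(SortedSetConnection connection) {
        this.connection = connection;
    }

    public synchronized long zadd(String key, double score, String member) {
        return connection.zadd(key, score, member);
    }
}

class SynchronizedRedisServiceDemo {
    // Starts `threads` workers that each add `perThread` distinct members and
    // returns how many members the fake connection reported as newly added.
    static long run(int threads, int perThread) {
        // In-memory fake backed by a plain HashMap, which is itself not thread-safe.
        Map<String, Double> store = new HashMap<>();
        SortedSetConnection fake = (key, score, member) ->
                store.put(key + "|" + member, score) == null ? 1L : 0L;
        SynchronizedRedisService service = new SynchronizedRedisService(fake);

        long[] added = new long[threads]; // one slot per worker, no sharing
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    added[id] += service.zadd("urls", i, "url-" + id + "-" + i);
                }
            });
            workers[t].start();
        }
        long total = 0;
        try {
            for (int t = 0; t < threads; t++) {
                workers[t].join();
                total += added[t];
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(run(8, 1000));
    }
}
```

The trade-off is that all Redis calls are serialized through one lock. Jedis instances are not meant to be shared between threads, so a commonly recommended alternative is to have each thread borrow its own connection from a JedisPool.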
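Although adding synchronized avoids the problem, in my experience even many highly concurrent workloads do not end up with interleaved commands. My crawler issues only a handful to a dozen or so requests per second, which hardly seems enough to cause this. Still, the problem is real, and the underlying cause deserves further investigation.
The analysis above is only my own guess. If you have a better explanation, please share it; comments are welcome.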