The importance of proxies for web crawlers needs no further elaboration here; first, a flowchart of the proxy pool:
1. Fetching proxy IPs
Free proxies found online are all unreliable (you know why). I recommend a vendor, 讯代理 (Xdaili), which is dependable. This article uses their dynamic-switching plan: the API can be called once every 10 s and returns 5 proxy IPs per request.
import time
from threading import Thread

while True:
    try:
        begin = time.time()
        proxies_list = download_proxies()  # download_proxies requests a batch of proxy IPs from the vendor
        thread_list = []
        for proxies in proxies_list:  # check connectivity and store into Redis, one thread per proxy
            t = Thread(target=store_proxies, args=(proxies,))
            t.start()
            thread_list.append(t)
        for t in thread_list:
            t.join()
        end = time.time()
        if end - begin > 15:
            continue
        else:
            time.sleep(15.5 - (end - begin))
    except Exception as e:
        now = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time()))
        err_logger.error(str(now) + ' ' + str(e))
        time.sleep(10)
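The helper `download_proxies` used above is not shown in the original. As a minimal sketch, assuming the vendor's API returns one `ip:port` per line (`API_URL` below is a placeholder for the order URL from your own Xdaili account, and `parse_proxies` is my own helper, not part of the vendor's SDK):

```python
import requests

# Placeholder -- substitute the order URL issued by your proxy vendor account.
API_URL = 'http://example.com/api/get-proxies'


def parse_proxies(text):
    """Turn an `ip:port`-per-line response into requests-style proxies dicts."""
    proxies_list = []
    for line in text.splitlines():
        ip_port = line.strip()
        if ip_port:
            proxies_list.append({'http': 'http://' + ip_port,
                                 'https': 'http://' + ip_port})
    return proxies_list


def download_proxies():
    """Fetch one batch (e.g. 5 IPs) from the vendor's API."""
    resp = requests.get(API_URL, timeout=5)
    resp.raise_for_status()
    return parse_proxies(resp.text)
```

Each returned dict can be passed straight to `requests.get(..., proxies=proxies)` in the connectivity test below.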
2. Testing proxy connectivity
Connectivity of each proxy is tested with multiple threads; the test URL is the target site you intend to crawl.
import requests

ping_url = 'http://www.xxxx.com'  # the target site you intend to crawl

def connect_check(proxies):
    try:
        # anything other than a 200 response, or any exception, counts as a failed test
        status_code = requests.get(ping_url, proxies=proxies, timeout=3).status_code
        return status_code == 200
    except Exception as e:
        print(e)
        return False
3. Storing proxy IPs in Redis
A proxy that passes the connectivity test is stored as a key with an expiry time (e.g. 90 s) and a dummy value (e.g. 1) in a dedicated database (e.g. db 1).
import json
import time

import redis

conn = redis.Redis(db=1)

def store_proxies(proxies):
    conn_check = connect_check(proxies)  # the connectivity test from step 2
    if conn_check:
        proxies = json.dumps(proxies)
        duplicate_check = conn.exists(proxies)  # deduplicate within the pool
        if not duplicate_check:
            conn.setex(proxies, 90, 1)  # store into Redis with a 90 s expiry
            print('new proxies: ', proxies)
        else:
            print('Already existing proxies: ' + str(proxies))
    else:
        now = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time()))
        print(now + ' Can not connect ' + ping_url + ' -- proxies: ' + str(proxies))
4. RESTful interface
Start a service with a backend framework (e.g. Flask) that serves the live proxies to the crawler program.