I still don't really understand how proxies actually work, or how to tell whether one is doing its job, but in the spirit of "figure it out by taking a shot at it first", let's build the thing and worry about the rest later. It might come in handy someday.
I started by trawling the web and found a few sites that offer free proxies.
(Spoiler: "you get what you pay for" is an eternal truth, and free is the bottom of the barrel. If you want stable, usable proxies, spend a little money on paid ones; the success rate is noticeably higher.)
Grabbing the proxies is simple, so I won't go on about it. Here's some code.
import requests
from pyquery import PyQuery as pq

base_headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'zh-CN,zh;q=0.8'
}

def crawl_66ip():
    # 66ip lists proxies per area, one listing page each
    start_url = 'http://www.66ip.cn/areaindex_{}/1.html'
    urls = [start_url.format(page) for page in range(1, 35)]
    for url in urls:
        print('Crawling', url)
        response = requests.get(url, headers=base_headers)
        html = response.text
        if html:
            doc = pq(html)
            # tr:gt(0) skips the table header row
            trs = doc('.containerbox table tr:gt(0)').items()
            for tr in trs:
                ip = tr.find('td:nth-child(1)').text()
                port = tr.find('td:nth-child(2)').text()
                proxy = ':'.join([ip, port])
                save_to_mongo(proxy, 0)  # save_to_mongo is defined in the next snippet

def crawl_proxy360():
    url = 'http://www.proxy360.cn/Region/China'
    response = requests.get(url, headers=base_headers)
    html = response.text
    if html:
        doc = pq(html)
        lines = doc('div[name="list_proxy_ip"]').items()
        for line in lines:
            ip = line.find('.tbBottomLine:nth-child(1)').text()
            port = line.find('.tbBottomLine:nth-child(2)').text()
            proxy = ':'.join([ip, port])
            save_to_mongo(proxy, 0)

def crawl_goubanjia():
    url = 'http://www.goubanjia.com/free/gngn/index.shtml'
    response = requests.get(url, headers=base_headers)
    html = response.text
    if html:
        doc = pq(html)
        tds = doc('td.ip').items()
        for td in tds:
            # goubanjia hides junk inside <p> tags to trip up scrapers; drop them
            td.find('p').remove()
            proxy = td.text().replace(' ', '')
            save_to_mongo(proxy, 0)

def crawl_haoip():
    url = 'http://haoip.cc/tiqu.htm'
    response = requests.get(url, headers=base_headers)
    html = response.text
    if html:
        doc = pq(html)
        # proxies here are plain text separated by <br/> tags
        results = doc('.row .col-xs-12').html().split('<br/>')
        for result in results:
            if result:
                proxy = result.strip()
                save_to_mongo(proxy, 0)
Then a function to shove the proxies into MongoDB. As for why not some other database: because I don't know any other database yet, hahaha.
Code again:
import pymongo

client = pymongo.MongoClient('localhost', 27017)
db = client['proxies']
sheet = db['proxies']

def save_to_mongo(proxy, status):
    data = {
        'proxy': proxy,
        'status': status,
    }
    # upsert: insert if the proxy is new, otherwise overwrite the existing record
    sheet.replace_one({'proxy': data['proxy']}, data, upsert=True)
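At this point each record in the collection looks something like this (the values here are made up, just to show the shape):

{'proxy': '123.45.67.89:8080', 'status': 0}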
The status field is there for the testing step coming up. With everything stored in MongoDB, time to test:
def test_proxy(proxy, url='https://www.baidu.com'):
    # cover both schemes; with only an 'http' entry, requests would send
    # an https URL directly and every proxy would "pass" the test
    proxies = {
        'http': 'http://' + proxy,
        'https': 'http://' + proxy,
    }
    try:
        res = requests.get(url, proxies=proxies, timeout=15, headers=base_headers)
        if res.status_code == 200:
            print(proxy, 'test passed')
            sheet.find_one_and_update({'proxy': proxy}, {'$set': {'status': 1}})
        else:
            print(proxy, 'test failed')
            sheet.find_one_and_update({'proxy': proxy}, {'$set': {'status': 0}})
    except Exception:
        print(proxy, 'test failed')
        sheet.find_one_and_update({'proxy': proxy}, {'$set': {'status': 0}})
Testing works by hitting Baidu through the proxy; if the request succeeds, status gets updated to 1, and later I just filter on that to find usable proxies. (Everyone seems to use Baidu for proxy testing, but I figure I should test against whatever site I'm about to crawl, so Baidu is only the default value of the url parameter.)
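For the record, that filtering step can be as simple as the sketch below. To be clear, get_working_proxies and get_random_proxy are names I'm inventing here, not part of the script above:

import random

def get_working_proxies():
    # keep only proxies that passed the test (status set to 1 above)
    return [doc['proxy'] for doc in sheet.find({'status': 1})]

def get_random_proxy():
    # hypothetical helper: grab one working proxy at random
    return random.choice(get_working_proxies())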
What's left is garbage time: drop in the code and wait.
from multiprocessing import Pool

def get_proxies():
    crawl_66ip()
    crawl_proxy360()
    crawl_haoip()
    crawl_goubanjia()

def get_proxies_list():
    proxies_list = []
    for i in sheet.find():
        proxies_list.append(i['proxy'])
    return proxies_list

def main():
    get_proxies()
    proxies_list = get_proxies_list()
    # test proxies in parallel, one worker process per CPU core by default
    pool = Pool()
    pool.map(test_proxy, proxies_list)

if __name__ == '__main__':
    main()
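To close the loop, here's a rough sketch of how a pooled proxy would plug into an actual request. The target URL is just a placeholder, and get_random_proxy is the hypothetical helper sketched earlier:

proxy = get_random_proxy()  # hypothetical helper from the sketch above
proxies = {
    'http': 'http://' + proxy,
    'https': 'http://' + proxy,
}
# placeholder URL: swap in whatever site you actually want to crawl
res = requests.get('https://www.baidu.com', proxies=proxies, timeout=15, headers=base_headers)
print(res.status_code)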
And there you have it, a simple proxy pool~
Well, not really, because I have no idea yet whether it works or how well. Tomorrow I'll try it out by crawling Lianjia housing listings; I've been eyeing those for ages.
Messy logic, crude technique, terrible writing, but at least I'm good-looking o(▽)o