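Building your own IP pool is useful in many ways; for example, it can get around IP bans while crawling. Admittedly, I have never crawled a site with serious anti-crawling measures and have not been IP-banned yet, but I still wanted a pool of my own.
The big drawback of free IPs is instability: many of them simply do not work, so if you need a lot of them, a paid service is the better choice. For my pool, free ones are enough.
This post covers crawling free IPs and saving them locally, a simple validity check, and how to use a proxy IP to access web pages.
The complete source code is on my GitHub: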
GitHub - free-proxy-crawling: self-made ip pool stored in SQLite3, crawling free proxies from websites that offer them.
Crawling free IPs
Grabbing IPs from sites that list free proxies is done with the most basic kind of Python crawler. Three sites are crawled:
- http://www.66ip.cn/areaindex_1/1.html
- http://proxylist.fatezero.org/
- https://www.xicidaili.com/nn/
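With these three sites you could already write the crawler yourself. It is all basic crawler code with nothing technically difficult, so here is the code directly, with comments in a few places. The code for each site lives in its own function. For 66ip.cn you first need to copy a cookie from the browser; otherwise the site returns status code 521, an anti-crawling measure, and copying the cookie is the simplest workaround.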
import json
import pickle
import webbrowser

import requests
from bs4 import BeautifulSoup

# Collected "ip:port" strings; using a set removes duplicates across sites.
temp_set = set()
def get_xici():
    print("getting ip from xicidaili.com...")
    headers_xici = {
        "Host": "www.xicidaili.com",
        "Referer": "https://www.xicidaili.com/nn/1",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36",
    }
    # Only crawl the first 3 pages of xici: later entries were verified too long ago
    # and are likely dead. PS: this site bans IPs...
    # One session for all pages, so cookies from the first visit are reused.
    ses = requests.session()
    ses.get("https://www.xicidaili.com/nn/1", headers=headers_xici)
    for page in range(1, 4):
        xici_url = "https://www.xicidaili.com/nn/{}".format(page)
        xici_req = ses.get(xici_url, headers=headers_xici)
        print(xici_req.status_code)
        if xici_req.status_code == 200:
            soup = BeautifulSoup(xici_req.text, 'html.parser')
            ip_table = soup.find('table', attrs={'id': 'ip_list'})
            trs = ip_table.find_all('tr')
            # Skip the header row.
            for tr in trs[1:]:
                td = tr.find_all('td')
                ip_port = td[1].string + ":" + td[2].string
                print(ip_port)
                temp_set.add(ip_port)
def get_66ip():
    print("getting ip from 66ip.cn...")
    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate",
        "Accept-Language": "en-US,en;q=0.9",
        "Cache-Control": "max-age=0",
        "Connection": "keep-alive",
        #"Cookie": "",
        "DNT": "1",
        "Host": "www.66ip.cn",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36",
    }
    # Open the site in a browser so a valid cookie can be copied from it.
    webbrowser.open("http://www.66ip.cn/areaindex_1/1.html")
    cookie = input("input a valid cookie for 66ip.cn first:")
    headers["Cookie"] = cookie
    ses = requests.session()
    for area in range(1, 27):
        # Only the first page of each region holds recently verified IPs.
        area_url = "http://www.66ip.cn/areaindex_{}/1.html".format(area)
        addr = ses.get(area_url, headers=headers)
        if addr.status_code == 200:
            soup = BeautifulSoup(addr.content, 'html.parser')
            table = soup.find_all('table')[2]
            trs = table.find_all('tr')
            # Skip the header row.
            for tr in trs[1:]:
                td = tr.find_all('td')
                ip_port = td[0].string + ":" + td[1].string
                print(ip_port)
                temp_set.add(ip_port)
def get_freeproxylist():
    print("getting ip from freeproxylist...")
    fpl_url = "http://proxylist.fatezero.org/proxy.list"
    proxy_list = requests.get(fpl_url)
    if proxy_list.status_code == 200:
        # The file holds one JSON object per line.
        lines = proxy_list.text.split('\n')
        for i, line in enumerate(lines):
            try:
                content = json.loads(line)
            except ValueError:
                # Blank or malformed line; skip it.
                continue
            if str(content["anonymity"]) == "high_anonymous" and str(content["type"]) == "http":
                ip_port = str(content["host"]) + ":" + str(content["port"])
                # print(ip_port)
                temp_set.add(ip_port)
            if i % 1000 == 0:
                print("processed {} in free proxy list".format(i))
get_xici()
get_66ip()
get_freeproxylist()
# Save the collected set; the with block closes the file even if dump fails.
with open("pool.pkl", 'wb') as f:
    pickle.dump(temp_set, f)
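Only high-anonymity HTTP IPs are kept. Each one is first built into an ip:port string and added to the temp_set set, which is then saved locally with Python's built-in pickle library.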
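The GitHub description above mentions SQLite3 storage, while this post uses pickle. As a rough sketch of what the SQLite variant could look like (the table name `proxies`, the column name `ip_port`, and the helper names are my assumptions for illustration, not necessarily what the repository does):

```python
import sqlite3


def open_pool(path="pool.db"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    # PRIMARY KEY on ip_port gives the same de-duplication a set provides.
    conn.execute("CREATE TABLE IF NOT EXISTS proxies (ip_port TEXT PRIMARY KEY)")
    conn.commit()
    return conn


def save_pool(conn, ip_ports):
    """Insert ip:port strings, silently skipping ones already stored."""
    conn.executemany(
        "INSERT OR IGNORE INTO proxies (ip_port) VALUES (?)",
        [(ip_port,) for ip_port in ip_ports],
    )
    conn.commit()


def load_pool(conn):
    """Return every stored ip:port string as a set."""
    rows = conn.execute("SELECT ip_port FROM proxies").fetchall()
    return {row[0] for row in rows}


def remove_proxy(conn, ip_port):
    """Delete one proxy, e.g. after it fails validation."""
    conn.execute("DELETE FROM proxies WHERE ip_port = ?", (ip_port,))
    conn.commit()
```

Compared with pickle, this lets you add or delete a single proxy without rewriting the whole file; for a pool this small either approach should work.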
Validating the IPs
Many of the collected IPs are invalid, so we can run a checker that tests whether each IP can successfully access Baidu:
import pickle
import requests
import random
def GetUserAgent():
    '''
    Return a randomly chosen HTTP User-Agent string.
    '''
    user_agents = [
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
"Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
"Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
"Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
"Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
"Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
"Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
"Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)",
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 LBBROWSER",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)",
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; 360SE)",
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
"Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
"Mozilla/5.0 (iPad; U; CPU OS 4_2_1 like Mac OS X; zh-cn) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8C148 Safari/6533.18.5",
"Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0b13pre) Gecko/20110307 Firefox/4.0b13pre",
"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
"Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10"
    ]
    user_agent = random.choice(user_agents)
    return user_agent
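def test_proxy():
    test_url = "http://www.baidu.com/"
    # Iterate over a copy: removing items from a set while iterating over it
    # raises "RuntimeError: Set changed size during iteration".
    for ip_port in list(temp_set):
        user_agent = GetUserAgent()
        header = {
            "User-Agent": user_agent,
        }
        proxy = {
            'http': 'http://' + ip_port,
            # 'https': 'http://' + ip_port,
        }
        try:
            r = requests.get(test_url, headers=header, proxies=proxy, timeout=5)
            print(r.status_code)
            if r.status_code != 200:
                temp_set.discard(ip_port)
        except Exception:
            # Timeouts, refused connections, malformed entries: drop the proxy.
            temp_set.discard(ip_port)
            print("failed:{}".format(ip_port))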
# Load the saved pool, drop the dead proxies, and write the survivors back.
with open("pool.pkl", 'rb') as f:
    temp_set = pickle.load(f)
test_proxy()
with open("pool.pkl", 'wb') as f:
    pickle.dump(temp_set, f)
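The checker above tests the proxies one at a time, so with a 5-second timeout a large pool takes a long time. Below is a rough sketch of checking them in parallel with a thread pool. The function names `proxy_works` and `filter_working` and the default of 20 workers are my own assumptions, not something from the original script:

```python
from concurrent.futures import ThreadPoolExecutor


def proxy_works(ip_port, test_url="http://www.baidu.com/", timeout=5):
    """Return True if test_url answers with 200 through the given proxy."""
    # Imported here so filter_working can be used with another checker
    # even where requests is not installed.
    import requests

    proxies = {"http": "http://" + ip_port}
    try:
        r = requests.get(test_url, proxies=proxies, timeout=timeout)
        return r.status_code == 200
    except requests.RequestException:
        return False


def filter_working(ip_ports, check=proxy_works, max_workers=20):
    """Run check on every proxy concurrently and return the ones that pass."""
    candidates = list(ip_ports)
    if not candidates:
        return set()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps results in the same order as candidates.
        results = list(pool.map(check, candidates))
    return {ip_port for ip_port, ok in zip(candidates, results) if ok}
```

It returns a new set instead of modifying the pool in place, so it could be used as `temp_set = filter_working(temp_set)` between loading and saving pool.pkl.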
How to access web pages through a proxy IP
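One of the simplest uses of an IP pool is boosting page views: for example, the view count of a Jianshu article, or the daily share page of MaiMemo (墨墨背单词), where more views raise your word limit. Pretty tempting uses, right?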
There are two main ways to access a web page through a proxy IP. With the requests library:
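import requests

# ip_port is an "ip:port" string from the pool; url and header come from the caller.
proxy = {
    'http': 'http://' + ip_port,
    # 'https': 'http://' + ip_port,
}
r = requests.get(url, headers=header, proxies=proxy)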
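The keys of the proxies dict are matched against the scheme of the URL you request, so with only an 'http' key, https:// pages are fetched directly, without the proxy. Here is a small sketch of a helper that builds the dict for both schemes from an ip:port string. The name `build_proxies` is made up for this example, and whether a given free proxy can actually tunnel HTTPS traffic is not guaranteed:

```python
def build_proxies(ip_port):
    """Turn "ip:port" into a proxies dict usable with requests.get(..., proxies=...)."""
    host, sep, port = ip_port.rpartition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError("expected 'ip:port', got {!r}".format(ip_port))
    # The proxy itself is reached over plain HTTP in both cases; for https
    # URLs the client asks the proxy to open a tunnel to the target.
    proxy_url = "http://" + ip_port
    return {"http": proxy_url, "https": proxy_url}
```

Usage would look like `requests.get(url, headers=header, proxies=build_proxies(ip_port), timeout=5)`.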
You can also use the selenium library:
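from selenium import webdriver

# ip_port is an "ip:port" string from the pool; url is defined by the caller.
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--proxy-server=http://" + ip_port)
# Selenium 4 removed the chrome_options= keyword; pass options= instead.
driver = webdriver.Chrome(options=chrome_options)
driver.get(url)
driver.quit()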
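Both snippets use a single proxy, and as mentioned at the start, free proxies die often. A rough sketch of picking proxies at random from the pool and falling back to another one when a request fails is shown below. `fetch_with_rotation` and its `fetch` parameter are hypothetical names for this example; `fetch` is any function you supply that takes a URL and an ip:port string and raises an exception on failure:

```python
import random


def fetch_with_rotation(url, pool, fetch, max_tries=3):
    """Try up to max_tries distinct random proxies; return (ip_port, result)."""
    # sorted() turns the set into a list, which random.sample requires.
    candidates = random.sample(sorted(pool), min(max_tries, len(pool)))
    last_error = None
    for ip_port in candidates:
        try:
            return ip_port, fetch(url, ip_port)
        except Exception as exc:
            # This proxy failed; remember why and move on to the next one.
            last_error = exc
    raise RuntimeError(
        "no working proxy among {} tried".format(len(candidates))
    ) from last_error
```

With requests, `fetch` could be something like `lambda url, p: requests.get(url, proxies={"http": "http://" + p}, timeout=5)`.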