I. Preparation
1. Install BeautifulSoup4
The quickest way is to install it directly with pip:
pip install beautifulsoup4
2. BeautifulSoup4 basic tutorial
Link to the basic usage documentation:
https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/
3. Notes on commonly used methods
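A minimal sketch of the methods this post relies on (find/find_all, the class_ keyword argument, and .string), run against a small hand-written HTML fragment rather than the real site:
#coding:utf-8
from bs4 import BeautifulSoup

# A small hand-written fragment, used only to demonstrate the API
html = "<table class='table'><tr><td>hello</td><td>world</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag; find_all() returns a list of all matches
table = soup.find('table', class_='table')   # class_ avoids clashing with the Python keyword
tds = table.find_all('td')

# .string gives the text of a tag that contains only text
for td in tds:
    print(td.string)   # hello, then world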
II. Hands-on project practice
1. Practice site: http://www.chineseidcard.com/
2. Request the API and analyze the returned data
http://www.chineseidcard.com/?region=110101&birthday=19900307&sex=1&num=5&r=30
The data we want is the actual ID card numbers.
Analyzing the response shows that this key information is stored inside the following table tag (a standalone parsing sketch follows the table):
<table class="table" style="margin-bottom:0;">
<tbody>
<tr>
<th style="text-align:right;width:20%;vertical-align: middle;"></th>
<td style="vertical-align: middle;">110101199003072631</td>
</tr>
<tr>
<th style="text-align:right;width:20%;vertical-align: middle;"></th>
<td style="vertical-align: middle;">110101199003070492</td>
</tr>
<tr>
<th style="text-align:right;width:20%;vertical-align: middle;"></th>
<td style="vertical-align: middle;">110101199003075314</td>
</tr>
<tr>
<th style="text-align:right;width:20%;vertical-align: middle;"></th>
<td style="vertical-align: middle;">110101199003078398</td>
</tr>
<tr>
<th style="text-align:right;width:20%;vertical-align: middle;"></th>
<td style="vertical-align: middle;">110101199003071532</td>
</tr>
</tbody>
</table>
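Before writing the real request, the fragment above can be parsed on its own to confirm the lookup works; this is a minimal sketch with the HTML pasted into a string (shortened to two rows here):
#coding:utf-8
from bs4 import BeautifulSoup

# Shortened copy of the table fragment shown above
fragment = """
<table class="table" style="margin-bottom:0;">
<tbody>
<tr><th></th><td style="vertical-align: middle;">110101199003072631</td></tr>
<tr><th></th><td style="vertical-align: middle;">110101199003070492</td></tr>
</tbody>
</table>
"""

soup = BeautifulSoup(fragment, "html.parser")
for td in soup.find_all('td'):
    print(td.string)   # prints the two ID numbers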
3. First, simulate the request and get the data returned by the page
#coding:utf-8
from bs4 import BeautifulSoup
import requests
import json

def gethtml(IDnum):
    url = "http://www.chineseidcard.com/"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36",
        # Mark the request as an AJAX call (XMLHttpRequest)
        "X-Requested-With": "XMLHttpRequest"
    }
    params = {
        "region": "110101",      # administrative region code
        "birthday": "19900307",  # date of birth
        "sex": "1",
        "num": IDnum,            # how many numbers to generate
        "r": 30
    }
    res = requests.get(url, headers=headers, params=params)
    # The endpoint wraps the HTML fragment in a JSON string, so decode it first
    # (json.loads no longer accepts an encoding argument on modern Python)
    data = json.loads(res.text)
    return data
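With the return data line added above, a quick call makes it easy to confirm what the endpoint actually sends back before any parsing is attempted (a sketch, assuming the response decodes to a string of HTML as the later steps expect):
if __name__ == "__main__":
    data = gethtml(5)
    print(type(data))                                     # str if the JSON wraps an HTML fragment
    print(data[:300] if isinstance(data, str) else data)  # preview the start of the payload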
4. Use BeautifulSoup4 to find the tags
#coding:utf-8
from bs4 import BeautifulSoup
import requests
import json

def gethtml(IDnum):
    url = "http://www.chineseidcard.com/"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36",
        "X-Requested-With": "XMLHttpRequest"
    }
    params = {
        "region": "110101",
        "birthday": "19900307",
        "sex": "1",
        "num": IDnum,
        "r": 30
    }
    res = requests.get(url, headers=headers, params=params)
    data = json.loads(res.text)
    soup = BeautifulSoup(data, "html.parser")
    # Get the data inside the second table tag with class "table"
    table = soup.find_all('table', class_='table')[1]
    # Get a single ID number (the text of the first td in that table)
    cardID = table.find_all('td')[0].string
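As a side note, the same lookup can be written with CSS selectors via select(); this is only an alternative sketch, meant to run inside gethtml() after soup has been built (select() needs the soupsieve package, which recent bs4 releases install automatically). The rest of the post keeps using find_all():
    # Equivalent to soup.find_all('table', class_='table')[1]
    table = soup.select('table.table')[1]
    # Text of the first td in that table
    cardID = table.select('td')[0].string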
5. Iterate over the results and return all the ID card numbers
table = soup.find_all('table',class_='table')[1]
The [1] index is used here because, in the returned page, the ID card numbers are stored in the second table with class "table".
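A quick way to check that it really is the second table is to print a short preview of every matching table; these lines are meant to be dropped into gethtml() right after soup is built (a debugging sketch, not part of the final script):
    tables = soup.find_all('table', class_='table')
    print(len(tables))                               # how many matching tables the page contains
    for i, t in enumerate(tables):
        print(i, t.get_text(" ", strip=True)[:80])   # short text preview of each table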
#coding:utf-8
from bs4 import BeautifulSoup
import requests
import json

def gethtml(IDnum):
    url = "http://www.chineseidcard.com/"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36",
        "X-Requested-With": "XMLHttpRequest"
    }
    params = {
        "region": "110101",
        "birthday": "19900407",
        "sex": "1",
        "num": IDnum,
        "r": 30
    }
    res = requests.get(url, headers=headers, params=params)
    data = json.loads(res.text)
    soup = BeautifulSoup(data, "html.parser")
    # Get the data inside the second table tag
    table = soup.find_all('table', class_='table')[1]
    # Get a single ID number
    # cardID = table.find_all('td')[0].string
    # Iterate over every td node and print the text (the ID number) it contains
    for td_label in table.find_all('td'):
        cardID = td_label.string
        print(cardID)

if __name__ == "__main__":
    gethtml(5)
The output is as follows:
110101199004070873
110101199004077979
110101199004076853
110101199004079552
110101199004076634
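If the numbers need to be reused rather than just printed, the loop can collect them into a list and return it. A minimal sketch below; the function name get_id_list is made up for illustration and is not from the original post (the User-Agent is also shortened):
#coding:utf-8
from bs4 import BeautifulSoup
import requests
import json

def get_id_list(IDnum):
    # Same request as gethtml(), but the ID numbers are returned as a list
    url = "http://www.chineseidcard.com/"
    headers = {
        "User-Agent": "Mozilla/5.0",
        "X-Requested-With": "XMLHttpRequest"
    }
    params = {
        "region": "110101",
        "birthday": "19900407",
        "sex": "1",
        "num": IDnum,
        "r": 30
    }
    res = requests.get(url, headers=headers, params=params)
    data = json.loads(res.text)
    soup = BeautifulSoup(data, "html.parser")
    table = soup.find_all('table', class_='table')[1]
    # Collect every td's text instead of printing it one by one
    return [td.string for td in table.find_all('td')]

if __name__ == "__main__":
    print(get_id_list(5))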