1. Background
Last time I scraped the Ministry of Education's policy documents with Stata, which gave me a rough feel for how data crawling works and for regular expressions.
Cleaning the data in Stata, however, was genuinely laborious.
As it happens, I installed Stata 16 a while ago, and one of the most eye-catching new features (see the official "New in Stata 16") is that you can write Python statements inside Stata and call Python packages for data processing.
That gave me a strong push to learn Python crawling:
scrape the data with Python, then process and analyze it in Stata, and produce reports with Stata and Python together.
Exciting just to think about!
So, taking Lianjia's Beijing second-hand listings currently for sale as the example, I spent two days learning and crawled a total of 84,278 records.
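For context, here is a minimal sketch of what the Stata side of that workflow can look like once the data have been crawled. It is not part of this post's crawler: the variable name and values are placeholders, and the sfi calls should be double-checked against Stata's own help python documentation.

# Typed between `python:` and `end` at the Stata 16 prompt.
# The sfi (Stata Function Interface) module ships with Stata 16; the calls below
# are a sketch with placeholder values, not something used later in this post.
from sfi import Data

prices = [350.0, 420.5, 515.0]            # placeholder values; in practice, crawled data
Data.setObsTotal(len(prices))             # make room in the dataset in memory
Data.addVarDouble("total_price")          # create a numeric Stata variable
Data.store("total_price", None, prices)   # obs=None fills all observations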
2. Learning Process
Finding a tutorial: I read quite a few posts on crawling Lianjia data with Python. Some were too simple, others had no comments at all, so the reasoning was impossible to follow. In the end I settled on this one: python爬取链家网的房屋数据 (crawling housing data from Lianjia with Python).
Preparing the tools (all of this is easy to set up with a quick web search; a quick environment check follows after this list):
- Install Python 3;
- Install Sublime Text 3;
- Configure Sublime Text 3 so it can run Python;
- Install the Chrome browser and the XPath plugin;
Learning process:
- Type out the tutorial's code line by line;
- Work through it section by section, running and understanding each piece, and searching online whenever something is unclear;
- The tutorial crawls second-hand transaction records, while what I need is the listings currently for sale, so once I understood the code I adjusted it to fit that need.
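Before touching the crawler itself, it helps to confirm that the pieces above are in place. The quick check below is my own addition, not part of the tutorial; lxml is the only third-party package the crawler needs, so install it first (pip install lxml) if the import fails.

# Quick environment check: these are exactly the modules the crawler below imports.
# Only lxml is third-party (pip install lxml); the rest is the Python standard library.
import ssl
import re
import threading
import urllib.request
from lxml import etree

print("Python environment is ready for the crawler")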
After going through the steps above, I ended up with the code that crawls Lianjia's Beijing second-hand listings currently for sale. The code is as follows:
3. Code
As the tutorial explains, the crawling strategy is to first collect the link to every listing, and then extract the data from each listing's page.
This post starts with the code that collects the listing links:
# Fetch html: urllib.request.urlopen
import urllib.request
# For disabling certificate verification
import ssl
# Parse html with XPath
from lxml import etree
# Regular expressions: re.findall
import re
# Threads
import threading

# Globally disable certificate verification
ssl._create_default_https_context = ssl._create_unverified_context

# Fetch a page and return its html as text
def get_page(url):
    page = urllib.request.urlopen(url)
    html = page.read().decode('utf_8')
    return html

# Get the total number of result pages for a search url
def get_page_num(url):
    try:
        html = get_page(url)
        # keep only the part captured by the parentheses
        pagenum = re.findall(r'"totalPage":(.+?),"curPage"', html)[0]
    except:
        pagenum = 0
    pagenum = int(pagenum)
    return pagenum

# Get the urls of all listings on the current result page
def get_house_url_current_page(url):
    # flag=''
    list_house_url_current_page = []
    try:
        html = get_page(url)
        selector = etree.HTML(html)
        house_url_list_li = selector.xpath('/html/body/div[4]/div[1]/ul/li')  # adjusted from the tutorial
        for li in house_url_list_li:
            house_url = li.xpath('div[1]/div[1]/a/@href')[0]  # adjusted from the tutorial
            list_house_url_current_page.append(house_url)
    except:
        pass
    return list_house_url_current_page

# Get all listing urls for one district
def get_house_url_current_distric(district_url_list):
    list_house_url = []
    for district_url in district_url_list:
        pagenum = get_page_num(district_url)
        if pagenum == 0:
            # the first request failed; retry once and log it
            print('----------')
            pagenum = get_page_num(district_url)
            print(pagenum)
            print(district_url)
            print('++++++++++')
        elif pagenum == 1:
            print(pagenum)
            url = district_url
            print(url)
            list_house_url_current_page = get_house_url_current_page(url)
            list_house_url.append(list_house_url_current_page)
        elif pagenum > 1:
            for i in range(1, pagenum + 1):
                print(pagenum)
                # drop the trailing '/' before appending the page number
                url = district_url.strip('*/') + 'pg' + str(i)
                print(url)
                list_house_url_current_page = get_house_url_current_page(url)
                list_house_url.append(list_house_url_current_page)
        else:
            pass
    # Join all urls into one string so they can be written to a local file
    str_url = ''
    for row in list_house_url:
        for url in row:
            str_url += url + '\n'
    return str_url

# Write the urls to a local file
def write_house_url(write_str, district):
    path_file = 'C:/study/实战/python/data/zaishou/' + district + ".txt"  # adjusted from the tutorial
    with open(path_file, 'w') as file:
        file.write(write_str)

# Build the search-condition urls for every district
def get_search_url_all_district():
    district_url = ['https://bj.lianjia.com/ershoufang/dongcheng/',
                    'https://bj.lianjia.com/ershoufang/xicheng/',
                    'https://bj.lianjia.com/ershoufang/chaoyang/',
                    'https://bj.lianjia.com/ershoufang/haidian/',
                    'https://bj.lianjia.com/ershoufang/fengtai/',
                    'https://bj.lianjia.com/ershoufang/shijingshan/',
                    'https://bj.lianjia.com/ershoufang/tongzhou/',
                    'https://bj.lianjia.com/ershoufang/changping/',
                    'https://bj.lianjia.com/ershoufang/daxing/',
                    'https://bj.lianjia.com/ershoufang/yizhuangkaifaqu/',
                    'https://bj.lianjia.com/ershoufang/shunyi/',
                    'https://bj.lianjia.com/ershoufang/fangshan/',
                    'https://bj.lianjia.com/ershoufang/mentougou/',
                    'https://bj.lianjia.com/ershoufang/pinggu/',
                    'https://bj.lianjia.com/ershoufang/huairou/',
                    'https://bj.lianjia.com/ershoufang/miyun/',
                    'https://bj.lianjia.com/ershoufang/yanqing/']
    # Combined search filters
    # Area (a1-a8)
    search_area = ['a1', 'a2', 'a3', 'a4', 'a5', 'a6', 'a7', 'a8']
    # Price (p1-p8)
    search_price = ['p1', 'p2', 'p3', 'p4', 'p5', 'p6', 'p7', 'p8']
    # Combine area and price filters into search urls for each district
    search_url = []
    for url in district_url:
        url_list = []
        for area in search_area:
            for price in search_price:
                url_ = url + area + price + '/'
                url_list.append(url_)
        search_url.append(url_list)
    return search_url

def main(index):
    list_district = ['dongcheng', 'xicheng', 'chaoyang', 'haidian',
                     'fengtai', 'shijingshan', 'tongzhou', 'changping', 'daxing',
                     'yizhuangkaifaqu', 'shunyi', 'fangshan', 'mentougou',
                     'pinggu', 'huairou', 'miyun', 'yanqing']
    district = list_district[index]
    search_url = get_search_url_all_district()
    district_url_list = search_url[index]
    write_str = get_house_url_current_distric(district_url_list)
    write_house_url(write_str, district)

if __name__ == '__main__':
    # One thread per district (17 in total); adjust the number of threads as needed
    for index in range(0, 17):
        thread = threading.Thread(target=main, args=(index,))
        print(threading.active_count())
        thread.start()
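The per-district .txt files written by write_house_url (one listing url per line) are the input for the second stage, where each listing page is parsed for its attributes; that code will come in a later post. As a placeholder, here is a minimal sketch of how those files could be read back in, reusing get_page from the code above. parse_house is a hypothetical stub, and the directory path simply mirrors the one in write_house_url.

# A minimal sketch of the hand-off to the second stage, assuming the url files above exist.
# parse_house is a hypothetical placeholder; the real extraction (price, area, layout, ...)
# would use XPath on each listing page, much like get_house_url_current_page does.
def read_house_urls(district):
    path_file = 'C:/study/实战/python/data/zaishou/' + district + ".txt"
    with open(path_file, 'r') as file:
        return [line.strip() for line in file if line.strip()]

def parse_house(url):
    html = get_page(url)            # reuse the fetcher defined above
    # selector = etree.HTML(html)   # then pull the fields out with XPath
    return html

for url in read_house_urls('dongcheng')[:5]:   # try the first few links of one district
    print(url)
    parse_house(url)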