I've recently been planning a house-price analysis project, partly to refresh my earlier web-scraping skills and partly to put the Tableau charting techniques I've been learning into practice. This project is for learning and exchange only and has no commercial purpose.
To make sure the data genuinely reflects regional housing prices, this project scrapes second-hand listing information from Lianjia (链家网), starting with second-hand homes in Qingdao as the example.
Step 1: import the required libraries and modules. This project uses the urllib library and parses pages with XPath via lxml; since I prefer working with data as a DataFrame, pandas is imported as well.
import urllib.request
from lxml import etree
import pandas as pd
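Before the actual crawl, it may help to see how lxml's XPath behaves on a tiny example. The snippet below is a minimal sketch; the HTML string is invented purely for illustration and is not Lianjia's real markup. The key point is that xpath() always returns a list, and an absent node yields an empty list, which is why the scraping loop below falls back to a '.' placeholder whenever a field is missing.

from lxml import etree

# Toy HTML, invented for illustration only -- not the actual Lianjia page structure
sample = '<ul><li class="clear"><div class="title"><a>Example listing</a></div></li></ul>'
tree = etree.HTML(sample)

titles = tree.xpath('//li[@class="clear"]/div[@class="title"]/a/text()')
print(titles)    # ['Example listing'] -- xpath returns a list of text strings
missing = tree.xpath('//li[@class="clear"]/div[@class="price"]/span/text()')
print(missing)   # [] -- an absent node gives an empty list, hence the '.' fallback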
Step 2: to make the later DataFrame conversion go smoothly, the page-parsing code is written in a rather fine-grained way. If you are not used to working with DataFrames, you can use a different data structure instead (a dict-based alternative is sketched after the loop below).
house_info = []
for page in range(1, 101):
    # Listing pages follow the pattern .../ershoufang/pg<N>
    url = 'https://qd.lianjia.com/ershoufang/pg' + str(page)
    html = urllib.request.urlopen(url).read().decode('utf-8', 'ignore')
    selector = etree.HTML(html)
    # Each listing sits in an <li class="clear LOGCLICKDATA"> node
    page_info = selector.xpath('//li[@class="clear LOGCLICKDATA"]')
    print('Crawling page ' + str(page))
    for item in page_info:
        house_infor_one = []
        # Every field may be missing, so fall back to a '.' placeholder
        title = item.xpath('div[@class="info clear"]/div[@class="title"]/a/text()')
        house_infor_one.extend(title if title else ['.'])
        way = item.xpath('div[@class="info clear"]/div[@class="title"]/span/text()')
        house_infor_one.extend(way if way else ['.'])
        road = item.xpath('div[@class="info clear"]/div[@class="flood"]/div/a/text()')
        house_infor_one.extend(road if road else ['.'])
        community = item.xpath('div[@class="info clear"]/div[@class="address"]/div/a/text()')
        house_infor_one.extend(community if community else ['.'])
        house_des = item.xpath('div[@class="info clear"]/div[@class="address"]/div/text()')
        house_infor_one.extend(house_des if house_des else ['.'])
        floor = item.xpath('div[@class="info clear"]/div[@class="flood"]/div/text()')
        house_infor_one.extend(floor if floor else ['.'])
        popularity = item.xpath('div[@class="info clear"]/div[@class="followInfo"]/text()')
        house_infor_one.extend(popularity if popularity else ['.'])
        subway = item.xpath('div[@class="info clear"]/div[@class="tag"]/span[@class="subway"]/text()')
        house_infor_one.extend(subway if subway else ['.'])
        taxfree = item.xpath('div[@class="info clear"]/div[@class="tag"]/span[@class="taxfree"]/text()')
        house_infor_one.extend(taxfree if taxfree else ['.'])
        haskey = item.xpath('div[@class="info clear"]/div[@class="tag"]/span[@class="haskey"]/text()')
        house_infor_one.extend(haskey if haskey else ['.'])
        total_price = item.xpath('div[@class="info clear"]/div[@class="priceInfo"]/div[1]/span/text()')
        house_infor_one.extend(total_price if total_price else ['.'])
        price_unit = item.xpath('div[@class="info clear"]/div[@class="priceInfo"]/div[1]/text()')
        house_infor_one.extend(price_unit if price_unit else ['.'])
        per_price = item.xpath('div[@class="info clear"]/div[@class="priceInfo"]/div[2]/span/text()')
        house_infor_one.extend(per_price if per_price else ['.'])
        house_info.append(house_infor_one)
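As noted in Step 2, the positional list is only one option. Below is a minimal alternative sketch: it reuses the page_info node list from the loop above, shows just two of the fields, and the dict keys are names I made up for illustration. Building one dict per listing lets pd.DataFrame name the columns for you and keeps missing fields from shifting values into the wrong column.

# Alternative sketch: one dict per listing instead of a positional list.
# Only two fields shown; the remaining xpath expressions follow the same pattern.
records = []
for item in page_info:
    record = {
        'title': (item.xpath('div[@class="info clear"]/div[@class="title"]/a/text()') or ['.'])[0],
        'total_price': (item.xpath('div[@class="info clear"]/div[@class="priceInfo"]/div[1]/span/text()') or ['.'])[0],
    }
    records.append(record)
# pd.DataFrame(records) then builds the table with columns named by the dict keys.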
Step 3: convert the collected records into a DataFrame, name the columns, and save the result to a local file. With that, the scraping is done.
house_df = pd.DataFrame(house_info)
house_df.columns = ['listing title', 'listing source', 'address (road)', 'community', 'layout info', 'floor', 'popularity', 'subway access', 'deed status (tax)', 'viewing (key)', 'total price', 'total price unit', 'unit price (per m²)', 'notes']
house_df.to_excel('D:/Tsingtao.xls')
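One caveat about the last line: depending on your pandas version, writing a .xls file requires the xlwt package, and recent pandas releases have dropped .xls writing altogether. If that call fails for you, the fallback sketch below (file paths are just examples) saves the same DataFrame as CSV or .xlsx instead.

# Fallbacks if the .xls write fails on your pandas version (paths are only examples)
house_df.to_csv('D:/Tsingtao.csv', index=False, encoding='utf-8-sig')  # utf-8-sig keeps Chinese text readable when opened in Excel
house_df.to_excel('D:/Tsingtao.xlsx', index=False)                     # requires the openpyxl package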