Tags: Information Retrieval
1. Create a Scrapy project
scrapy startproject tutorial
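This command generates a project skeleton. A typical layout looks like the following (file names may vary slightly between Scrapy versions):

```text
tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # the project's Python module
        __init__.py
        items.py          # Item definitions go here
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # spiders live here
            __init__.py
```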
2. Define the Item to extract
import scrapy

class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()
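A scrapy.Item behaves like a dict, except that only declared Fields may be assigned. A minimal pure-Python sketch of that behavior, using a toy stand-in class rather than Scrapy's actual implementation:

```python
class MiniItem(dict):
    """Toy stand-in for scrapy.Item: only declared fields may be set."""
    fields = {'title', 'link', 'desc'}

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(f"{key} is not a declared field")
        super().__setitem__(key, value)

item = MiniItem()
item['title'] = 'Example Book'   # allowed: 'title' is a declared field
print(item['title'])             # → Example Book
try:
    item['price'] = 10           # 'price' was never declared
except KeyError as e:
    print('rejected:', e)
```

Assigning an undeclared key fails immediately, which catches typos in field names early instead of silently storing bad data.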
3. Write a spider to crawl the site and extract Items
3.1 Write the initial spider
import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    # Must match the host in start_urls, otherwise follow-up
    # requests are filtered by the offsite middleware.
    allowed_domains = ["dmoztools.net"]
    start_urls = [
        "http://dmoztools.net/Computers/Programming/Languages/Python/Books/",
        "http://dmoztools.net/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)
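In parse above, the filename is taken from the second-to-last path segment of the page URL. For the first start URL this yields "Books":

```python
url = "http://dmoztools.net/Computers/Programming/Languages/Python/Books/"
# Splitting on "/" leaves a trailing empty string because the URL
# ends with "/", so index -2 is the last real path segment.
filename = url.split("/")[-2]
print(filename)  # → Books
```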
3.2 Run the crawl
scrapy crawl dmoz
4. Store the extracted Items (the data)
4.1 Extract the data
4.2 Modify the spider to extract data
import scrapy
from tutorial.items import DmozItem

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    # Must match the host in start_urls (see section 3.1).
    allowed_domains = ["dmoztools.net"]
    start_urls = [
        "http://dmoztools.net/Computers/Programming/Languages/Python/Books/",
        "http://dmoztools.net/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item
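The XPath expressions above pull the link text, the href attribute, and the surrounding description out of each list entry. The same idea can be sketched with the standard library's ElementTree on a toy HTML fragment; this is a simplified stand-in for Scrapy's selectors, not the actual Scrapy API, and the fragment is hypothetical:

```python
import xml.etree.ElementTree as ET

# Toy fragment shaped like the listing pages the spider parses.
html = """
<ul>
  <li><a href="book1.html">First Book</a> - a short description</li>
  <li><a href="book2.html">Second Book</a> - another description</li>
</ul>
"""

items = []
root = ET.fromstring(html)
for li in root.findall('li'):           # analogous to response.xpath('//ul/li')
    a = li.find('a')
    items.append({
        'title': a.text,                # sel.xpath('a/text()')
        'link': a.get('href'),          # sel.xpath('a/@href')
        'desc': (a.tail or '').strip()  # the text after </a>, like sel.xpath('text()')
    })
print(items)
```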
4.3 Save the crawled data
scrapy crawl dmoz -o items.json
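With -o items.json, Scrapy serializes every yielded item into a JSON array. Because the spider stores the result of extract(), each field value is a list. A hypothetical excerpt of such a file, and how to read it back (actual content depends on the pages crawled):

```python
import json

# Hypothetical sample of what items.json might contain.
sample = '''[
  {"title": ["First Book"], "link": ["book1.html"], "desc": [" - a short description "]}
]'''

items = json.loads(sample)
first = items[0]
print(first["title"][0])  # → First Book
```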
Further reading:
Scrapy official documentation