66.3 - Douban Book Crawler with a Proxy

When having something already means losing it, be brave enough to let it go!

Summary:

  1. Link extractor: a Rule ties a LinkExtractor to a callback. Extracted links are wrapped into Requests; once downloaded, the resulting response is handed to the callback, and whether links should be extracted again from that response is controlled by follow:
    Rule(LinkExtractor(allow=r'Items/'),  # what to extract
         callback='parse_item',           # callback run on each matched response
         follow=True)                     # whether to keep following links

Scrapy hands-on case: crawling Douban Books

Scrapy's CrawlSpider solves the problem of crawling across paginated URLs.

Requirement: crawl Douban Books and extract the follow-up links, adding them to the crawl queue.

1. Link analysis

https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start=0&type=T
https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start=20&type=T

Only the value of the start parameter changes from page to page.
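A quick sketch (not part of the project) of what that pattern implies: the listing shows 20 books per page, so the nth page is reached with start = n * 20.

# hypothetical snippet, only to illustrate the URL pattern observed above
base = 'https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start={}&type=T'

for page in range(3):                 # first three pages
    print(base.format(page * 20))
# https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start=0&type=T
# https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start=20&type=T
# https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start=40&type=T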

2. Project development

2.1 Create the project and configure settings.py
# create multiple Scrapy projects under the same directory (note the trailing dot)
(F:\Pyenv\conda3.8) F:\Projects\spider>scrapy startproject firstpro .

(F:\Pyenv\conda3.8) F:\Projects\spider>scrapy startproject firstpro
New Scrapy project 'firstpro', using template directory 'F:\Pyenv\conda3.8\lib\site-packages\scrapy\templates\project', created in:
    F:\Projects\spider\firstpro

You can start your first spider with:
    cd firstpro
    scrapy genspider example example.com
# settings.py
from fake_useragent import UserAgent

BOT_NAME = 'mspider'

SPIDER_MODULES = ['mspider.spiders']
NEWSPIDER_MODULE = 'mspider.spiders'

USER_AGENT = UserAgent().random
ROBOTSTXT_OBEY = False
COOKIES_ENABLED = False   
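Note that USER_AGENT above is evaluated only once, when settings.py is loaded, so every request in a run carries the same string. If a fresh User-Agent per request is wanted, a small downloader middleware can do it; the sketch below is an optional addition (not part of the original project) and assumes fake_useragent is installed.

# middlewares.py (optional sketch): pick a random User-Agent for every request
from fake_useragent import UserAgent

class RandomUserAgentDownloaderMiddleware:
    def __init__(self):
        self.ua = UserAgent()

    def process_request(self, request, spider):
        # overwrite the header on each outgoing request
        request.headers['User-Agent'] = self.ua.random
        return None   # let the request continue through the chain

It would be enabled in DOWNLOADER_MIDDLEWARES, just like the proxy middleware later in this post.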
(Once the spider is generated in 2.3, a book module appears under mspider/spiders.)
2.2 Write the Item
import scrapy

class BookItem(scrapy.Item):    # named BookItem because the spider imports it as BookItem
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    rate = scrapy.Field()

2.3 Create the spider

scrapy genspider -t template <name> <domain>

Templates
The -t option creates the spider class from a template; the commonly used templates are basic and crawl.

The Rule class used with link extractors
class scrapy.spiders.Rule(link_extractor, callback = None, cb_kwargs = None, follow = None, process_links = None, process_request = None)


link_extractor: a LinkExtractor object that defines which links to extract.
callback: the callback that parses the downloaded content, i.e. which function should run for URLs matching this rule. Because CrawlSpider uses parse internally, do not name your own callback parse.
follow: whether links extracted from the response by this rule should themselves be followed (extracted again).
process_links: the links obtained from link_extractor are passed through this function, which can filter out links you do not want to crawl (see the sketch after this list).
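A minimal sketch of process_links (hypothetical, not used in this project): the function receives the Link objects produced by the LinkExtractor and returns only those that should be turned into requests.

import re
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

def keep_early_pages(links):
    # example filter: only keep pagination links with start < 100
    return [link for link in links
            if int(re.search(r'start=(\d+)', link.url).group(1)) < 100]

rule = Rule(
    LinkExtractor(allow=r'start=\d+'),
    callback='parse_item',
    follow=True,
    process_links=keep_early_pages,   # filtering happens before requests are built
)

In a real spider this Rule would sit inside the rules tuple, and process_links may also be given as the name of a spider method instead of a callable.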

# create a spider from the crawl template and point it at the domain
(blog) F:\Projects\scrapy\mspider>scrapy genspider -t crawl book douban.com
Created spider 'book' using template 'crawl' in module:
  mspider.spiders.book

# book.py
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class BookSpider(CrawlSpider):
    name = 'book'
    allowed_domains = ['douban.com']
    start_urls = ['https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start=0&type=T']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        i = {}
        #i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        #i['name'] = response.xpath('//div[@id="name"]').extract()
        #i['description'] = response.xpath('//div[@id="description"]').extract()
        return i

scrapy.spiders.crawl.CrawlSpider is a subclass of scrapy.spiders.Spider with added functionality; inside it you can use LinkExtractor and Rule.

Defining the rules

  1. The rules tuple defines one or more Rule objects, which make it easy to follow links.
  2. LinkExtractor extracts links from the response.
    allow takes a regular expression (or an iterable of them) describing which links to match; only <a> tags are considered.
  3. callback is the function executed for each matched link's response; again, do not name it parse. It should return a list of Item or Request objects.
    See scrapy.spiders.crawl.CrawlSpider#_parse_response.
  4. follow: whether to keep following links.

This gives the following rule for our example:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.http.response.html import HtmlResponse

class BookSpider(CrawlSpider):
    name = 'book1'
    allowed_domains = ['douban.com']
    start_urls = ['https://book.douban.com/tag/%E7%BC%96%E7%A8%8B']

    rules = (
        Rule(LinkExtractor(allow=r'start=\d+'), callback='parse_item', follow=False),
    )
    # rule = ()

    def parse_item(self, response:HtmlResponse):
        print(response.url)
        print('-'*30)
        i = {}

        #i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        #i['name'] = response.xpath('//div[@id="name"]').extract()
        #i['description'] = response.xpath('//div[@id="description"]').extract()
        return i
#-------------------------------------------------------------------------------------------------
F:\Projects\scrapy\mspider>scrapy crawl book1 --nolog
https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start=60&type=T
------------------------------
https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start=40&type=T
------------------------------
https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start=20&type=T
------------------------------
https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start=1440&type=T
------------------------------
https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start=1420&type=T
------------------------------
https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start=160&type=T
------------------------------
https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start=140&type=T
------------------------------
https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start=120&type=T
------------------------------
https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start=100&type=T
------------------------------
https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start=80&type=T
------------------------------

You can see that what gets extracted are the page-number links displayed on the page.


[Screenshot: the pagination links on the listing page]

Building on that rule, the spider is extended to actually extract the data:

(blog) F:\Projects\scrapy\mspider>scrapy crawl book1 --nolog

# book.py
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.http.response.html import HtmlResponse
from ..items import BookItem

class BookSpider(CrawlSpider):
    name = 'book1'
    allowed_domains = ['douban.com']
    start_urls = ['https://book.douban.com/tag/%E7%BC%96%E7%A8%8B']

    custom_settings = {
        'filename':'./book2.json'
    }

    rules = (
        Rule(LinkExtractor(allow=r'start=\d+'), callback='parse_item', follow=False),
    )

    def parse_item(self, response:HtmlResponse):
        print(response.url)

        subjects = response.xpath('//li[@class="subject-item"]')

        for subject in subjects:
            title = "".join((x.strip() for x in subject.xpath('.//h2/a//text()').extract()))
            rate = subject.css('span.rating_nums').xpath('./text()').extract()  # extract_first()/extract()[0]

            item = BookItem()
            item['title'] = title
            item['rate'] = rate[0] if rate else '0'

            yield item
# -----------------------------------------------------
<BookItem {'title': '图解HTTP', 'rate': '8.1'}> --------------
<BookItem {'title': 'SQL必知必会: (第4版)', 'rate': '8.5'}> --------------
==========160


# settings.py
from fake_useragent import UserAgent

BOT_NAME = 'mspider'

SPIDER_MODULES = ['mspider.spiders']
NEWSPIDER_MODULE = 'mspider.spiders'

USER_AGENT = UserAgent().random
ROBOTSTXT_OBEY = False
CONCURRENT_REQUESTS = 4      # keep concurrency low
DOWNLOAD_DELAY = 1     # delay each download by 1 second
COOKIES_ENABLED = False   

ITEM_PIPELINES = {
   'mspider.pipelines.MspiderPipeline': 300,
}


# pipelines.py

import simplejson
from scrapy import Spider

class MspiderPipeline(object):
    def __init__(self):   # runs on instantiation; not strictly needed here
        print('~~~~~init~~~~~')

    def open_spider(self, spider:Spider):
        self.count = 0
        print(spider.name)
        filename = spider.settings['filename']
        self.file = open(filename, 'w', encoding='utf-8')
        self.file.write('[\n')
        self.file.flush()

    def process_item(self, item, spider):
        print(item, '--------------')
        self.count += 1
        self.file.write(simplejson.dumps(dict(item)) + ',\n')   # convert the Item to a dict; note this leaves a trailing comma after the last item

        return item

    def close_spider(self, spider):
        print('=========={}'.format(self.count))
        self.file.write(']')
        self.file.close()
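Because of that trailing comma, the output file is not strict JSON. As a side note, Scrapy ships a JsonItemExporter that manages the brackets and commas itself; a minimal alternative sketch (reusing the same custom 'filename' setting, not the version used in this post) could look like this:

# pipelines.py (alternative sketch)
from scrapy.exporters import JsonItemExporter

class JsonExportPipeline:
    def open_spider(self, spider):
        self.file = open(spider.settings['filename'], 'wb')   # the exporter writes bytes
        self.exporter = JsonItemExporter(self.file)
        self.exporter.start_exporting()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()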

The crawler first fetches start_urls. Following the Rule, it analyses each page, extracts the matching links and issues requests for them; when any of those requests comes back, the callback parse_item runs. Inside the callback, response is the HTML returned for the extracted link, and it can be analysed directly with XPath or CSS selectors.
follow decides whether links are extracted again from the responses handled by the callback.

1. Crawling

Change follow=True: the link extractor will then pull every page-number link from each crawled page (pages 1-99). Scrapy's built-in duplicate-request filter ensures each page URL is scheduled only once, so this does not loop forever.

2.4 Proxies (dealing with anti-crawling)

While crawling, Douban's anti-crawling measures kick in and requests may start to fail.

This effectively bans the IP (logging in would only get the account banned as well), so the fix is to crawl through a proxy.

Idea: before an HTTP request is sent it passes through the downloader middlewares, so we write a custom downloader middleware that picks a proxy address on the fly and only then lets the request go out.

Proxy test

   'mspider.middlewares.ProxyDownloaderMiddleware': 150,
   'mspider.middlewares.After': 600,

If ProxyDownloaderMiddleware in middlewares.py returns a Response from process_request, the remaining process_request methods (here the class After) and the process_exception chain are skipped: the fabricated response is handed straight back without the request being downloaded.

Create test.py:

(F:\Pyenv\conda3.8) F:\Projects\spider>scrapy genspider -t basic test httpbin.org
Created spider 'test' using template 'basic' in module:
  mspider.spiders.test

(F:\Pyenv\conda3.8) F:\Projects\spider>scrapy crawl test
#----------------------------------------------------------------------------------
 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://myip.ipip.net/> (referer: None)
http://www.magedu.com/user?id=1000 ++++++++++++++



# test.py
import scrapy

class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['ipip.net']
    start_urls = ['http://myip.ipip.net/']

    def parse(self, response):
        print(response.url,'++++++++++++++')


# middlewares.py
from scrapy import signals   # needed by the from_crawler hooks below
from scrapy.http.response.html import HtmlResponse

class MspiderSpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, or item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Request or item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn’t have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

class MspiderDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

# the part we added
class ProxyDownloaderMiddleware:

    proxies = []

    def process_request(self, request, spider):

        return HtmlResponse('http://www.magedu.com/user?id=1000')   # fabricate a response for this URL instead of downloading anything

# see how the fabricated response is handled
class After(object):
    def process_request(self, request, spider):
        print('After ~~~~~')

        return None


# settings.py: add ProxyDownloaderMiddleware and mspider.middlewares.After to DOWNLOADER_MIDDLEWARES;
# temporarily disable the item pipeline
DOWNLOADER_MIDDLEWARES = {
   # 'mspider.middlewares.MspiderDownloaderMiddleware': 543,
   'mspider.middlewares.ProxyDownloaderMiddleware':150,
   'mspider.middlewares.After':600,
}

# ITEM_PIPELINES = {
#    'mspider.pipelines.MspiderPipeline': 300,
# }
  1. Downloader middleware
    Modelled on the downloader middleware in middlewares.py: implement process_request and return None so the request keeps moving through the chain;
    see https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
from scrapy import signals
from scrapy.http.response.html import HtmlResponse
from scrapy.http.request import Request

import random
class ProxyDownloaderMiddleware:

    proxy_ip = '117.44.10.234'
    proxy_port = 36410
    proxies = [
        'http://{}:{}'.format(proxy_ip, proxy_port)
    ]

    def process_request(self, request:Request, spider):

        proxy = random.choice(self.proxies)
        request.meta['proxy'] = proxy   # route this request through the proxy
        print(request.url,request.meta['proxy'])
        # return None
#----------------------------------------------------
(F:\Pyenv\conda3.8) F:\Projects\spider>scrapy crawl test --nolog
http://myip.ipip.net/ http://223.215.13.40:894
当前 IP:223.215.13.40  来自于:中国 安徽 芜湖  电信  (非本机IP)
 ++++++++++++++

2. Configuration

In settings.py:

DOWNLOADER_MIDDLEWARES = {
   # 'mspider.middlewares.MspiderDownloaderMiddleware': 543,
   'mspider.middlewares.ProxyDownloaderMiddleware':150,    # a fairly high priority (low number) so it runs early
}
The proxy IP is working.
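Hard-coding one proxy in the middleware class is fine for a demo. A possible refinement (only a sketch, assuming a custom PROXY_LIST setting that does not exist in the original project) is to read the pool from settings.py and rotate over it:

# middlewares.py (sketch): rotate over a proxy pool defined in settings.py
import random

class ProxyPoolDownloaderMiddleware:
    def __init__(self, proxies):
        self.proxies = proxies

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_LIST is an assumed custom setting,
        # e.g. PROXY_LIST = ['http://117.44.10.234:36410', ...]
        return cls(crawler.settings.getlist('PROXY_LIST'))

    def process_request(self, request, spider):
        if self.proxies:
            request.meta['proxy'] = random.choice(self.proxies)
        return None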

Crawl 50 pages of Douban Books

# book.py        follow=True
    rules = (
        Rule(LinkExtractor(allow=r'start=\d+'), callback='parse_item', follow=True),
    )

# settings
ITEM_PIPELINES = {         # re-enable the pipeline
   'mspider.pipelines.MspiderPipeline': 300,
}

# CONCURRENT_REQUESTS = 4
# DOWNLOAD_DELAY = 1

(F:\Pyenv\conda3.8) F:\Projects\spider>scrapy crawl book1 --nolog

{'rate': '8.8', 'title': '黑客与画家: 来自计算机时代的高见'} ~----~~~~~~~~~
https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start=1060&type=T http://223.215.13.40:894
https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start=1040&type=T
https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start=1060&type=T
==========1000

Exactly 50 pages at 20 books per page: 1000 records crawled successfully.

Reference:

# book.py (final version)
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.http.response.html import HtmlResponse
from ..items import BookItem

class BookSpider(CrawlSpider):
    name = 'book1'
    allowed_domains = ['douban.com']
    start_urls = ['https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start=0&type=T']

    rules = (
        Rule(LinkExtractor(allow=r'start=\d+'), callback='parse_item', follow=True),
    )

    custom_settings = {
        'filename':'./books2.json'
    }

    def parse_item(self, response:HtmlResponse):
        print(response.url)

        subjects = response.xpath('//li[@class="subject-item"]')

        for subject in subjects:
            title = "".join((x.strip() for x in subject.xpath('.//h2/a//text()').extract()))
            rate = subject.css('span.rating_nums').xpath('./text()').extract()  # extract_first()/extract()[0]

            item = BookItem()
            item['title'] = title
            item['rate'] = rate[0] if rate else '0'
            print(item)

            yield item


