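Scraping the Douban Movie Top 250 with Scrapy + MySQL

Honestly, I don't know why, but ask anyone who has written a crawler, whether as a hobby or as a complete beginner, and they will have picked Douban as their practice target. It has reached the point where you can hardly claim to have done any scraping unless you have scraped Douban. Anyway, on to the topic.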

1. System environment

Python version: 2.7.12 (64-bit)
Scrapy version: 1.4.0
MySQL version: 5.6.35 (64-bit)
OS version: Win10 (64-bit)
MySQLdb version: MySQL-python-1.2.3.win-amd64-py2.7 (64-bit)
IDE: PyCharm 2016.3.3 (64-bit)

2. Installing the MySQL database

Official download page: http://www.mysql.com/downloads/
You may also want a graphical client. I use Navicat-for-MySQL 11.0.9; download page: http://www.formysql.com/xiazai_mysql.html

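2.1 Installing MySQLdb

OK, at this point MySQL itself should be installed successfully. Next you need to install MySQLdb.

2.2 What is MySQLdb?

MySQLdb is the interface Python uses to connect to a MySQL database. It implements the Python Database API Specification v2.0 and is built on top of the MySQL C API. Put simply, it plays a role similar to JDBC in Java.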
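
To make the DB-API 2.0 idea concrete, here is a minimal sketch of the connect / cursor / execute / commit / close sequence. It uses the standard-library sqlite3 module as a stand-in, purely so that it runs without a MySQL server; the table name demo_movie and the inserted row are made up for illustration. With MySQLdb the sequence of calls is the same, but the placeholder is %s instead of ?.

```python
import sqlite3

# sqlite3 stands in for MySQLdb here: both follow DB-API 2.0, so the
# connect / cursor / execute / commit / close sequence is the same.
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()

# demo_movie and its single row are made-up illustration data.
cursor.execute("CREATE TABLE demo_movie (title TEXT, score REAL)")
# sqlite3 uses "?" placeholders; MySQLdb uses "%s" instead.
cursor.execute("INSERT INTO demo_movie (title, score) VALUES (?, ?)",
               ("Example Movie", 9.0))
conn.commit()

cursor.execute("SELECT title, score FROM demo_movie")
rows = cursor.fetchall()

cursor.close()
conn.close()
print(rows)
```

Passing the values as a separate tuple, rather than formatting them into the SQL string, lets the driver handle quoting and escaping.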

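2.3 How do I install MySQLdb?

You currently have two options:

1. Install a precompiled build (strongly recommended)

2. Download the source from the official site and compile it yourself (whether this works really comes down to luck; if you enjoy tinkering, feel free to give it a try. It is not covered here, so please search for instructions yourself)

OK, we'll go with the first option. Download page: http://www.codegood.com/downloads. Pick the build that matches your system, double-click the installer once the download finishes, change the install path if you like, and click Next all the way through.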

image.png

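2.4 Verifying that MySQLdb was installed

Open cmd, type python, then type import MySQLdb and check whether an error is raised. If there is no error, MySQLdb was installed successfully!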
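
The same check can be scripted. The sketch below is only an illustration; is_installed is a hypothetical helper name, not part of MySQLdb or Scrapy.

```python
import importlib


def is_installed(module_name):
    """Return True if the module can be imported, False otherwise."""
    try:
        importlib.import_module(module_name)
        return True
    except ImportError:
        return False


print("MySQLdb installed: %s" % is_installed("MySQLdb"))
```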

image.png

2.5 How to use MySQLdb

Please refer to the Runoob tutorial: http://www.runoob.com/python/python-mysql.html

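2.6 Getting familiar with XPath

When scraping web pages, the most common task is extracting data from the HTML source. Several existing libraries can do this.

BeautifulSoup: a very popular web-page parsing library among programmers. It builds a Python object based on the structure of the HTML code and handles bad markup quite sensibly, but it has one drawback: it is slow.

lxml: an XML parsing library (it can also parse HTML) with a Pythonic API based on ElementTree. (lxml is not part of the Python standard library.)

XPath: the XML Path Language, a language for addressing parts of an XML document (XML being a subset of the Standard Generalized Markup Language). XPath is based on the tree structure of XML, which has different types of nodes, including element nodes, attribute nodes, and text nodes, and it provides the ability to locate nodes in that tree.

Scrapy has its own mechanism for extracting data. These are called selectors, because they "select" certain parts of an HTML document using XPath or CSS expressions.

For XPath usage, see the official specification: https://www.w3.org/TR/xpath/
or the Chinese tutorial: http://www.w3school.com.cn/xpath/index.asp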

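OK, with this groundwork in place, we can start writing the crawler. We'll use the Douban Movie Top 250 as the example: https://movie.douban.com/top250

3. Writing the crawler

First, open this address in Chrome or Firefox and let's analyze the page's HTML structure together; press F12 to open the developer tools and inspect the page source. From the analysis we can see that all the information we want to extract is wrapped in the ol whose class attribute is grid_view, so the parsing scope is basically settled: treat this ol element as the outer boundary and locate everything else inside it.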
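
To illustrate the "outer ol first, then look inside each li" idea without needing Scrapy or a network connection, here is a rough sketch using the standard library's ElementTree, which supports only a small subset of XPath. The SAMPLE markup is a made-up, heavily simplified stand-in for the real page; Scrapy selectors accept full XPath expressions, but the nesting logic is the same.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for the Top 250 page markup.
SAMPLE = """
<div>
  <ol class="grid_view">
    <li><div class="pic"><em>1</em></div>
        <div class="hd"><a href="#"><span>Movie A</span></a></div></li>
    <li><div class="pic"><em>2</em></div>
        <div class="hd"><a href="#"><span>Movie B</span></a></div></li>
  </ol>
</div>
"""

root = ET.fromstring(SAMPLE.strip())
movies = []
# Step 1: restrict the search to the <li> children of ol.grid_view.
for li in root.findall(".//ol[@class='grid_view']/li"):
    # Step 2: use paths relative to the current <li>.
    rank = li.find(".//div[@class='pic']/em").text
    title = li.find(".//div[@class='hd']/a/span").text
    movies.append((rank, title))

print(movies)
```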

image.png

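I won't ramble on about the details here; let's go straight to the code:
The complete code has been uploaded to GitHub at git@github.com:hu1991die/douan_movie_spider.git. Forks and clones are welcome!
(1) DoubanMovieTop250Spider.py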

# encoding: utf-8
'''
@author: feizi
@file: DoubanMovieTop250Spider.py
@Software: PyCharm
@desc:
'''
import re

from scrapy import Request
from scrapy.spiders import Spider
from douan_movie_spider.items import DouanMovieItem

class DoubanMovieTop250Spider(Spider):
    name = 'douban_movie_top250'

    def start_requests(self):
        url = 'https://movie.douban.com/top250'
        yield Request(url)

    def parse(self, response):
        movieList = response.xpath('//ol[@class="grid_view"]/li')
        for movie in movieList:
            # Create a fresh item for every movie so yielded items do not share state
            item = DouanMovieItem()
            # Rank
            rank = movie.xpath('.//div[@class="pic"]/em/text()').extract_first()
            # Cover image
            cover = movie.xpath('.//div[@class="pic"]/a/img/@src').extract_first()
            # Title
            title = movie.xpath('.//div[@class="hd"]/a/span[1]/text()').extract_first()
            # Score
            score = movie.xpath('.//div[@class="star"]/span[@class="rating_num"]/text()').extract_first()
            # Number of ratings (re_first returns None instead of raising when nothing matches)
            comment_num = movie.xpath('.//div[@class="star"]/span[4]/text()').re_first(r'(\d+)')
            # Memorable quote (may be None when a movie has none)
            quote = movie.xpath('.//p[@class="quote"]/span[@class="inq"]/text()').extract_first()
            # Release year, region, and genres; reset the defaults for every movie
            years, region, types = '', '', ''
            briefList = movie.xpath('.//div[@class="bd"]/p/text()').extract()
            if len(briefList) > 1:
                # Split on '/'
                briefs = re.split(r'/', briefList[1])
                # Genres
                types = re.compile(u'([\u4e00-\u9fa5].*)').findall(briefs[len(briefs) - 1])[0]
                # Region
                region = re.compile(u'([\u4e00-\u9fa5]+)').findall(briefs[len(briefs) - 2])[0]
                if len(briefs) <= 3:
                    # Release year
                    years = re.compile(r'(\d+)').findall(briefs[len(briefs) - 3])[0]
                else:
                    # Several release years: join them with "," (no trailing comma)
                    years = ','.join(
                        re.compile(r'(\d+)').findall(brief)[0]
                        for brief in briefs if hasNumber(brief))

                if types:
                    # Replace spaces with ","
                    types = types.replace(" ", ",")

            print(rank, cover, title, score, comment_num, quote, years, region, types)
            item['rank'] = rank
            item['cover'] = cover
            item['title'] = title
            item['score'] = score
            item['comment_num'] = comment_num
            item['quote'] = quote
            item['years'] = years
            item['region'] = region
            item['types'] = types
            yield item

        # Get the URL of the next page
        next_url = response.xpath('//span[@class="next"]/a/@href').extract_first()
        if next_url:
            next_url = 'https://movie.douban.com/top250' + next_url
            yield Request(next_url)

def hasNumber(text):
    # Return True if the text contains at least one digit
    return bool(re.search(r'\d+', text))
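
The trickiest part of the spider is the line that packs years, region, and genres into one "/"-separated string. As a rough, standalone sketch of that splitting logic (not the code used in the project), the hypothetical helper parse_brief below assumes the line looks like "year / region / genres" with the last two fields always present, and it returns lists instead of comma-joined strings.

```python
# -*- coding: utf-8 -*-
import re


def parse_brief(line):
    """Split a 'years / region / genres' line into its three parts.

    Returns None when the line has fewer than three '/'-separated fields.
    """
    parts = [part.strip() for part in line.split('/')]
    if len(parts) < 3:
        return None
    # The last field holds space-separated genres, the one before it regions.
    types = parts[-1].split()
    regions = parts[-2].split()
    # Everything before the region may contain one or more four-digit years.
    years = re.findall(r'\d{4}', ' '.join(parts[:-2]))
    return {'years': years, 'regions': regions, 'types': types}


# Sample input modelled on the page layout, including non-breaking spaces.
sample = u"\n    1994\xa0/\xa0美国\xa0/\xa0犯罪 剧情\n  "
print(parse_brief(sample))
```

Keeping this logic in a plain function makes it easy to try against a handful of sample lines before running the whole crawl.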

(2) items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy

# Movie item
class DouanMovieItem(scrapy.Item):
    # Rank
    rank = scrapy.Field()
    # Cover image
    cover = scrapy.Field()
    # Title
    title = scrapy.Field()
    # Score
    score = scrapy.Field()
    # Number of ratings
    comment_num = scrapy.Field()
    # Memorable quote
    quote = scrapy.Field()
    # Release year
    years = scrapy.Field()
    # Release region
    region = scrapy.Field()
    # Genres
    types = scrapy.Field()

(3) pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import MySQLdb
from scrapy.exceptions import DropItem

from douan_movie_spider.items import DouanMovieItem

# Get a database connection
def getDbConn():
    conn = MySQLdb.Connect(
        host='127.0.0.1',
        port=3306,
        user='root',
        passwd='123456',
        db='testdb',
        charset='utf8'
    )
    return conn

# Release database resources
def closeConn(cursor, conn):
    # Close the cursor
    if cursor:
        cursor.close()
    # Close the database connection
    if conn:
        conn.close()


class DouanMovieSpiderPipeline(object):
    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['title'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.ids_seen.add(item['title'])
            if item.__class__ == DouanMovieItem:
                self.insert(item)
        # Always return the item so later pipelines and feed exports still receive it
        return item

    def insert(self, item):
        # Initialise both names so the except/finally blocks stay safe
        # even if getDbConn() itself fails
        conn = None
        cursor = None
        try:
            # Get a database connection
            conn = getDbConn()
            # Get a cursor
            cursor = conn.cursor()
            # Insert the row
            sql = "INSERT INTO db_movie(rank, cover, title, score, comment_num, quote, years, region, types)VALUES(%s, %s, %s, %s, %s, %s, %s, %s, %s)"
            params = (item['rank'], item['cover'], item['title'], item['score'], item['comment_num'], item['quote'], item['years'], item['region'], item['types'])
            cursor.execute(sql, params)

            # Commit the transaction
            conn.commit()
        except Exception as e:
            # Roll back the transaction
            if conn:
                conn.rollback()
            print('except: %s' % e)
        finally:
            # Close the cursor and the database connection
            closeConn(cursor, conn)
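
The pipeline above opens and closes a connection for every item. One possible alternative, sketched here under assumptions, is to open a single connection in open_spider and close it in close_spider, the hooks Scrapy calls on a pipeline at the start and end of a crawl. SingleConnectionPipeline, its connect argument, and its placeholder argument are hypothetical names for this sketch, and SQLite in memory stands in for MySQL so the example runs anywhere.

```python
import sqlite3  # only used by the stand-in demo at the bottom


class SingleConnectionPipeline(object):
    """Hypothetical pipeline that keeps one connection for the whole crawl."""

    # Same column order as the INSERT statement in the article.
    columns = ('rank', 'cover', 'title', 'score', 'comment_num',
               'quote', 'years', 'region', 'types')

    def __init__(self, connect, placeholder='%s'):
        self.connect = connect          # zero-argument callable returning a DB-API connection
        self.placeholder = placeholder  # '%s' for MySQLdb, '?' for sqlite3
        self.conn = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.conn = self.connect()

    def close_spider(self, spider):
        # Called once when the spider finishes.
        if self.conn is not None:
            self.conn.close()
            self.conn = None

    def process_item(self, item, spider):
        marks = ', '.join([self.placeholder] * len(self.columns))
        sql = 'INSERT INTO db_movie (%s) VALUES (%s)' % (
            ', '.join(self.columns), marks)
        cursor = self.conn.cursor()
        try:
            cursor.execute(sql, tuple(item[c] for c in self.columns))
            self.conn.commit()
        except Exception:
            self.conn.rollback()
            raise
        finally:
            cursor.close()
        return item


# Demo: SQLite in memory stands in for the MySQL server; the row is made up.
pipeline = SingleConnectionPipeline(lambda: sqlite3.connect(':memory:'),
                                    placeholder='?')
pipeline.open_spider(None)
pipeline.conn.execute('CREATE TABLE db_movie (%s)' % ', '.join(pipeline.columns))
pipeline.process_item(dict.fromkeys(pipeline.columns, 'x'), None)
stored = pipeline.conn.execute('SELECT COUNT(*) FROM db_movie').fetchone()[0]
pipeline.close_spider(None)
print(stored)
```

In a real project the connect callable would still have to be wired in, for example through a from_crawler class method or a small subclass that passes getDbConn; that wiring is left out here.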

(4) main.py

# encoding: utf-8
'''
@author: feizi
@file: main.py
@Software: PyCharm
@desc:
'''

from scrapy import cmdline

name = "douban_movie_top250"
# cmd = "scrapy crawl {0} -o douban.csv".format(name)
cmd = "scrapy crawl {0}".format(name)
cmdline.execute(cmd.split())

(5) settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for douan_movie_spider project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'douan_movie_spider'

SPIDER_MODULES = ['douan_movie_spider.spiders']
NEWSPIDER_MODULE = 'douan_movie_spider.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3013.3 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'douan_movie_spider.middlewares.DouanMovieSpiderSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'douan_movie_spider.middlewares.MyCustomDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'douan_movie_spider.pipelines.DouanMovieSpiderPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

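One thing to note: to reduce the chance of the crawler being banned, we can set a User-Agent.
Press F12 again, look at the Request Headers, find the User-Agent value, and copy it into the settings file. Of course, this is only a simple approach; for more sophisticated strategies such as IP pools or User-Agent pools, please search Google yourself, as they are not covered here.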
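
For readers curious what a User-Agent pool might look like, here is a minimal sketch of a downloader middleware that picks a random User-Agent for each request. It is an illustration under assumptions, not part of the project: RandomUserAgentMiddleware and the USER_AGENT_POOL setting name are made up, and the class would still need to be registered in DOWNLOADER_MIDDLEWARES.

```python
import random


class RandomUserAgentMiddleware(object):
    """Hypothetical downloader middleware: one random User-Agent per request."""

    def __init__(self, user_agents):
        self.user_agents = list(user_agents)

    @classmethod
    def from_crawler(cls, crawler):
        # USER_AGENT_POOL is a made-up setting name for this sketch; it is
        # expected to hold a list of User-Agent strings in settings.py.
        return cls(crawler.settings.getlist('USER_AGENT_POOL'))

    def process_request(self, request, spider):
        # Overwrite the header, then return None so Scrapy keeps
        # processing the request as usual.
        request.headers['User-Agent'] = random.choice(self.user_agents)
        return None
```

Only process_request touches the request; everything else about the crawl stays unchanged.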

image.png

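4. Running the crawler

image.png

5. Saving the results

image.png

6. Simple data visualization

Finally, here are the results of some simple data visualization.

6.1 Top 10 by score

image.png

6.2 Title word cloud

image.png

6.3 Quote word cloud

image.png

6.4 Top 10 by number of ratings

image.png

6.5 Number of movies released per year

image.png

6.6 Release regions

image.png

6.7 Movie genres summary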

image.png
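
The article does not show how these charts were produced. As a rough sketch of the aggregations behind them, the snippet below computes a top-10 by score, a per-year count, and a genre count from rows shaped like the scraped items. The three rows are made-up sample data, and the field names follow items.py.

```python
from collections import Counter

# Made-up rows shaped like the scraped items (all values are strings).
rows = [
    {"title": "Movie A", "score": "9.6", "years": "1994", "types": "Crime,Drama"},
    {"title": "Movie B", "score": "9.4", "years": "1994", "types": "Drama,Romance"},
    {"title": "Movie C", "score": "9.5", "years": "1993", "types": "Drama"},
]

# 6.1: the ten highest-scored movies.
top_by_score = sorted(rows, key=lambda r: float(r["score"]), reverse=True)[:10]
# 6.5: how many movies were released in each year.
films_per_year = Counter(r["years"] for r in rows)
# 6.7: how often each genre appears (genres are stored comma-separated).
type_counts = Counter(t for r in rows for t in r["types"].split(",") if t)

print([r["title"] for r in top_by_score])
print(films_per_year.most_common())
print(type_counts.most_common())
```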

The complete project code has been uploaded to GitHub: https://github.com/hu1991die/douan_movie_spider. Forks are welcome!

Last edited: 2017.12.08 10:19:34
