-
Scrapy Installation
Environment: Python 3.6.4
Install Scrapy: pip3 install scrapy
View the available scrapy commands:
scrapy
Scrapy 1.5.1 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command
-
Command Overview
-
Benchmark
Measures your machine's crawling performance, e.g. how many pages it can crawl per minute:
scrapy bench
2018-08-13 10:52:43 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: scrapybot)
2018-08-13 10:52:43 [scrapy.utils.log] INFO: Versions: lxml 4.2.4.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0, Python 3.6.4 (default, Mar 22 2018, 13:54:22) - [GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.39.2)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0h 27 Mar 2018), cryptography 2.3, Platform Darwin-17.7.0-x86_64-i386-64bit
2018-08-13 10:52:44 [scrapy.crawler] INFO: Overridden settings: {'CLOSESPIDER_TIMEOUT': 10, 'LOGSTATS_INTERVAL': 1, 'LOG_LEVEL': 'INFO'}
2018-08-13 10:52:44 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 ...
 'scrapy.extensions.logstats.LogStats']
2018-08-13 10:52:44 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
2018-08-13 10:52:44 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
2018-08-13 10:52:48 [scrapy.extensions.logstats] INFO: Crawled 188 pages (at 2400 pages/min), scraped 0 items (at 0 items/min)
2018-08-13 10:52:54 [scrapy.core.engine] INFO: Closing spider (closespider_timeout)
2018-08-13 10:52:54 [scrapy.extensions.logstats] INFO: Crawled 404 pages (at 1920 pages/min), scraped 0 items (at 0 items/min)
2018-08-13 10:52:55 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 175202,
 'downloader/request_count': 420,
 'downloader/request_method_count/GET': 420,
 'downloader/response_bytes': 1170656,
 'downloader/response_count': 420,
 'downloader/response_status_count/200': 420,
 'finish_reason': 'closespider_timeout',
 'finish_time': datetime.datetime(2018, 8, 13, 2, 52, 55, 694980),
 'log_count/INFO': 17,
 'memusage/max': 51240960,
 'memusage/startup': 51240960,
 'request_depth_max': 17,
 'response_received_count': 420,
 'scheduler/dequeued': 420,
 'scheduler/dequeued/memory': 420,
 'scheduler/enqueued': 8401,
 'scheduler/enqueued/memory': 8401,
 'start_time': datetime.datetime(2018, 8, 13, 2, 52, 44, 825404)}
2018-08-13 10:52:55 [scrapy.core.engine] INFO: Spider closed (closespider_timeout)
-
Fetch Test
Test whether a page can be fetched, using the Baidu homepage as an example:
scrapy fetch http://www.baidu.com
At the end of the output you will get the HTML of the Baidu homepage.
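If you only want the page itself, you can suppress the log output and redirect the HTML to a file (a small usage sketch; baidu.html is just an arbitrary output filename):
scrapy fetch --nolog http://www.baidu.com > baidu.html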
-
List all spiders
Run the following command inside a project directory to list the names of all spiders:
scrapy list
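In a freshly created project the list is empty; once the ItcastSpider spider below has been generated, the command simply prints the spider names, one per line:
ItcastSpider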
-
-
Start a Scrapy Project
-
Create a Scrapy project named scrapy_prj:
scrapy startproject scrapy_prj
Result:
New Scrapy project 'scrapy_prj', using template directory '/usr/local/lib/python3.6/site-packages/scrapy/templates/project', created in:
/Users/solumon/Desktop/scrapy_prj
You can start your first spider with:
cd scrapy_prj
scrapy genspider example example.com
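The generated project follows Scrapy's standard layout, roughly:
scrapy_prj/
├── scrapy.cfg            # deployment configuration file
└── scrapy_prj/           # the project's Python module
    ├── __init__.py
    ├── items.py          # item definitions
    ├── middlewares.py    # spider / downloader middlewares
    ├── pipelines.py      # item pipelines
    ├── settings.py       # project settings
    └── spiders/          # directory where your spiders live
        └── __init__.py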
-
Create a spider
scrapy genspider ItcastSpider www.itcast.com
ItcastSpider: the name of the spider
www.itcast.com: the domain the spider is restricted to crawling. Pass a bare domain here, not a full URL; if you include the http:// scheme, the generated start_urls will end up with a duplicated prefix such as 'http://http://...'.
After running the command, a file ItcastSpider.py is generated under the project's spiders/ directory, with the following content:
# -*- coding: utf-8 -*-
import scrapy

class ItcastspiderSpider(scrapy.Spider):
    name = 'ItcastSpider'
    allowed_domains = ['www.itcast.com']
    start_urls = ['http://www.itcast.com/']

    def parse(self, response):
        pass
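The empty parse method is where extraction logic goes. A minimal sketch of what that could look like, assuming hypothetical CSS selectors and field names that would need to be adapted to the real page:
# -*- coding: utf-8 -*-
import scrapy


class ItcastspiderSpider(scrapy.Spider):
    name = 'ItcastSpider'
    allowed_domains = ['www.itcast.com']
    start_urls = ['http://www.itcast.com/']

    def parse(self, response):
        # The selectors and field names below are placeholders for illustration;
        # inspect the target page and adjust them to its real structure.
        for teacher in response.css('div.li_txt'):
            yield {
                'name': teacher.css('h3::text').extract_first(),
                'title': teacher.css('h4::text').extract_first(),
                'info': teacher.css('p::text').extract_first(),
            }
Run it from the project directory with scrapy crawl ItcastSpider -o teachers.json to export the scraped items to a JSON file.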