Python Crawler (1): Fetching Web Pages

### Analyzing the website

#### Identifying the technology a site uses: the builtwith module

Install: pip install builtwith

Usage:
>>> import builtwith 
>>> builtwith.parse("http://127.0.0.1:8000/examples/default/index")
{u'javascript-frameworks': [u'jQuery'], u'font-scripts': [u'Font Awesome'], u'web-frameworks': [u'Web2py'], u'programming-languages': [u'Python']}

#### Finding the site owner

Install: pip install python-whois

Usage:

import whois
print whois.whois("appspot.com")
{
  "updated_date": [
    "2017-02-06 00:00:00",
    "2017-02-06 02:26:49"
  ],
  "status": [
    "clientDeleteProhibited https://icann.org/epp#clientDeleteProhibited",
    "clientTransferProhibited https://icann.org/epp#clientTransferProhibited",
    "clientUpdateProhibited https://icann.org/epp#clientUpdateProhibited",
    "serverDeleteProhibited https://icann.org/epp#serverDeleteProhibited",
    "serverTransferProhibited https://icann.org/epp#serverTransferProhibited",
    "serverUpdateProhibited https://icann.org/epp#serverUpdateProhibited",
    "clientUpdateProhibited (https://www.icann.org/epp#clientUpdateProhibited)",
    "clientTransferProhibited (https://www.icann.org/epp#clientTransferProhibited)",
    "clientDeleteProhibited (https://www.icann.org/epp#clientDeleteProhibited)",
    "serverUpdateProhibited (https://www.icann.org/epp#serverUpdateProhibited)",
    "serverTransferProhibited (https://www.icann.org/epp#serverTransferProhibited)",
    "serverDeleteProhibited (https://www.icann.org/epp#serverDeleteProhibited)"
  ],
  "name": "DNS Admin",
  "dnssec": "unsigned",
  "city": "Mountain View",
  "expiration_date": [
    "2018-03-10 00:00:00",
    "2018-03-09 00:00:00"
  ],
  "zipcode": "94043",
  "domain_name": [
    "APPSPOT.COM",
    "appspot.com"
  ],
  "country": "US",
  "whois_server": "whois.markmonitor.com",
  "state": "CA",
  "registrar": "MarkMonitor, Inc.",
  "referral_url": "http://www.markmonitor.com",
  "address": "2400 E. Bayshore Pkwy",
  "name_servers": [
    "NS1.GOOGLE.COM",
    "NS2.GOOGLE.COM",
    "NS3.GOOGLE.COM",
    "NS4.GOOGLE.COM",
    "ns1.google.com",
    "ns4.google.com",
    "ns2.google.com",
    "ns3.google.com"
  ],
  "org": "Google Inc.",
  "creation_date": [
    "2005-03-10 00:00:00",
    "2005-03-09 18:27:55"
  ],
  "emails": [
    "abusecomplaints@markmonitor.com",
    "dns-admin@google.com"
  ]
}
As the output shows, this domain belongs to Google.

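### Writing the first crawler
#### Downloading a web page
To crawl a web page, we first need to download it. The example below uses Python's urllib2 module to download a URL.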
1. Basic version
```python
import urllib2


def download(url):
    """Download the page at url; return its HTML, or None on failure."""
    print 'Downloading:', url
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.URLError as e:
        print 'Downloading error:', e.reason
        html = None
    return html
```

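2. Retrying downloads

While crawling, the server may return a server-side error such as 500. Errors in the 5xx range are often temporary, so it is worth retrying the download. A 4xx error, by contrast, points to a problem with our request, so retrying would not help.

Example: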

import urllib2


def download(url, num_retries=3):
    """Download url, retrying up to num_retries times on 5xx errors."""
    print 'Downloading:', url
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.URLError as e:
        print 'Downloading error:', e.reason
        html = None
        if num_retries > 0:
            # only HTTPError carries a status code; retry server-side errors only
            if hasattr(e, 'code') and 500 <= e.code < 600:
                return download(url, num_retries - 1)
    return html

Let's try it against http://httpstat.us/500, which always returns a 500 error:

if __name__ == '__main__':
    download("http://httpstat.us/500")

Output:

Downloading: http://httpstat.us/500
Downloading error: Internal Server Error
Downloading: http://httpstat.us/500
Downloading error: Internal Server Error
Downloading: http://httpstat.us/500
Downloading error: Internal Server Error
Downloading: http://httpstat.us/500
Downloading error: Internal Server Error

The download is attempted once and then retried 3 more times before giving up, which shows that the retry logic works.
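The retry rule can also be exercised without touching the network. The following is a minimal Python 3 sketch (the article's own code targets Python 2), and `download_with_retries`, `fetch`, and `flaky_fetch` are hypothetical names introduced only for this illustration; `fetch` stands in for whatever performs the real request.

```python
import urllib.error


def download_with_retries(url, fetch, num_retries=3):
    """Call fetch(url); retry only on HTTP 5xx errors, at most num_retries times."""
    try:
        return fetch(url)
    except urllib.error.URLError as e:
        # only HTTPError (a subclass of URLError) carries a status code
        code = getattr(e, 'code', None)
        if num_retries > 0 and code is not None and 500 <= code < 600:
            return download_with_retries(url, fetch, num_retries - 1)
        return None


calls = []  # records every attempt made by flaky_fetch


def flaky_fetch(url):
    """Hypothetical stand-in for a real request: fails twice with 503, then succeeds."""
    calls.append(url)
    if len(calls) < 3:
        raise urllib.error.HTTPError(url, 503, 'Service Unavailable', None, None)
    return '<html>ok</html>'


result = download_with_retries('http://example.com/', flaky_fetch)
```

Passing the fetch function in as a parameter is a design choice made here for testability; the article's `download` calls urllib2 directly.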

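3. Setting the user agent

By default, urllib2 identifies itself with the user agent Python-urllib/2.7, where 2.7 is the Python version. Some websites reject requests that carry this default user agent, so to access them normally we need to set our own.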

import urllib2
def download(url, num_retries=3, user_agent="wswp"):
    print 'Downloading:', url
    headers = {'User-agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    try:
        html = urllib2.urlopen(request).read()
    except urllib2.URLError as e:
        print 'Downloading error:', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                return download(url, num_retries - 1, user_agent)
    return html

Using a custom user agent:

if __name__ == '__main__':
    user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
    # pass user_agent by keyword; positionally it would be taken as num_retries
    html = download("http://www.meetup.com", user_agent=user_agent)
    print html
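To see which user agent a request would carry, the header can be inspected before anything is sent. This is a small Python 3 sketch (in Python 3 the `Request` class lives in `urllib.request` rather than `urllib2`); the URL is a placeholder.

```python
import urllib.request

# build the request object only; nothing is sent over the network here
request = urllib.request.Request(
    'http://example.com/',
    headers={'User-agent': 'wswp'},
)

# Request stores header names capitalized, so look the value up as 'User-agent'
user_agent = request.get_header('User-agent')
print(user_agent)
```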

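### Link crawler

A link crawler can follow every link on a site, but usually we only want the links we are interested in, so we use a regular expression to select them: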

import urllib2
import re
import urlparse
def download(url, num_retries=3, user_agent="wswp"):
    print 'Downloading:', url
    headers = {'User-agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    try:
        html = urllib2.urlopen(request).read()
    except urllib2.URLError as e:
        print 'Downloading error:', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                return download(url, num_retries - 1, user_agent)
    return html

def link_crawler(seed_url, link_regex):
    """Crawl from seed_url, following links that match link_regex."""
    crawl_queue = [seed_url]
    # URLs already queued, so that no page is downloaded twice
    seen = set(crawl_queue)
    while crawl_queue:
        url = crawl_queue.pop()
        html = download(url)
        if html is None:
            # the download failed, so there are no links to extract
            continue
        for link in get_links(html):
            if re.match(link_regex, link):
                print link
                # convert a relative link into an absolute URL
                link = urlparse.urljoin(seed_url, link)
                # check whether this link has already been seen
                if link not in seen:
                    seen.add(link)
                    crawl_queue.append(link)


def get_links(html):
    """Return a list of the links found in html."""
    webpage_regex = re.compile('<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)
    return webpage_regex.findall(html)


if __name__ == '__main__':
    link_crawler('http://baozoumanhua.com/video_channels/1745', '/(videos)')
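The link extraction and filtering steps can be tried on their own. The sketch below is written for Python 3 (where `urljoin` lives in `urllib.parse`) and runs the same regular expression over a made-up HTML fragment; the sample page and the example.com host are assumptions for illustration only.

```python
import re
from urllib.parse import urljoin

# sample page, made up for this illustration
html = '''
<a href="/videos/1">first video</a>
<a class="nav" href='/about'>about</a>
<A HREF="/videos/2">second video</A>
'''

# the same pattern as get_links: capture the href value of every <a> tag
webpage_regex = re.compile('<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)
links = webpage_regex.findall(html)

# keep only the links that match the filter, then make them absolute
seed_url = 'http://example.com/video_channels/1745'
matched = [urljoin(seed_url, link) for link in links if re.match('/(videos)', link)]
print(matched)
```

Note that `re.match` is applied to the relative link, before `urljoin`, which is why the pattern starts with `/`.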

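### Advanced features

#### Proxy support

Some websites block visitors from many countries, so we need a proxy to reach them. With urllib2, proxy support looks like this: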

proxy = ...
opener = urllib2.build_opener()
# map the URL scheme (http or https) to the proxy address
proxy_params = {urlparse.urlparse(url).scheme: proxy}
opener.add_handler(urllib2.ProxyHandler(proxy_params))
response = opener.open(request)

Integrated into the download function, it looks like this:

def download(url, num_retries=3, user_agent="wswp", proxy=None):
    """Download url, optionally through a proxy, retrying on 5xx errors."""
    print 'Downloading:', url
    headers = {'User-agent': user_agent}
    request = urllib2.Request(url, headers=headers)

    # build an opener and route it through the proxy if one is given
    opener = urllib2.build_opener()
    if proxy:
        proxy_params = {urlparse.urlparse(url).scheme: proxy}
        opener.add_handler(urllib2.ProxyHandler(proxy_params))

    try:
        # use the opener, not urllib2.urlopen, so that the proxy takes effect
        html = opener.open(request).read()
    except urllib2.URLError as e:
        print 'Downloading error:', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                return download(url, num_retries - 1, user_agent, proxy)
    return html
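Whether the proxy was actually registered can be checked by inspecting the opener before making any request. This is a Python 3 sketch (using `urllib.request` in place of `urllib2`), and the proxy address is a hypothetical placeholder, not a working proxy.

```python
import urllib.request
from urllib.parse import urlparse

url = 'http://example.com/'
proxy = 'http://127.0.0.1:8080'  # hypothetical proxy address

# map the URL scheme to the proxy, as in the download function
proxy_params = {urlparse(url).scheme: proxy}
opener = urllib.request.build_opener(urllib.request.ProxyHandler(proxy_params))

# list the proxy handlers the opener will consult; no request is sent
proxy_handlers = [h for h in opener.handlers
                  if isinstance(h, urllib.request.ProxyHandler)]
print(proxy_handlers[0].proxies)
```

Requests only go through the proxy when they are sent with `opener.open(...)`; a plain `urlopen` call would not use this opener.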

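#### Throttling downloads

If a crawler downloads pages too quickly, it risks being blocked or overloading the server. To avoid this and behave more like a normal user, we add a delay between two downloads to throttle the crawler: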


import datetime
import time
import urlparse


class Throttle:
    """Add a delay between downloads to the same domain."""
    def __init__(self, delay):
        # amount of delay between downloads for each domain
        self.delay = delay
        # timestamp of when each domain was last accessed
        self.domains = {}

    def wait(self, url):
        domain = urlparse.urlparse(url).netloc
        last_accessed = self.domains.get(domain)

        if self.delay > 0 and last_accessed is not None:
            elapsed = datetime.datetime.now() - last_accessed
            sleep_secs = self.delay - elapsed.total_seconds()
            if sleep_secs > 0:
                # the domain was accessed recently, so sleep for the remaining time
                time.sleep(sleep_secs)
        # record the time of this access to the domain
        self.domains[domain] = datetime.datetime.now()

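The Throttle class records when each domain was last accessed. If less than the specified delay has passed since the last access, it sleeps for the remaining time. Calling the Throttle object before every download throttles the crawler. Integrated with the earlier download code: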

# create one Throttle, then call wait() before every download
throttle = Throttle(delay)
...
throttle.wait(url)
result = download(url, num_retries=num_retries, user_agent=user_agent, proxy=proxy)
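The per-domain bookkeeping is easier to follow with a clock that can be controlled. The sketch below is a Python 3 variant written for illustration only: the `clock` and `sleep` parameters, and the `fake_clock` and `fake_sleep` helpers, are assumptions added here so that the logic runs instantly instead of really sleeping.

```python
from urllib.parse import urlparse


class Throttle:
    """Delay requests to the same domain; clock and sleep are injectable for testing."""
    def __init__(self, delay, clock, sleep):
        self.delay = delay
        self.clock = clock      # returns the current time in seconds
        self.sleep = sleep      # pauses for the given number of seconds
        self.domains = {}       # domain -> time of last access

    def wait(self, url):
        domain = urlparse(url).netloc
        last_accessed = self.domains.get(domain)
        if self.delay > 0 and last_accessed is not None:
            sleep_secs = self.delay - (self.clock() - last_accessed)
            if sleep_secs > 0:
                self.sleep(sleep_secs)
        self.domains[domain] = self.clock()


now = [100.0]   # fake clock value, advanced by fake_sleep
slept = []      # records every requested pause


def fake_clock():
    return now[0]


def fake_sleep(secs):
    slept.append(secs)
    now[0] += secs


throttle = Throttle(5, fake_clock, fake_sleep)
throttle.wait('http://example.com/a')   # first visit to the domain: no pause
now[0] += 2                             # two seconds pass
throttle.wait('http://example.com/b')   # same domain: pause for the remaining 3 seconds
throttle.wait('http://other.org/')      # different domain: no pause
print(slept)
```

In real use the same class could be created as `Throttle(5, time.monotonic, time.sleep)`.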

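#### Avoiding crawler traps

So far the crawler follows every link it has not seen before. On some sites, however, the current page links to a next page, which links to the page after that, and so on without end. This situation is called a crawler trap.

A simple way to avoid it is to record how many links were followed to reach the current page, which we call the depth. Once the maximum depth is reached, the crawler no longer adds links from that page to the queue. Adding this to the link-following code above gives: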

import urllib2
import re
import urlparse

# this version adds a limit on crawl depth (see link_crawler below)
def download(url, num_retries=3, user_agent="wswp"):
    print 'Downloading:', url
    headers = {'User-agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    try:
        html = urllib2.urlopen(request).read()
    except urllib2.URLError as e:
        print 'Downloading error:', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                return download(url, num_retries - 1, user_agent)
    return html

def link_crawler(seed_url, link_regex, max_depth=2):
    """Crawl from seed_url, following matching links up to max_depth."""
    crawl_queue = [seed_url]
    # seen is now a dict that maps each URL to the depth at which it was found
    seen = {seed_url: 0}

    while crawl_queue:
        url = crawl_queue.pop()
        html = download(url)
        if html is None:
            # the download failed, so there are no links to extract
            continue
        # look up the depth to check whether the maximum has been reached
        depth = seen[url]
        if depth != max_depth:
            for link in get_links(html):
                if re.match(link_regex, link):
                    print link
                    # convert a relative link into an absolute URL
                    link = urlparse.urljoin(seed_url, link)
                    # check whether this link has already been seen
                    if link not in seen:
                        seen[link] = depth + 1
                        crawl_queue.append(link)


def get_links(html):
    """Return a list of the links found in html."""
    webpage_regex = re.compile('<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)
    return webpage_regex.findall(html)


if __name__ == '__main__':
    link_crawler('http://baozoumanhua.com/video_channels/1745', '/(videos)')

To disable this feature, set max_depth to a negative number, so that the current depth never equals it.
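The effect of the depth limit can be seen on an artificial trap. In this sketch, pages are plain integers and `endless_links` is a hypothetical link source in which every page links to the next one; `crawl` mirrors the depth bookkeeping of `link_crawler` but is not the article's function.

```python
def crawl(seed, get_links, max_depth=2):
    """Return the pages visited from seed, following links up to max_depth."""
    crawl_queue = [seed]
    seen = {seed: 0}          # page -> depth at which it was found
    visited = []
    while crawl_queue:
        page = crawl_queue.pop()
        visited.append(page)
        depth = seen[page]
        if depth != max_depth:
            # below the maximum depth: queue links that have not been seen yet
            for link in get_links(page):
                if link not in seen:
                    seen[link] = depth + 1
                    crawl_queue.append(link)
    return visited


def endless_links(page):
    """Hypothetical trap: page N always links to page N + 1."""
    return [page + 1]


visited = crawl(0, endless_links, max_depth=2)
print(visited)
```

With a negative `max_depth` the same call would never terminate on this trap, which is exactly the case the limit protects against.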

#### Final version

import re
import urlparse
import urllib2
import time
from datetime import datetime
import robotparser
import Queue


def link_crawler(seed_url, link_regex=None, delay=5, max_depth=-1, max_urls=-1, headers=None, user_agent='wswp', proxy=None, num_retries=1):
    """Crawl from the given seed URL following links matched by link_regex
    """
    # the queue of URL's that still need to be crawled
    crawl_queue = Queue.deque([seed_url])
    # the URL's that have been seen and at what depth
    seen = {seed_url: 0}
    # track how many URL's have been downloaded
    num_urls = 0
    rp = get_robots(seed_url)
    throttle = Throttle(delay)
    headers = headers or {}
    if user_agent:
        headers['User-agent'] = user_agent

    while crawl_queue:
        url = crawl_queue.pop()
        # check url passes robots.txt restrictions
        if rp.can_fetch(user_agent, url):
            throttle.wait(url)
            html = download(url, headers, proxy=proxy, num_retries=num_retries)
            links = []

            depth = seen[url]
            if depth != max_depth:
                # can still crawl further
                if link_regex:
                    # filter for links matching our regular expression
                    links.extend(link for link in get_links(html) if re.match(link_regex, link))

                for link in links:
                    link = normalize(seed_url, link)
                    # check whether already crawled this link
                    if link not in seen:
                        seen[link] = depth + 1
                        # check link is within same domain
                        if same_domain(seed_url, link):
                            # success! add this new link to queue
                            crawl_queue.append(link)

            # check whether have reached downloaded maximum
            num_urls += 1
            if num_urls == max_urls:
                break
        else:
            print 'Blocked by robots.txt:', url


class Throttle:
    """Throttle downloading by sleeping between requests to same domain
    """
    def __init__(self, delay):
        # amount of delay between downloads for each domain
        self.delay = delay
        # timestamp of when a domain was last accessed
        self.domains = {}
        
    def wait(self, url):
        domain = urlparse.urlparse(url).netloc
        last_accessed = self.domains.get(domain)

        if self.delay > 0 and last_accessed is not None:
            # total_seconds() covers the whole interval; .seconds would drop days and fractions
            sleep_secs = self.delay - (datetime.now() - last_accessed).total_seconds()
            if sleep_secs > 0:
                time.sleep(sleep_secs)
        self.domains[domain] = datetime.now()


def download(url, headers, proxy, num_retries, data=None):
    print 'Downloading:', url
    request = urllib2.Request(url, data, headers)
    opener = urllib2.build_opener()
    if proxy:
        proxy_params = {urlparse.urlparse(url).scheme: proxy}
        opener.add_handler(urllib2.ProxyHandler(proxy_params))
    try:
        response = opener.open(request)
        html = response.read()
        code = response.code
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = ''
        if hasattr(e, 'code'):
            code = e.code
            if num_retries > 0 and 500 <= code < 600:
                # retry 5XX HTTP errors
                return download(url, headers, proxy, num_retries-1, data)
        else:
            code = None
    return html


def normalize(seed_url, link):
    """Normalize this URL by removing hash and adding domain
    """
    link, _ = urlparse.urldefrag(link) # remove hash to avoid duplicates
    return urlparse.urljoin(seed_url, link)


def same_domain(url1, url2):
    """Return True if both URL's belong to same domain
    """
    return urlparse.urlparse(url1).netloc == urlparse.urlparse(url2).netloc


def get_robots(url):
    """Initialize robots parser for this domain
    """
    rp = robotparser.RobotFileParser()
    rp.set_url(urlparse.urljoin(url, '/robots.txt'))
    rp.read()
    return rp
        

def get_links(html):
    """Return a list of links from html 
    """
    # a regular expression to extract all links from the webpage
    webpage_regex = re.compile('<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)
    # list of all links from the webpage
    return webpage_regex.findall(html)


if __name__ == '__main__':
    user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
    
    link_crawler('http://baozoumanhua.com/video_channels/1745', '/(videos)', delay=0, num_retries=1, user_agent=user_agent)

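The code above integrates all of the features described so far. To see the crawler in action, run it from a terminal: python xxx.py (where xxx.py is the file in which you saved the code above).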
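The final version also checks robots.txt through `get_robots` and `rp.can_fetch`, a step the text does not walk through. The following is a minimal Python 3 sketch of that check (in Python 3 the module is `urllib.robotparser`); the robots.txt rules and URLs are made up for illustration and are parsed from a string so that nothing is downloaded.

```python
from urllib import robotparser

# made-up robots.txt rules for this illustration
robots_txt = """
User-agent: BadCrawler
Disallow: /

User-agent: *
Disallow: /trap
"""

rp = robotparser.RobotFileParser()
# parse() takes the file's lines directly, instead of set_url() plus read()
rp.parse(robots_txt.splitlines())

allowed_page = rp.can_fetch('wswp', 'http://example.com/videos/1')
blocked_path = rp.can_fetch('wswp', 'http://example.com/trap/1')
blocked_agent = rp.can_fetch('BadCrawler', 'http://example.com/videos/1')
print(allowed_page, blocked_path, blocked_agent)
```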

Last edited: 2017.12.08 05:20:18
