Python Web Scraping (Including Cookie Spoofing)

I have only been teaching myself Python for a short while; I normally work in C.

Required tools

*Install Fiddler

*Install beautifulsoup4


pip install beautifulsoup4 -i https://pypi.douban.com/simple

*Install requests (a third-party package, not part of the standard library; install it with pip install requests)

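Using BeautifulSoup to parse HTML documents

bs4 makes it easy to parse structured documents such as HTML and XML. For an HTTP scraper, the feature we use most is HTML parsing.

For example, take the following sample: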


html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

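After constructing the soup object:

import bs4

soup = bs4.BeautifulSoup(html_doc, "html.parser")

the tags in the document become members of the soup object.

There are then several ways to locate or iterate over tags and their attributes.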


# Find all <a> tags and return them as a list

In [10]: soup.find_all('a')
Out[10]:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

A tag object supports dictionary-style access to its attributes:

In [11]: for tag in soup.find_all('a'):
    ...:     print(tag['href'])
    ...:
http://example.com/elsie
http://example.com/lacie
http://example.com/tillie

The .text attribute of a tag returns its text:

In [12]: list_a_tags = soup.find_all('a')

In [13]: list_a_tags[0].text
Out[13]: 'Elsie'

The following example prints the text of every p tag whose class is story:

In [20]: list_p_tags = soup.find_all('p')

In [21]: for tag in list_p_tags:
    ...:     if tag['class'][0] == 'story':
    ...:         print(tag.text)
    ...:
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
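The loop above indexes tag['class'] directly, which raises KeyError for a p tag that has no class attribute at all. As a hedged alternative, the sketch below lets find_all do the filtering through its class_ keyword and reads attributes with get(). The html_doc here is a copy of the sample above; the variable names story_texts and sister_links are introduced only for this illustration.

```python
import bs4

# A copy of the sample document used above.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = bs4.BeautifulSoup(html_doc, "html.parser")

# class_ (with a trailing underscore) filters on the CSS class. A tag matches
# if "story" is any one of its classes, and tags without a class are skipped.
story_paragraphs = soup.find_all("p", class_="story")
story_texts = [tag.text for tag in story_paragraphs]

# tag.get() returns None instead of raising KeyError for a missing attribute.
sister_links = [a.get("href") for a in soup.find_all("a", class_="sister")]

for text in story_texts:
    print(text)
print(sister_links)
```

Filtering inside find_all keeps the loop body free of attribute checks, which tends to matter on real pages where many tags carry no class.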

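Introduction to the requests module

requests is a third-party library (it is not part of the Python standard library; install it with pip install requests) that makes it easy to send HTTP requests and read the responses. The script below uses it to look up a word on dict.cn: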


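import sys

import bs4
import requests


def main():
    # Expect exactly one command-line argument: the word to look up.
    if len(sys.argv) != 2:
        print('''
        usage:
            word_lookup.py <word>
        ''')
        return

    url = 'http://dict.cn/' + sys.argv[1]
    rsp = requests.get(url)
    soup = bs4.BeautifulSoup(rsp.text, 'html.parser')

    # find() returns None when no element has id="content".
    div_id_content = soup.find(id='content')
    if div_id_content is None:
        print('no element with id="content" found')
        return

    # Print the text of every <strong> tag inside that element.
    for tag in div_id_content.find_all('strong'):
        print(tag.text)


if __name__ == "__main__":
    main()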

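GET and POST in HTTP

GET

In a GET request, the data sent by the client (the browser) is appended to the end of the URL as a query string, where the user can see it.

POST

In a POST request, the client's data is placed in the body of the HTTP message, where an ordinary user does not see it.

The author of requests also developed a test server called httpbin, which accepts all kinds of HTTP requests and replies to them. You can use www.httpbin.org, or install it locally by following https://hub.docker.com/r/kennethreitz/httpbin/.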
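To make the difference between the two methods concrete, here is a minimal sketch that builds one GET and one POST request with requests and inspects them without sending anything. The payload fields are hypothetical, and the httpbin URLs are used only as placeholders.

```python
import requests

# Hypothetical form fields, used only for illustration.
payload = {"word": "hello", "lang": "en"}

# prepare() builds the HTTP message without sending it, so no network is needed.
get_req = requests.Request("GET", "http://httpbin.org/get", params=payload).prepare()
post_req = requests.Request("POST", "http://httpbin.org/post", data=payload).prepare()

print(get_req.url)    # the data is visible in the URL
print(get_req.body)   # a GET built this way has no body
print(post_req.url)   # the URL is unchanged
print(post_req.body)  # the data is form-encoded in the body
print(post_req.headers["Content-Type"])
```

Passing params= puts the data in the query string, while data= puts it in the body. Sending the same two requests with requests.get and requests.post to httpbin should echo back where the server found the data.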

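Capturing packets and analyzing HTTP with Fiddler

Each session in Fiddler (one HTTP exchange) contains a request and a response, and each of these is in turn divided into two parts: headers and data (the body).

Analyzing the protocol with Fiddler means looking at exactly how the client (the browser) and the server exchange messages.

A scraper should generally imitate the requests of a real browser as closely as possible.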
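One common way to imitate a browser is to copy the request headers seen in Fiddler into the scraper. The sketch below assumes a few header values; they are hypothetical and should be replaced with the ones captured from a real browser request. Nothing is sent over the network.

```python
import requests

# Hypothetical header values; copy the real ones from a request captured in Fiddler.
browser_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
    "Referer": "http://httpbin.org/",
}

session = requests.Session()
# By default requests identifies itself as "python-requests/<version>".
default_user_agent = session.headers["User-Agent"]
session.headers.update(browser_headers)

# prepare_request() merges the session headers into the request without sending it.
prepared = session.prepare_request(
    requests.Request("GET", "http://httpbin.org/headers")
)
for name, value in prepared.headers.items():
    print(f"{name}: {value}")
```

Setting the headers on the session rather than on each call means every later request from that session carries them.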

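Cookies

A cookie is a mechanism invented so that a website (the server) can keep track of user information across pages.

It lets the server create small records (cookies) on the client. In later communication with the server, the client reads the contents of those cookies and sends them along to the matching website.

This is how the server can recognize the same user from one page to the next.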
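The round trip described above can be sketched with the standard-library http.cookies module. The Set-Cookie values are hypothetical, and a real browser also applies domain, path, and expiry rules that this simplified illustration leaves out.

```python
from http.cookies import SimpleCookie

# Hypothetical Set-Cookie header values, as a server might send them in a response.
set_cookie_headers = [
    "sessionid=abc123; Path=/; HttpOnly",
    "theme=dark; Path=/",
]

# The client stores each cookie it receives...
jar = SimpleCookie()
for header in set_cookie_headers:
    jar.load(header)

# ...and sends the name=value pairs back in the Cookie header of later requests.
cookie_header = "; ".join(f"{name}={morsel.value}" for name, morsel in jar.items())
print(cookie_header)
```

Only the name=value pairs travel back to the server; attributes such as Path and HttpOnly tell the client how to store and scope the cookie.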

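The requests library already provides a way to keep state across requests:

my_session = requests.Session()  # get a session object
my_session.post(url, headers=header_dict, data=data_dict)  # used like the ordinary post or get functions
# The difference is that a session is continuous: cookies and similar state are saved automatically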

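When scraping a site that requires authentication (a login), there are generally two approaches:

First, analyze the HTTP traffic and reproduce the complete login request sequence.

Second, log in by hand first, then copy the cookie into the session used afterwards, so that the scraper fetches pages in the logged-in state.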
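The second approach might look like the sketch below. The cookie string is hypothetical and stands in for the Cookie header copied from a logged-in browser request in Fiddler. parse_cookie_header is a helper written for this illustration, not part of requests. The request is built but not sent.

```python
import requests

# Hypothetical Cookie header copied from a logged-in browser request in Fiddler.
copied_cookie_header = "sessionid=abc123; csrftoken=xyz789"


def parse_cookie_header(header):
    """Split a 'name=value; name2=value2' string into a dict."""
    cookies = {}
    for pair in header.split(";"):
        if "=" in pair:
            name, value = pair.strip().split("=", 1)
            cookies[name] = value
    return cookies


session = requests.Session()
session.cookies.update(parse_cookie_header(copied_cookie_header))

# Build (but do not send) a request to confirm the cookies would be attached.
prepared = session.prepare_request(
    requests.Request("GET", "http://httpbin.org/cookies")
)
print(prepared.headers.get("Cookie"))
```

Cookies set this way carry no domain, so the session attaches them to every host it talks to. Copied cookies also stop working once the server expires the login, at which point they have to be copied again.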