Chapter 5: Crawling Zhihu Questions

Scraping Zhihu questions and answers

Tags (space-separated): python scrapy session cookie


The difference between a session and a cookie

  • cookie

    A cookie is a key/value store of user data kept locally by the browser; because it lives on the client, it carries serious security risks.

  • session

    A session lives on the server, which assigns the user an id; on each request the server looks up the user's data by that id and sends it back to the browser. A session also has an expiry time. (A minimal demonstration follows.)
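
To make the distinction concrete, here is a minimal sketch (my own illustration, not from the original course; httpbin.org is just a demo service) showing how a requests.Session keeps server-set cookies across calls, which is exactly what the login script later in this chapter relies on:

import requests

# A Session keeps whatever cookies the server sets and sends them back on later
# requests, mimicking the browser-side half of the session mechanism.
session = requests.Session()
session.get("https://httpbin.org/cookies/set?demo=1")       # server sets cookie demo=1
print(session.cookies.get_dict())                           # -> {'demo': '1'}

# A bare requests.get() starts from an empty cookie jar every time, so nothing persists.
print(requests.get("https://httpbin.org/cookies").json())   # -> {'cookies': {}}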

Simulating the Zhihu login

  • HTTP status codes (a quick check is sketched right after this table)

    code      meaning
    200       the request succeeded
    301/302   permanent / temporary redirect
    403       access forbidden
    404       the requested resource does not exist
    500       server error
    503       the server is down or under maintenance
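
A quick way to see these codes in practice (a small sketch, not from the original; it reuses the inbox trick that is_login() applies further down):

import requests

# With redirects disabled, a logged-out request to a members-only page shows the
# 3xx redirect itself instead of silently following it to the login page.
resp = requests.get("http://www.zhihu.com/inbox", allow_redirects=False)
print(resp.status_code)   # 200 when logged in, 3xx/403 otherwise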
  • Simulating the Zhihu login (with requests)
    Zhihu's anti-crawling measures have changed since the original video was recorded; at the time of writing, the simulated login also has to submit a captcha parameter, as in the script below.
#!/usr/bin/env python3
# _*_ coding: utf-8 _*_
"""
 @author 金全 JQ
 @version 1.0 , 2017/10/25
 @description Simulated Zhihu login
"""

import requests
import re

try:
    import cookielib                      # Python 2
except ImportError:
    import http.cookiejar as cookielib    # Python 3 renamed the module

import time
try:
    from PIL import Image
except:
    pass
import os

session = requests.session()
session.cookies = cookielib.LWPCookieJar(filename="cookies.txt")

try:
    session.cookies.load(ignore_discard=True)
except Exception:
    print("Failed to load cookies.txt")

agent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.91 Safari/537.36"
header = {
    "HOST":"www.zhihu.com",
    "Referer":"https://www.zhihu.com",
    "User-Agent":agent
}


# Get the _xsrf token from the home page
def get_xsrf():
    response = session.get("https://www.zhihu.com",headers= header)
    print(response.text)
    match_obj = re.findall(r'name="_xsrf" value="(.*?)"', response.text)
    if match_obj:
        return match_obj[0]
    else:
        return ""


def is_login():
    # Check the login state via the inbox page (it redirects when not logged in)
    inbox_url = 'http://www.zhihu.com/inbox'
    response = session.get(inbox_url,headers= header,allow_redirects=False)
    if response.status_code !=200:
        return False
    else:
        return True



def get_captcha():
    # Fetch the captcha image and ask the user to type it in
    t = str(int(time.time() * 1000))
    captcha_url = 'http://www.zhihu.com/captcha.gif?r=' + t + "&type=login"
    r = session.get(captcha_url, headers=header)
    with open('captcha.jpg', 'wb') as f:
        f.write(r.content)
    try:
        im = Image.open('captcha.jpg')
        im.show()
        im.close()
    except Exception:
        print('Please open %s and enter the captcha manually' % os.path.abspath('captcha.jpg'))
    captcha = input('captcha: ')
    return captcha


def get_index():
    response = session.get("https://www.zhihu.com", headers=header)
    with open("page_index.html","wb") as f:
        f.write(response.text.encode("utf-8"))
    print("ok")


def zhihu_login(account, password):
    # Log in with either a phone number or an email address
    if re.match(r"^1\d{10}$", account):
        print("Logging in with phone number")
        post_url = "https://www.zhihu.com/login/phone_num"
        post_data = {
            "_xsrf": get_xsrf(),
            "phone_num": account,
            "captcha": get_captcha(),
            "password": password
        }
    elif "@" in account:
        print("Logging in with email")
        post_url = "https://www.zhihu.com/login/email"
        post_data = {
            "_xsrf": get_xsrf(),
            "email": account,
            "captcha": get_captcha(),
            "password": password
        }
    else:
        # Neither a phone number nor an email address: bail out instead of posting
        print("Account must be a phone number or an email address")
        return

    session.post(post_url, data=post_data, headers=header)
    session.cookies.save()




# zhihu_login("${username}","${password}")
# get_index()
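
Putting the functions above together, a typical entry point looks like the sketch below (the credential placeholders are kept from the original; substitute real values):

if __name__ == "__main__":
    if not is_login():                             # reuse saved cookies when possible
        zhihu_login("${username}", "${password}")  # phone number or email address
    get_index()                                    # save the logged-in home page to page_index.html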
  • Implementing the Zhihu login with Scrapy
# -*- coding: utf-8 -*-
import scrapy
import time
try:
    from PIL import Image       # Pillow is optional; used only to display the captcha
except ImportError:
    pass

import json
import os

try:
    import urlparse as parse              # Python 2
except ImportError:
    from urllib import parse              # Python 3

class ZhihuLoginSpider(scrapy.Spider):
    name = 'zhihu_login'
    allowed_domains = ['www.zhihu.com']
    start_urls = ['http://www.zhihu.com/']
    Agent = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'
    header = {
        'User-Agent': Agent,
    }

    def parse(self, response):
        # Parse the home page
        all_urls = response.css('a::attr(href)').extract()
        all_urls = [parse.urljoin(response.url, url) for url in all_urls]
        # Keep only https links; this drops javascript: pseudo-links
        all_urls = filter(lambda x: x.startswith("https"), all_urls)
        for url in all_urls:
            pass

    def start_requests(self):
        t = str(int(time.time() * 1000))
        captcha_url = 'https://www.zhihu.com/captcha.gif?r=' + t + '&type=login&lang=en'
        return [scrapy.Request(url=captcha_url, headers=self.header, callback=self.parser_captcha)]

    def parser_captcha(self, response):
        # Save the captcha image, show it if possible, and ask the user to type it in
        with open('captcha.jpg', 'wb') as f:
            f.write(response.body)
        try:
            im = Image.open('captcha.jpg')
            im.show()
            im.close()
        except Exception:
            print('Please open %s and enter the captcha manually' % os.path.abspath('captcha.jpg'))
        captcha = input("please input the captcha\n>")
        return scrapy.FormRequest(url='https://www.zhihu.com/#signin', headers=self.header, callback=self.login, meta={
            'captcha': captcha
        })

    def login(self, response):
        xsrf = response.xpath("//input[@name='_xsrf']/@value").extract_first()
        if xsrf is None:
            return
        post_url = 'https://www.zhihu.com/login/phone_num'
        post_data = {
            "_xsrf": xsrf,
            "phone_num": '${username}',
            "password": '${password}',
            "captcha": response.meta['captcha']
        }
        return [scrapy.FormRequest(url=post_url, formdata=post_data, headers=self.header, callback=self.check_login)]

    # Check whether the login succeeded
    def check_login(self, response):
        js = json.loads(response.text)
        if 'msg' in js and js['msg'] == '登录成功':   # '登录成功' is the API's "login successful" message
            for url in self.start_urls:
                yield scrapy.Request(url=url, headers=self.header, dont_filter=True)

Extracting Zhihu questions and answer content

  • Using scrapy shell (selectors can then be tested interactively; see the sketch below)
scrapy shell -s USER_AGENT="Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.91 Safari/537.36" <url>
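
Once the shell opens on a question page, the selectors used later in this chapter can be tried interactively, for example:

>>> response.css('.QuestionHeader-title::text').extract_first()   # question title (new layout)
>>> response.css('.QuestionHeader-detail').extract_first()        # question body
>>> response.css('.NumberBoard-value::text').extract()            # follower / view counts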
  • Crawler logic

    Scrapy is built on the Twisted asynchronous framework and, with its default LIFO request queue, crawls depth-first.

    Simulating the Zhihu login (purpose: to reach data that is only visible when logged in): start_requests is the entry point and fetches the captcha; the callback writes the image to disk and reads it back, and the login itself is then performed with a FormRequest carrying the login parameters.

    Crawling Zhihu questions: collect every url on the fetched page and keep only the https links, discarding anything that is not a real request; then use a regular expression to pick out the question urls. Iterate over the extracted urls (non-question urls are fed back into this same callback), request each question url and parse its content, and finally call Zhihu's answers API to fetch and parse the answer data (see the API template sketched below).
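
question_detail below formats self.start_answer_url with the question id, a limit, and an offset. That attribute is not part of the excerpt; a plausible definition, assuming the Zhihu v4 answers API the course targeted (the exact include= field list is a guess and may need adjusting), would sit next to start_urls in the spider class:

    # {0} = question id, {1} = limit (answers per page), {2} = offset
    start_answer_url = ("https://www.zhihu.com/api/v4/questions/{0}/answers"
                        "?include=data[*].content,voteup_count,comment_count,created_time,updated_time"
                        "&limit={1}&offset={2}")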

    def parse(self, response):
        # Parse the home page
        all_urls = response.css('a::attr(href)').extract()
        all_urls = [parse.urljoin(response.url, url) for url in all_urls]
        # Keep only https links; this drops javascript: pseudo-links
        all_urls = filter(lambda x: x.startswith("https"), all_urls)
        for url in all_urls:
            match_obj = re.match(r"(.*zhihu.com/question/(\d+))(/|$).*", url)
            if match_obj:
                request_url = match_obj.group(1)
                question_id = match_obj.group(2)
                yield scrapy.Request(url=request_url, headers=self.header, meta={'question_id': question_id}, callback=self.question_detail)
            else:
                yield scrapy.Request(url=url, headers=self.header, callback=self.parse)

    def question_detail(self,response):
        # Extract the question fields with an ItemLoader
        item_loader = ItemLoader(item=ZhihuQuestionItem(), response=response)
        if "QuestionHeader-title" in response.text:
            # New page layout
            item_loader.add_value('url',response.url)
            item_loader.add_value('zhihu_id',response.meta['question_id'])
            item_loader.add_css('title','.QuestionHeader-title::text')
            item_loader.add_css('content','.QuestionHeader-detail')
            item_loader.add_css('answer_num','.List-headerText span::text')
            item_loader.add_css('comments_num','.QuestionHeader-Comment button::text')
            item_loader.add_css('watch_user_num','.NumberBoard-value::text')
            item_loader.add_css('topics','.QuestionHeader-tags .Popover div::text')
        else:
            # Old page layout
            item_loader.add_value('url', response.url)
            item_loader.add_value('zhihu_id', response.meta['question_id'])
            item_loader.add_css('title', '.zh-question-title h2 a::text')
            item_loader.add_css('content', '#zh-question-detail')
            item_loader.add_css('answer_num', '#zh-question-answer-num::text')
            item_loader.add_css('comments_num', '#zh-question-meta-wrap a[name="addcomment"]::text')
            item_loader.add_css('watch_user_num', '#zh-question-side-header-wrap::text')
            item_loader.add_css('topics', '.zm-tag-editor-labels a::text')
        question_item = item_loader.load_item()
        yield scrapy.Request(url=self.start_answer_url.format(response.meta['question_id'],20,0),headers=self.header,callback=self.parse_answer)
        yield question_item



    def parse_answer(self,response):
        # Parse one page of the answers API response
        answer_item = ZhihuAnswerItem()
        answer_json = json.loads(response.text)
        is_end = answer_json['paging']['is_end']
        next_url = answer_json['paging']['next']

        for answer in answer_json['data']:
            answer_item['zhihu_id'] = answer['id']
            answer_item['url'] = answer['url']
            answer_item['question_id'] = answer['question']['id']
            answer_item['author_id'] = answer['author']['id'] if "id" in answer['author'] else None
            answer_item['content'] = answer['content'] if "content" in answer else None
            answer_item['praise_num'] = answer['voteup_count']
            answer_item['comments_num'] = answer['comment_count']
            answer_item['create_time'] = answer['created_time']
            answer_item['update_time'] = answer['updated_time']
            answer_item['crawl_time'] = datetime.datetime.now()
            yield answer_item

        if not is_end:
            yield scrapy.Request(url=next_url,headers=self.header,callback=self.parse_answer)

Saving Zhihu questions and answers to the database: each item defines a method that builds its SQL statement and parameters; the answer insert uses MySQL's ON DUPLICATE KEY UPDATE syntax to handle primary-key conflicts, the parameters are cleaned up and returned, and the inserts themselves are then executed asynchronously through Twisted.

class ZhihuQuestionItem(scrapy.Item):
    # Zhihu question item fields
    zhihu_id = scrapy.Field()
    topics = scrapy.Field()
    url = scrapy.Field()
    title = scrapy.Field()
    content = scrapy.Field()
    answer_num = scrapy.Field()
    comments_num = scrapy.Field()
    watch_user_num = scrapy.Field()
    click_num = scrapy.Field()
    crawl_time = scrapy.Field()

    def get_insert_sql(self):
        insert_sql="""
            insert into zhihu_question(zhihu_id,topics,url,title,content,answer_num,comments_num,watch_user_num,click_num,crawl_time) 
            values (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)
        """
        zhihu_id = ''.join(self['zhihu_id'])
        topics = ','.join(self['topics'])
        url = ''.join(self['url'])
        title = ''.join(self['title'])
        content = ''.join(self['content'])
        answer_num = common.regex_match(''.join(self['answer_num']))
        comments_num = common.regex_match(''.join(self['comments_num']))
        watch_user_num = self['watch_user_num'][0]
        click_num = self['watch_user_num'][1]
        crawl_time = datetime.datetime.now().strftime(MYSQL_DATETIEM_STRFTIME)
        params = (zhihu_id,topics,url,title,content,answer_num,comments_num,watch_user_num,click_num,crawl_time)
        return insert_sql,params

class ZhihuAnswerItem(scrapy.Item):
    # Zhihu answer item fields
    zhihu_id = scrapy.Field()
    url = scrapy.Field()
    question_id = scrapy.Field()
    author_id = scrapy.Field()
    content = scrapy.Field()
    praise_num = scrapy.Field()
    comments_num = scrapy.Field()
    create_time = scrapy.Field()
    update_time = scrapy.Field()
    crawl_time = scrapy.Field()

    def get_insert_sql(self):
        insert_sql = """
            insert into zhihu_answer(zhihu_id,url,question_id,author_id,content,praise_num,comments_num,create_time,update_time,crawl_time)
            values (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s) ON DUPLICATE KEY UPDATE content = VALUES (content),praise_num = VALUES (praise_num),
            comments_num = VALUES (comments_num),update_time = VALUES (update_time)
        """
        create_time = datetime.datetime.fromtimestamp(self['create_time']).strftime(MYSQL_DATETIEM_STRFTIME)
        update_time = datetime.datetime.fromtimestamp(self['update_time']).strftime(MYSQL_DATETIEM_STRFTIME)
        params = (self['zhihu_id'],self['url'],self['question_id'],
                  self['author_id'],self['content'],self['praise_num'],
                  self['comments_num'],create_time,update_time,
                  self['crawl_time'].strftime(MYSQL_DATETIEM_STRFTIME))
        return insert_sql,params
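
get_insert_sql above leans on two helpers that live elsewhere in the project and are not shown in this excerpt: the MYSQL_DATETIEM_STRFTIME format string and common.regex_match, which digs the number out of text such as "1,234 个回答". A minimal sketch of what they might look like (both definitions are assumptions):

# settings.py (assumed): datetime format for the MySQL DATETIME columns
MYSQL_DATETIEM_STRFTIME = "%Y-%m-%d %H:%M:%S"

# utils/common.py (assumed): extract the integer hidden in text like "1,234 个回答"
import re

def regex_match(text):
    match = re.search(r"(\d[\d,]*)", text)
    if match:
        return int(match.group(1).replace(",", ""))
    return 0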
        
        
        
# Insert data asynchronously through Twisted's adbapi connection pool
import MySQLdb
import MySQLdb.cursors
from twisted.enterprise import adbapi


class MysqlTwistedPipeline:
    def __init__(self,dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls,settings):
        dbparms = dict(
            host = settings["MYSQL_HOST"],
            db = settings["MYSQL_DBNAME"],
            user = settings["MYSQL_USER"],
            passwd = settings["MYSQL_PASSWORD"],
            charset = 'utf8',
            cursorclass = MySQLdb.cursors.DictCursor,
            use_unicode = True,
        )
        dbpool =  adbapi.ConnectionPool("MySQLdb",**dbparms)
        return cls(dbpool)

    def process_item(self, item, spider):
        query = self.dbpool.runInteraction(self.insert_sql, item)
        query.addErrback(self.handle_error, item, spider)  # handle insert errors
        return item

    def handle_error(self,failure,item,spider):
        print(failure)

    def insert_sql(self, cursor, item):
        insert_sql, params = item.get_insert_sql()
        cursor.execute(insert_sql, params)
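
For the pipeline to run, it has to be enabled in settings.py and the keys that from_settings reads must be defined; a sketch of the relevant entries (the module path and the values are assumptions for a project named ArticleSpider):

# settings.py (sketch)
ITEM_PIPELINES = {
    'ArticleSpider.pipelines.MysqlTwistedPipeline': 1,
}

MYSQL_HOST = "127.0.0.1"
MYSQL_DBNAME = "article_spider"
MYSQL_USER = "root"
MYSQL_PASSWORD = "root"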

  • The original video is from imooc (慕课网): 聚焦Python分布式爬虫必学框架Scrapy 打造搜索引擎
  • This post was written by XiaoJinZi (personal homepage); please credit the source when reposting
  • I am a student with limited experience; corrections and suggestions are welcome at 986209501@qq.com