Crawling Search Engines with Python 3

Preface

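In red-team/blue-team exercises, initial reconnaissance is indispensable, and what it yields is essentially a set of domain names and IP addresses. Visiting those domains directly often returns status codes such as 404 or 403, so using a search engine to locate a system's entry point is a good alternative. With many domains, though, searching them by hand one at a time is tedious, so a script that crawls the search engines automatically is needed.
Eventually I found a URL collector written by another researcher on GitHub: https://github.com/MikoSecSoS/GetURLs
The code is simple, but it did not behave quite as hoped in actual testing, so I made some small changes to it.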

Crawling Microsoft Bing

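Using that code as a reference, I changed the following:

Use the multiprocessing module and switch to the apply_async method, which is asynchronous and non-blocking;
Read the domains from a file instead of taking a single keyword, and collect each domain in bulk using only the site: syntax;
Adjust the Bing search URL that the GET request is sent to;
Automatically save the results to a txt file named after the current time.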
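
As a rough illustration of the first point, the sketch below shows how apply_async hands work to a process pool without blocking the caller. The names describe_task and run_all are made up for this example, and the worker only formats a string instead of sending a request.

```python
from multiprocessing import Pool


def describe_task(word, page):
    # Stand-in for the real download step: report what would be fetched.
    return "{} page {}".format(word, page)


def run_all(word, pages, processes=2):
    with Pool(processes) as pool:
        # apply_async returns an AsyncResult immediately; the call itself
        # runs in one of the worker processes.
        pending = [pool.apply_async(describe_task, args=(word, p)) for p in pages]
        # get() waits for that task and re-raises any exception it hit.
        return [result.get(timeout=30) for result in pending]


if __name__ == "__main__":
    print(run_all("site:example.com", range(1, 5)))
```

Passing the query as an argument, rather than reading it from a global, keeps the worker usable with any multiprocessing start method.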

The code is as follows:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# code by CSeroad

import os
import re
import sys
import time
import requests

from optparse import OptionParser
from multiprocessing import Pool

def download(filename, datas):
    # Append one record per line. Mode "a" creates the file when it is
    # missing, and UTF-8 keeps non-ASCII titles writable on every platform.
    filename = filename.replace("/", "_")
    with open(filename, "a", encoding="utf-8") as f:
        for data in datas:
            f.write(str(data) + "\n")

class BingSpider:

    @staticmethod
    def getUrls(word, page):
        # The query is passed in explicitly: worker processes started with
        # "spawn" (the default on Windows) do not inherit globals set in __main__.
        now_time = time.strftime('%Y-%m-%d-%H', time.localtime(time.time()))
        hd = {
            "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
            "accept-language": "zh-CN,zh;q=0.9",
            "alexatoolbar-alx_ns_ph": "AlexaToolbar/alx-4.0.3",
            "cache-control": "max-age=0",
            "upgrade-insecure-requests": "1",
            "cookie": "DUP=Q=axt7L5GANVktBKOinLxGuw2&T=361645079&A=2&IG=8C06CAB921F44B4E8AFF611F53B03799; _EDGE_V=1; MUID=0E843E808BEA618D13AC33FD8A716092; SRCHD=AF=NOFORM; SRCHUID=V=2&GUID=CADDA53D4AD041148FEB9D0BF646063A&dmnchg=1; MUIDB=0E843E808BEA618D13AC33FD8A716092; ISSW=1; ENSEARCH=BENVER=1; SerpPWA=reg=1; _EDGE_S=mkt=zh-cn&ui=zh-cn&SID=252EBA59AC756D480F67B727AD5B6C22; SL_GWPT_Show_Hide_tmp=1; SL_wptGlobTipTmp=1; SRCHUSR=DOB=20190616&T=1560789192000; _FP=hta=on; BPF=X=1; SRCHHPGUSR=CW=1341&CH=293&DPR=1&UTC=480&WTS=63696385992; ipv6=hit=1560792905533&t=4; _SS=SID=252EBA59AC756D480F67B727AD5B6C22&HV=1560790599",
            "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"
        }
        filename = now_time + ".txt"
        # Bing's "first" is the 1-based index of the first result, not a page
        # number; 10 results per page is assumed here.
        first = (page - 1) * 10 + 1
        url = "https://cn.bing.com/search?q={}&first={}&FORM=PERE".format(word, first)
        print(url)
        req = requests.get(url, headers=hd, timeout=10)
        if "There are no results for" in req.text:
            return
        urls_titles = re.findall("<h2><a.*?href=\"(.*?)\".*?>(.*?)</a></h2>",req.text)
        data = []
        for url, title in urls_titles:
            title = title.replace("<strong>", "").replace("</strong>", "")
            data.append({
                "title": title,
                "url": url
            })
            print(title, url)
        download(filename, data)

    def main(self):
        pool = Pool(5)
        for i in range(1,5):
            # apply_async returns at once; error_callback prints worker
            # exceptions that would otherwise be dropped silently.
            pool.apply_async(func=self.getUrls, args=(self.word, i), error_callback=print)
        pool.close()
        pool.join()


if __name__ == "__main__":
    parser = OptionParser("bingSpider.py -f words.txt")
    parser.add_option("-f", "--file",action="store",type="string",dest="file",help="words.txt")
    (options, args) = parser.parse_args()
    if options.file:
        file = options.file
        with open(file,'r') as f:
            for line in f.readlines():
                word = line.strip()
                if not word:
                    continue  # skip blank lines
                word = "site:"+word
                print("\033[1;37;40m"+word+"\033[0m")
                bingSpider = BingSpider()
                bingSpider.word = word
                bingSpider.main()
    else:
        parser.error('incorrect number of arguments')
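
Bing's first parameter counts results rather than pages, which is easy to get wrong when looping over page numbers. The following is a minimal sketch of the mapping, assuming 10 results per page; bing_url is a hypothetical helper, not part of the script above, and quote_plus is used to escape the query.

```python
from urllib.parse import quote_plus

RESULTS_PER_PAGE = 10  # assumed page size


def bing_url(word, page):
    # Page 1 starts at result 1, page 2 at result 11, and so on.
    first = (page - 1) * RESULTS_PER_PAGE + 1
    return "https://cn.bing.com/search?q={}&first={}&FORM=PERE".format(
        quote_plus(word), first
    )


if __name__ == "__main__":
    for page in range(1, 5):
        print(bing_url("site:example.com", page))
```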

Example run:
Put the collected domains in words.txt, then run python3 bingSpider.py -f words.txt.

[Screenshot: bingSpider.py output]
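
The input file is simply one domain per line. As a sketch of how those lines could be turned into site: queries while ignoring blank lines and repeated domains, consider the hypothetical helper site_queries below (it is not part of bingSpider.py):

```python
def site_queries(lines):
    """Turn raw lines into de-duplicated "site:" queries, keeping order."""
    seen = set()
    queries = []
    for line in lines:
        domain = line.strip()
        if not domain or domain in seen:
            continue  # skip blank lines and repeated domains
        seen.add(domain)
        queries.append("site:" + domain)
    return queries


if __name__ == "__main__":
    sample = ["api.example.com\n", "\n", "  www.example.com  ", "api.example.com"]
    print(site_queries(sample))
```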

Crawling Google

Bing has its own strengths, but Google is more powerful.
googleSpider.py can be written along the same lines as bingSpider.py.
Crawling Google has a few interesting pitfalls, which are noted here:

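Send a fairly complete set of request headers when crawling Google, to avoid being challenged for verification;
Google is only reachable through a proxy from my network, so the script sets proxies as well and sends its requests through a session;
Change the proxy port to match your own setup (SOCKS support in requests needs the optional PySocks dependency, installed with pip install requests[socks]):
session.proxies = {'http': 'socks5://127.0.0.1:1086','https': 'socks5://127.0.0.1:1086'}
A sleep was added to the multiprocess crawl, again to avoid Google's CAPTCHA;
The regular expression applied to the response also changed: results now sit in a div with class="yuRUbf".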
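
The proxy setup in the notes above can be sketched as a small helper so that the port lives in one place. socks5_proxies and make_session are hypothetical names, and port 1086 is just the value used in this article. The socks5h scheme, which requests also accepts, asks the proxy to resolve host names as well; whether that is needed depends on your network.

```python
def socks5_proxies(host="127.0.0.1", port=1086, remote_dns=False):
    # "socks5h" makes the proxy resolve DNS; "socks5" resolves locally.
    scheme = "socks5h" if remote_dns else "socks5"
    proxy = "{}://{}:{}".format(scheme, host, port)
    return {"http": proxy, "https": proxy}


def make_session(port=1086):
    # Imported here so the helper above works without requests installed.
    import requests  # third-party; SOCKS needs `pip install requests[socks]`

    session = requests.Session()
    session.proxies = socks5_proxies(port=port)
    return session


if __name__ == "__main__":
    print(socks5_proxies(port=1080))
```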

The code is as follows:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# code by CSeroad

import os
import re
import sys
import time
import requests

from optparse import OptionParser
from multiprocessing import Pool



def download(filename, datas):
    # Append one record per line. Mode "a" creates the file when it is
    # missing, and UTF-8 keeps non-ASCII titles writable on every platform.
    filename = filename.replace("/", "_")
    with open(filename, "a", encoding="utf-8") as f:
        for data in datas:
            f.write(str(data) + "\n")


class GoogleSpider:

    @staticmethod
    def getUrls(word, page):
        # The query is passed in explicitly: worker processes started with
        # "spawn" (the default on Windows) do not inherit globals set in __main__.
        now_time = time.strftime('%Y-%m-%d-%H', time.localtime(time.time()))
        hd = {
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
            "Accept-language": "zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2",
            "Referer": "https://www.google.com/",
            "Cache-control": "max-age=0",
            "Accept-Encoding": "gzip, deflate",
            "Upgrade-insecure-requests": "1",
            "Cookie": "GOOGLE_ABUSE_EXEMPTION=ID=15c1d08c9232025f:TM=1608695949:C=r:IP=52.231.34.93-:S=APGng0veF37IjfSixu2nMBKj7JRlk2A4dg",
            "User-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"
        }
        session = requests.session()
        # Change the port to match your local proxy.
        session.proxies = {'http': 'socks5://127.0.0.1:1086','https': 'socks5://127.0.0.1:1086'}
        filename = "google-" + now_time + ".txt"
        # Google's "start" is the 0-based offset of the first result, not a
        # page number; 10 results per page is assumed here.
        start = (page - 1) * 10
        url = "https://www.google.com/search?q={}&start={}".format(word, start)
        print("\033[1;37;40m"+url+"\033[0m")
        req = session.get(url, headers=hd, timeout=10)
        #print(req.text)
        # "No results" message on the Chinese-language results page.
        if "找不到和您查询的" in req.text:
            return
        urls_titles = re.findall("<div class=\"yuRUbf\"><a href=\"(.*?)\".*?><h3.*?>(.*?)</h3>", req.text)
        #print(urls_titles)
        data = []
        for url, title in urls_titles:
            data.append({
                "title": title,
                "url": url
            })
            print(title, url)
        download(filename, data)

    def main(self):
        pool = Pool(5)
        for i in range(1,5):
            # error_callback prints worker exceptions that apply_async would
            # otherwise drop silently.
            pool.apply_async(func=self.getUrls, args=(self.word, i), error_callback=print)
        # Pause once per domain before moving on, to lower the CAPTCHA risk.
        time.sleep(20)
        pool.close()
        pool.join()


if __name__ == "__main__":
    parser = OptionParser("googleSpider.py -f words.txt")
    parser.add_option("-f", "--file",action="store",type="string",dest="file",help="words.txt")
    (options, args) = parser.parse_args()
    if options.file:
        file = options.file
        with open(file,'r') as f:
            for line in f.readlines():
                word = line.strip()
                if not word:
                    continue  # skip blank lines
                word = "site:"+word
                print("\033[1;37;40m"+word+"\033[0m")
                googleSpider = GoogleSpider()
                googleSpider.word = word
                googleSpider.main()
    else:
        parser.error('incorrect number of arguments')
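
To make the regular expression easier to follow, here is a small sketch that runs the same pattern against a hand-written HTML fragment. The fragment is invented for illustration and only imitates the yuRUbf structure described above; real result pages are larger and their markup changes over time. Stripping inline tags and unescaping entities in the title is an extra step that the script above does not perform.

```python
import html
import re

RESULT_RE = re.compile(
    r'<div class="yuRUbf"><a href="(.*?)".*?><h3.*?>(.*?)</h3>'
)


def extract_results(page_html):
    results = []
    for url, title in RESULT_RE.findall(page_html):
        # Drop inline tags such as <b>, then decode entities like &amp;.
        clean_title = html.unescape(re.sub(r"<[^>]+>", "", title))
        results.append((url, clean_title))
    return results


if __name__ == "__main__":
    sample = (
        '<div class="yuRUbf"><a href="https://example.com/login" data-x="1">'
        '<h3 class="LC20lb">Example <b>Login</b> &amp; Portal</h3></a></div>'
    )
    print(extract_results(sample))
```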

Testing it the same way:

[Screenshot: googleSpider.py output]
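
The sleep mentioned in the notes is there to keep the request rate low. One possible way to space requests out, sketched here under the assumption that a random pause between consecutive items is acceptable, is a small generator; throttled is a hypothetical helper and the delay values are arbitrary.

```python
import random
import time


def throttled(items, min_delay=2.0, max_delay=5.0, sleep=time.sleep):
    """Yield items one by one, pausing a random time between them."""
    for index, item in enumerate(items):
        if index:
            # No pause before the first item, a random one before the rest.
            sleep(random.uniform(min_delay, max_delay))
        yield item


if __name__ == "__main__":
    for page in throttled(range(1, 4), min_delay=0.1, max_delay=0.3):
        print("would fetch page", page)
```

The sleep parameter is injectable only so the helper can be exercised without actually waiting.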

Combining Bing and Google

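With both scripts modified, they are merged here into a single file, UrlSpider.py, which makes it easier to crawl both search engines in one run.
Points to note for the combined script: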

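Without a local proxy, Google cannot be crawled;
With a global proxy enabled locally, only Google can be crawled, not Bing;
Using PAC (automatic) proxy mode locally is recommended; just adjust session.proxies yourself for the Google requests.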
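
Rather than editing session.proxies in the source, the proxy could be taken from the command line and applied to Google only. The sketch below assumes a hypothetical --proxy option and a proxies_for helper; neither exists in UrlSpider.py, and argparse is used here in place of optparse.

```python
import argparse


def parse_cli(argv=None):
    parser = argparse.ArgumentParser(prog="UrlSpider.py")
    parser.add_argument("-f", "--file", required=True,
                        help="file with one domain per line")
    parser.add_argument("--proxy", default="socks5://127.0.0.1:1086",
                        help="proxy URL, used for Google requests only")
    return parser.parse_args(argv)


def proxies_for(engine, proxy):
    # Only Google gets an explicit proxy; Bing gets none from this helper.
    if engine == "google":
        return {"http": proxy, "https": proxy}
    return {}


if __name__ == "__main__":
    args = parse_cli(["-f", "words.txt"])
    print(args.file, proxies_for("google", args.proxy), proxies_for("bing", args.proxy))
```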

The code is as follows:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# code by CSeroad

import os
import re
import sys
import time
import requests

from optparse import OptionParser
from multiprocessing import Pool

banner = '''
  ____ ____                           _
 / ___/ ___|  ___ _ __ ___   __ _  __| |
| |   \___ \ / _ \ '__/ _ \ / _` |/ _` |
| |___ ___) |  __/ | | (_) | (_| | (_| |
 \____|____/ \___|_|  \___/ \__,_|\__,_|

'''

def download(filename, datas):
    # Append one record per line. Mode "a" creates the file when it is
    # missing, and UTF-8 keeps non-ASCII titles writable on every platform.
    filename = filename.replace("/", "_")
    with open(filename, "a", encoding="utf-8") as f:
        for data in datas:
            f.write(str(data) + "\n")

class BingSpider:

    @staticmethod
    def getUrls(word, page):
        # The query is passed in explicitly: worker processes started with
        # "spawn" (the default on Windows) do not inherit globals set in __main__.
        now_time = time.strftime('%Y-%m-%d-%H', time.localtime(time.time()))
        hd = {
            "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
            "accept-language": "zh-CN,zh;q=0.9",
            "alexatoolbar-alx_ns_ph": "AlexaToolbar/alx-4.0.3",
            "cache-control": "max-age=0",
            "upgrade-insecure-requests": "1",
            "cookie": "DUP=Q=axt7L5GANVktBKOinLxGuw2&T=361645079&A=2&IG=8C06CAB921F44B4E8AFF611F53B03799; _EDGE_V=1; MUID=0E843E808BEA618D13AC33FD8A716092; SRCHD=AF=NOFORM; SRCHUID=V=2&GUID=CADDA53D4AD041148FEB9D0BF646063A&dmnchg=1; MUIDB=0E843E808BEA618D13AC33FD8A716092; ISSW=1; ENSEARCH=BENVER=1; SerpPWA=reg=1; _EDGE_S=mkt=zh-cn&ui=zh-cn&SID=252EBA59AC756D480F67B727AD5B6C22; SL_GWPT_Show_Hide_tmp=1; SL_wptGlobTipTmp=1; SRCHUSR=DOB=20190616&T=1560789192000; _FP=hta=on; BPF=X=1; SRCHHPGUSR=CW=1341&CH=293&DPR=1&UTC=480&WTS=63696385992; ipv6=hit=1560792905533&t=4; _SS=SID=252EBA59AC756D480F67B727AD5B6C22&HV=1560790599",
            "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"
        }
        filename = "biying-" + now_time + ".txt"
        # Bing's "first" is the 1-based index of the first result, not a page
        # number; 10 results per page is assumed here.
        first = (page - 1) * 10 + 1
        url = "https://cn.bing.com/search?q={}&first={}&FORM=PERE".format(word, first)
        print("\033[1;37;40m"+url+"\033[0m")
        req = requests.get(url, headers=hd, timeout=10)
        if "There are no results for" in req.text:
            return
        urls_titles = re.findall("<h2><a.*?href=\"(.*?)\".*?>(.*?)</a></h2>",req.text)
        #print(urls_titles)
        data = []
        for url, title in urls_titles:
            title = title.replace("<strong>", "").replace("</strong>", "")
            data.append({
                "title": title,
                "url": url
            })
            print(title, url)
        download(filename, data)


    def main(self):
        pool = Pool(5)
        for i in range(1,5):
            # error_callback prints worker exceptions that apply_async would
            # otherwise drop silently.
            pool.apply_async(func=self.getUrls, args=(self.word, i), error_callback=print)
        pool.close()
        pool.join()


class GoogleSpider:

    @staticmethod
    def getUrls(word, page):
        now_time = time.strftime('%Y-%m-%d-%H', time.localtime(time.time()))
        hd = {
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
            "Accept-language": "zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2",
            "Referer": "https://www.google.com/",
            "Cache-control": "max-age=0",
            "Accept-Encoding": "gzip, deflate",
            "Upgrade-insecure-requests": "1",
            "Cookie": "GOOGLE_ABUSE_EXEMPTION=ID=15c1d08c9232025f:TM=1608695949:C=r:IP=52.231.34.93-:S=APGng0veF37IjfSixu2nMBKj7JRlk2A4dg",
            "User-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"
        }
        session = requests.session()
        # Change the port to match your local proxy.
        session.proxies = {'http': 'socks5://127.0.0.1:1086','https': 'socks5://127.0.0.1:1086'}
        filename = "google-" + now_time + ".txt"
        # Google's "start" is the 0-based offset of the first result, not a
        # page number; 10 results per page is assumed here.
        start = (page - 1) * 10
        url = "https://www.google.com/search?q={}&start={}".format(word, start)
        print("\033[1;37;40m"+url+"\033[0m")
        req = session.get(url, headers=hd, timeout=10)
        #print(req.text)
        # "No results" message on the Chinese-language results page.
        if "找不到和您查询的" in req.text:
            return
        urls_titles = re.findall("<div class=\"yuRUbf\"><a href=\"(.*?)\".*?><h3.*?>(.*?)</h3>", req.text)
        #print(urls_titles)
        data = []
        for url, title in urls_titles:
            data.append({
                "title": title,
                "url": url
            })
            print(title, url)
        download(filename, data)

    def main(self):
        pool = Pool(5)
        for i in range(1,6):
            pool.apply_async(func=self.getUrls, args=(self.word, i), error_callback=print)
        # Pause once per domain before moving on, to lower the CAPTCHA risk.
        time.sleep(20)
        pool.close()
        pool.join()


if __name__ == "__main__":
    print(banner)
    parser = OptionParser("UrlSpider.py -f words.txt")
    parser.add_option("-f", "--file",action="store",type="string",dest="file",help="words.txt")
    (options, args) = parser.parse_args()
    if options.file:
        file = options.file
        with open(file,'r') as f:
            for line in f.readlines():
                word = line.strip()
                if not word:
                    continue  # skip blank lines
                word = "site:"+word
                print("\033[1;37;40m"+word+"\033[0m")
                bingSpider = BingSpider()
                bingSpider.word = word
                bingSpider.main()
                googleSpider = GoogleSpider()
                googleSpider.word = word
                googleSpider.main()
    else:
        parser.error('incorrect number of arguments')

[Screenshot: UrlSpider.py output]

Processing the Results

When the script finishes, it leaves behind txt files.

[Screenshot: generated txt files]

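Each line in them is the string form of a Python dict with two keys, title and url.
To extract the URLs, trim them to their sub-directories and remove duplicates, the formatUrls.py script below is also needed.

The code is as follows: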

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import os
import sys

def formatUrls(oldfilename,newfilename):
    if not os.path.exists(oldfilename):
        print("File not found: " + oldfilename)
        return
    marker = "'url': '"
    urls = set()
    with open(oldfilename, "r", encoding="utf-8") as file:
        for line in file:
            line = line.rstrip("\n")
            if marker not in line:
                continue  # skip lines that are not result records
            # A record looks like {'title': '...', 'url': '...'}: take what
            # follows the marker and drop the trailing "'}".
            url = line[line.index(marker) + len(marker):-2]
            print(url)
            urls.add(url)
    #print(urls)
    with open(newfilename, 'a+', encoding="utf-8") as f:
        for url in urls:
            f.write(url+'\n')


if __name__ == "__main__":
    if len(sys.argv) == 3:
        oldfilename=sys.argv[1]
        newfilename=sys.argv[2]
        formatUrls(oldfilename,newfilename)
    else:
        print('Usage: python3 formatUrls.py google-2020-12-23-13.txt result.txt')
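
The listing above extracts and de-duplicates the URLs by slicing each line. As an alternative sketch, each saved line can be parsed back into a dict with ast.literal_eval and the URL reduced to its directory with urllib.parse. directory_of and unique_directories are hypothetical helpers, and "trim to the sub-directory" is interpreted here as dropping the file name and query string.

```python
import ast
from urllib.parse import urlsplit


def directory_of(url):
    """Return scheme://host/dir/ for a URL, dropping file name and query."""
    parts = urlsplit(url)
    directory = parts.path[:parts.path.rfind("/") + 1] or "/"
    return "{}://{}{}".format(parts.scheme, parts.netloc, directory)


def unique_directories(lines):
    seen = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # literal_eval only accepts Python literals, so it is safe to use
        # on the str(dict) lines written by the spiders.
        record = ast.literal_eval(line)
        seen.add(directory_of(record["url"]))
    return sorted(seen)


if __name__ == "__main__":
    sample = [
        "{'title': 'Login', 'url': 'https://example.com/admin/login.php?x=1'}\n",
        "{'title': 'Home', 'url': 'https://example.com'}\n",
        "{'title': 'Admin', 'url': 'https://example.com/admin/'}\n",
    ]
    for entry in unique_directories(sample):
        print(entry)
```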

[Screenshot: formatUrls.py output]

Summary

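Change session.proxies to match your own proxy. In testing, the scripts worked on both macOS and Linux; exceptions may occur on Windows.
Corrections are welcome; feel free to leave a comment.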

Last edited: 2021.01.12 15:50:24
