[译] 如何从网页中获取主要文本信息

原文 Extracting meaningful content from raw HTML

解析HTML页面不算难事, 开源库Beautiful Soup就给你提供了紧凑直观的接口来处理网页, 让你使用喜欢的语言去实现. 但这也仅仅是第一步. 更有趣的问题是: 怎么才能把有用的信息(主题内容)从网页中抓取出来?

我在过去的几天里一直在尝试回答这个问题, 以下是我找到的.

Arc90 Readability

我最喜欢的方法是叫做 Arc90 Readability 的算法. 由 Arc90 Labs 开发, 目的是用来使网页阅读更佳舒适(例如在移动端上). 你可以找到它的chrome plugin. 整个项目放在Google Code上, 但更有趣的是它的算法, Nirmal Patel用python实现了这个算法, 源码在这里.

整个算法基于HTML-ID-names和HTML-CLASS-names生成了2个列表. 一个是包含了褒义IDs和CLASSes, 一个列表包含了贬义IDs和CLASSes. 如果一个tag有褒义的ID和CLASS, 那么它会得到额外的分数. 反之,如果它包含了贬义的ID和CLASS则丢分. 当我们算完了所有的tag的分数之后, 我们只需要渲染出那些得分较高的tags, 就得到了我们想要的内容.例子如下:

<div id="post"><h1>My post</h1><p>...</p></div>
<div class="footer"><a...>Contact</a></div>

第一个div tag含有一个褒义的ID (“id"=“post”), 所以很有可能这个tag包含了真正的内容(post). 然而, 第二行的tag footer是一个贬义的tag, 我们可以认为这个tag所包含的东西不是我们想要的真正内容. 基于以上, 我们可以得到如下方法:

在HTML源码里找到所有的p tag(即paragraph)
对于每一个paragraph段落:
把该段落的父级标签加入到列表中
并且把该父级标签初始化为0分
如果父级标签包含有褒义的属性, 加分
如果父级标签包含有贬义的属性, 减分
可选: 加入一些额外的标准, 比如限定tag的最短长度
找到得分最多的父级tag
渲染得分最多的父级tag

这里我参考了Nirmal Patel的代码, 写了一个简单的实现. 主要的区别是: 我在算法之前, 写了一个用来清除杂项的代码. 这样最终会得到一个没有脚本, 图片的文本内容, 就是我们想要的网页内容.

import re
from bs4 import BeautifulSoup
from bs4 import Comment
from bs4 import Tag
 
NEGATIVE = re.compile(".*comment.*|.*meta.*|.*footer.*|.*foot.*|.*cloud.*|.*head.*")
POSITIVE = re.compile(".*post.*|.*hentry.*|.*entry.*|.*content.*|.*text.*|.*body.*")
BR = re.compile("<br */? *>[ rn]*<br */? *>")
 
def extract_content_with_Arc90(html):
 
    soup = BeautifulSoup( re.sub(BR, "</p><p>", html) )
    soup = simplify_html_before(soup)
 
    topParent = None
    parents = []
    for paragraph in soup.findAll("p"):
        
        parent = paragraph.parent
        
        if (parent not in parents):
            parents.append(parent)
            parent.score = 0
 
            if (parent.has_key("class")):
                if (NEGATIVE.match(str(parent["class"]))):
                    parent.score -= 50
                elif (POSITIVE.match(str(parent["class"]))):
                    parent.score += 25
 
            if (parent.has_key("id")):
                if (NEGATIVE.match(str(parent["id"]))):
                    parent.score -= 50
                elif (POSITIVE.match(str(parent["id"]))):
                    parent.score += 25
 
        if (len( paragraph.renderContents() ) > 10):
            parent.score += 1
 
        # you can add more rules here!
 
    topParent = max(parents, key=lambda x: x.score)
    simplify_html_after(topParent)
    return topParent.text
 
def simplify_html_after(soup):
 
    for element in soup.findAll(True):
        element.attrs = {}    
        if( len( element.renderContents().strip() ) == 0 ):
            element.extract()
    return soup
 
def simplify_html_before(soup):
 
    comments = soup.findAll(text=lambda text:isinstance(text, Comment))
    [comment.extract() for comment in comments]
 
    # you can add more rules here!
 
    map(lambda x: x.replaceWith(x.text.strip()), soup.findAll("li"))    # tag to text
    map(lambda x: x.replaceWith(x.text.strip()), soup.findAll("em"))    # tag to text
    map(lambda x: x.replaceWith(x.text.strip()), soup.findAll("tt"))    # tag to text
    map(lambda x: x.replaceWith(x.text.strip()), soup.findAll("b"))     # tag to text
    
    replace_by_paragraph(soup, 'blockquote')
    replace_by_paragraph(soup, 'quote')
 
    map(lambda x: x.extract(), soup.findAll("code"))      # delete all
    map(lambda x: x.extract(), soup.findAll("style"))     # delete all
    map(lambda x: x.extract(), soup.findAll("script"))    # delete all
    map(lambda x: x.extract(), soup.findAll("link"))      # delete all
    
    delete_if_no_text(soup, "td")
    delete_if_no_text(soup, "tr")
    delete_if_no_text(soup, "div")
 
    delete_by_min_size(soup, "td", 10, 2)
    delete_by_min_size(soup, "tr", 10, 2)
    delete_by_min_size(soup, "div", 10, 2)
    delete_by_min_size(soup, "table", 10, 2)
    delete_by_min_size(soup, "p", 50, 2)
 
    return soup
 
def delete_if_no_text(soup, tag):
    
    for p in soup.findAll(tag):
        if(len(p.renderContents().strip()) == 0):
            p.extract()
 
def delete_by_min_size(soup, tag, length, children):
    
    for p in soup.findAll(tag):
        if(len(p.text) < length and len(p) <= children):
            p.extract()
 
def replace_by_paragraph(soup, tag):
    
    for t in soup.findAll(tag):
        t.name = “p"
        t.attrs = {}

空格符渲染

这个方法是主要思路很简单: 把HTML源码里面的所有tag (所有在<和>之间的代码) 用空格符代替. 当你再次渲染网页的时候, 所有的文本块(text block)依然是”块”状, 但是其他部分变成了包含很多空格符的分散的语句. 你剩下唯一要做的就是把文本快给找出来, 并且移除掉所有其他的内容.
我写了一个简单的实现.

#!/usr/bin/python
# -*- coding: utf-8 -*-
 
import requests
import re
 
from bs4 import BeautifulSoup
from bs4 import Comment
 
if __name__ == "__main__":
    
    html_string = requests.get('http://www.zdnet.com/windows-8-microsofts-new-coke-moment-7000014779/').text
    
    soup = BeautifulSoup(str( html_string ))
    
    map(lambda x: x.extract(), soup.findAll("code"))
    map(lambda x: x.extract(), soup.findAll("script"))
    map(lambda x: x.extract(), soup.findAll("pre"))
    map(lambda x: x.extract(), soup.findAll("style"))
    map(lambda x: x.extract(), soup.findAll("embed"))
    
    comments = soup.findAll(text=lambda text:isinstance(text, Comment))
        
    [comment.extract() for comment in comments]
    
    white_string = ""
    isIn = False;
    
    for character in soup.prettify():
 
        if character == "<":
            isIn = True;
        
        if isIn:
            white_string += " "
        else:
            white_string += character
            
        if character == ">":
            isIn = False;
            
    for string in white_string.split("           "):    # tune here!
        
        p = string.strip()
        p = re.sub(' +',' ', p)
        p = re.sub('n+',' ', p)
        
        if( len( p.strip() ) > 50):
            print p.strip()

这里有个问题是, 这个方法不是很通用. 你需要进行参数调优来找到最合适空格符长度来分割得到的字符串. 那些带有大量markup标记的网站通常要比普通网站有更多的空格符. 除此之外, 这个还是一个相当简单有效的方法.

开源社区的库
有句话说: 不要重新造轮子. 事实上有很多的开源库可以解决网页内容抓取的问题. 我最喜欢的是Boilerpipe. 你可以在这里找到它的web服务(demo)http://boilerpipe-web.appspot.com/, 以及它的JAVA实现https://code.google.com/p/boilerpipe/. 相比较上面2个简单的算法, 它真的很有用, 代码也相对更复杂. 但是如果你只把它当做黑盒子使用, 确实是一个非常好的解决方法.

Best regards,
Thomas Uhrig

最后编辑于：2017.12.06 01:38:47

人面猴
序言：七十年代末，一起剥皮案震惊了整个滨河市，随后出现的几起案子，更是在滨河造成了极大的恐慌，老刑警刘岩，带你破解...
沈念sama阅读 194,524评论 5赞 460
死咒
序言：滨河连续发生了三起死亡事件，死亡现场离奇诡异，居然都是意外死亡，警方通过查阅死者的电脑和手机，发现死者居然都...
沈念sama阅读 81,869评论 2赞 371
救了他两次的神仙让他今天三更去死
文/潘晓璐我一进店门，熙熙楼的掌柜王于贵愁眉苦脸地迎上来，“玉大人，你说我怎么就摊上这事。” “怎么了？”我有些...
开封第一讲书人阅读 141,813评论 0赞 320
道士缉凶录：失踪的卖姜人
文/不坏的土叔我叫张陵，是天一观的道长。经常有香客问我，道长，这世上最难降的妖魔是什么？我笑而不...
开封第一讲书人阅读 52,210评论 1赞 263
港岛之恋（遗憾婚礼）
正文为了忘掉前任，我火速办了婚礼，结果婚礼上，老公的妹妹穿的比我还像新娘。我一直安慰自己，他们只是感情好，可当我...
茶点故事阅读 61,085评论 4赞 355
恶毒庶女顶嫁案：这布局不是一般人想出来的
文/花漫我一把揭开白布。她就那样静静地躺着，像睡着了一般。火红的嫁衣衬着肌肤如雪。梳的纹丝不乱的头发上，一...
开封第一讲书人阅读 46,117评论 1赞 272
城市分裂传说
那天，我揣着相机与录音，去河边找鬼。笑死，一个胖子当着我的面吹牛，可吹牛的内容都是我干的。我是一名探鬼主播，决...
沈念sama阅读 36,533评论 3赞 381
双鸳鸯连环套：你想象不到人心有多黑
文/苍兰香墨我猛地睁开眼，长吁一口气：“原来是场噩梦啊……” “哼！你这毒妇竟也来了？” 一声冷哼从身侧响起，我...
开封第一讲书人阅读 35,219评论 0赞 253
万荣杀人案实录
序言：老挝万荣一对情侣失踪，失踪者是张志新（化名）和其女友刘颖，没想到半个月后，有当地人在树林里发现了一具尸体，经...
沈念sama阅读 39,487评论 1赞 290
护林员之死
正文独居荒郊野岭守林人离奇死亡，尸身上长有42处带血的脓包…… 初始之章·张勋以下内容为张勋视角年9月15日...
茶点故事阅读 34,582评论 2赞 309
白月光启示录
正文我和宋清朗相恋三年，在试婚纱的时候发现自己被绿了。大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
茶点故事阅读 36,362评论 1赞 326
活死人
序言：一个原本活蹦乱跳的男人离奇死亡，死状恐怖，灵堂内的尸体忽然破棺而出，到底是诈尸还是另有隐情，我是刑警宁泽，带...
沈念sama阅读 32,218评论 3赞 312
日本核电站爆炸内幕
正文年R本政府宣布，位于F岛的核电站，受9级特大地震影响，放射性物质发生泄漏。R本人自食恶果不足惜，却给世界环境...
茶点故事阅读 37,589评论 3赞 299
男人毒药：我在死后第九天来索命
文/蒙蒙一、第九天我趴在偏房一处隐蔽的房顶上张望。院中可真热闹，春花似锦、人声如沸。这庄子的主人今日做“春日...
开封第一讲书人阅读 28,899评论 0赞 17
一桩弑父案，背后竟有这般阴谋
文/苍兰香墨我抬头看了看天上的太阳。三九已至，却和暖如春，着一层夹袄步出监牢的瞬间，已是汗流浃背。一阵脚步声响...
开封第一讲书人阅读 30,176评论 1赞 250
情欲美人皮
我被黑心中介骗来泰国打工，没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留，地道东北人。一个月前我还...
沈念sama阅读 41,503评论 2赞 341
代替公主和亲
正文我出身青楼，却偏偏与公主长得像，于是被迫代替她去往敌国和亲。传闻我的和亲对象是个残疾皇子，可洞房花烛夜当晚...
茶点故事阅读 40,707评论 2赞 335

[译] 如何从网页中获取主要文本信息

Arc90 Readability

空格符渲染

推荐阅读更多精彩内容