Chapter 52: Text Data Preprocessing

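Whether in deep learning or in natural language processing, one important topic is how to turn natural language into feature vectors that a computer can work with. Text preprocessing usually runs through word segmentation -> word embedding -> feature extraction, and ends by building a matrix of vectors that represents the content of the text.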


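As this shows, word embedding is a key concept. A word embedding is also called a word vector. Before looking at the concept in depth, consider a few examples:

  • After buying a product or staying at a hotel, customers are invited to write a review that rates their satisfaction with the service.
  • A search engine is queried with a combination of several words.
  • Some blogs and news sites automatically attach relevant tags to their posts.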




Convolution -> mathematics + deep learning = gradients + differentiation, and so on.






The AG News dataset sorts news articles into four classes:

  • World
  • Sports
  • Business
  • Sci/Tech

tensorflow_datasets provides the AG News dataset. The code is as follows:

import tensorflow_datasets as tfds

def setup():
    # Load the train and test splits as (text, label) pairs, together with the dataset metadata
    (trains, tests), metas = tfds.load("ag_news_subset", split = [tfds.Split.TRAIN, tfds.Split.TEST], with_info = True, as_supervised = True, data_dir = "/tmp/")

    class_names = metas.features["label"].names
    number_classes = metas.features["label"].num_classes
    number_trains = metas.splits["train"].num_examples
    number_tests = metas.splits["test"].num_examples
    buffer_size = 1000
    batch_size = 50
    # Shuffle and batch the training split, then convert it to NumPy batches
    trains = trains.shuffle(buffer_size)
    trains = trains.batch(batch_size).prefetch(1)
    trains = tfds.as_numpy(trains)
    train_news = []
    train_labels = []
    for news, label in trains:
        # Collect each batch (loop body assumed; it matches the output shown below)
        train_news.append(news)
        train_labels.append(label)
    tests = tests.batch(batch_size).prefetch(1)
    tests = tfds.as_numpy(tests)
    test_news = []
    test_labels = []
    for news, label in tests:
        test_news.append(news)
        test_labels.append(label)
    return train_news, train_labels, test_news, test_labels

def main():
    train_news, train_labels, test_news, test_labels = setup()
    for (label, news) in zip(train_labels[0: 2], train_news[: 2]):
        print(f"label = {label}, news = {news}")

if __name__ == "__main__":
    main()


label = [3 0 0 1 1 3 2 0 1 3 2 0 0 2 1 1 3 1 3 3 2 2 2 0 0 0 2 2 0 1 3 3 2 0 0 1 2
 3 1 0 2 0 0 1 1 3 0 0 0 1], news = [b'Sony BMG - aka  #39;Bony #39; - the merged music label is in talks with Grokster, the P2P software company has confirmed. Negotiations are believed to be focused on the development of a new, label-friendly P2P network.'
 b'AP - U.N. Secretary-General Kofi Annan, fending off Republican demands for his resignation over alleged corruption, said Thursday he will expand U.N. support for Iraqi elections if need be. He said he was not offended that President Bush did not ask to see him during this visit to Washington.'
 b'World News: Islamabad, Oct 30 : Pakistan President Pervez Musharraf Saturday said a solution to the dragging Kashmir dispute could be found only if India and Pakistan agreed to move beyond their stated positions on the issue.'
 b'If you believe Dale Earnhardt Jr. is hurting now, wait until late next month, should he fail to recoup the 25 points he lost for uttering a naughty word on national television following last week #39;s emotional victory at Talladega (Ala.'
 b'Olympic rowing hero James Cracknell has revealed a lack of hunger was behind his decision to take a year away from the sport. The 32-year-old, who was part of the British gold-medal winning coxless fours in '
 b'The PC maker is eyeing in-home services to go with its new consumer electronics line.'
 b'It #39;s another level of security for America Online - but users will have to pay extra for it. AOL is offering an optional log-on service that will require more than just a password to get onto the service.'
 b"NEW YORK - Gary Sheffield hit an RBI single, and the New York Yankees took a 1-0 lead over Pedro Martinez and the Boston Red Sox in Game 2 of the AL championship series Wednesday night.    New York starter Jon Lieber allowed only one hit and a walk, stifling Boston's sluggers just as Mike Mussina did the night before..."
 b'England captain Michael Vaughan said on Monday that he expected an evenly matched series against South Africa but suggested his bowlers could hold the key.'
 b'Hoping to throw some tacks in the road to slow Linux momentum, Microsoft during the next year will redouble its efforts to woo more corporate users migrating from Unix to the open source OS.'
 b'Corus, Britain #39;s biggest steelmaker, yesterday reported its first profits since the merger of British Steel and Dutch rival Hoogovens five years ago.'
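As the b'...' prefixes in this output show, the batches returned by tfds.as_numpy hold the news text as bytes objects. A minimal sketch of turning such a batch into Python strings before any cleaning follows. The helper name decode_batch and the sample batch are illustrative only, and replacing the " #39;" residue with an apostrophe is an assumption about what one might want, not part of the original code:

```python
def decode_batch(batch):
    """Decode an iterable of bytes objects into a list of str (hypothetical helper)."""
    texts = []
    for item in batch:
        text = item.decode("utf-8")
        # Assumption: " #39;" is a stripped HTML entity that stands for an apostrophe
        text = text.replace(" #39;", "'")
        texts.append(text)
    return texts

sample = [
    b"It #39;s another level of security for America Online",
    b"The PC maker is eyeing in-home services",
]
print(decode_batch(sample))
```

Decoding first keeps the later steps working on ordinary strings rather than on bytes.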



Figure 1: Preview of train.csv



import csv

def setup():
    with open("../../Shares/ag_news_csv/train.csv", "r") as handler:
        trains = csv.reader(handler)
        for line in trains:
            # Print each row (loop body assumed; it matches the output shown below)
            print(line)

def main():
    setup()

if __name__ == "__main__":
    main()


['3', 'Wall St. Bears Claw Back Into the Black (Reuters)', "Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again."]
['3', 'Carlyle Looks Toward Commercial Aerospace (Reuters)', 'Reuters - Private investment firm Carlyle Group,\\which has a reputation for making well-timed and occasionally\\controversial plays in the defense industry, has quietly placed\\its bets on another part of the market.']
['3', "Oil and Economy Cloud Stocks' Outlook (Reuters)", 'Reuters - Soaring crude prices plus worries\\about the economy and the outlook for earnings are expected to\\hang over the stock market next week during the depth of the\\summer doldrums.']
['3', 'Iraq Halts Oil Exports from Main Southern Pipeline (Reuters)', 'Reuters - Authorities have halted oil export\\flows from the main pipeline in southern Iraq after\\intelligence showed a rebel militia could strike\\infrastructure, an oil official said on Saturday.']
['3', 'Oil prices soar to all-time record, posing new menace to US economy (AFP)', 'AFP - Tearaway world oil prices, toppling records and straining wallets, present a new economic menace barely three months before the US presidential elections.']


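import csv

def setup():
    with open("../../Shares/ag_news_csv/train.csv", "r") as handler:
        labels = []
        titles = []
        descriptions = []
        trains = csv.reader(handler)
        for line in trains:
            # Each row is [label, title, description] (loop body assumed from the output below)
            labels.append(line[0])
            titles.append(line[1])
            descriptions.append(line[2])
        return labels, titles, descriptions

def main():
    labels, titles, descriptions = setup()
    # The titles are printed twice, as in the output shown below
    print(labels[: 5], titles[: 5], titles[: 5])

if __name__ == "__main__":
    main()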


['3', '3', '3', '3', '3'] ['Wall St. Bears Claw Back Into the Black (Reuters)', 'Carlyle Looks Toward Commercial Aerospace (Reuters)', "Oil and Economy Cloud Stocks' Outlook (Reuters)", 'Iraq Halts Oil Exports from Main Southern Pipeline (Reuters)', 'Oil prices soar to all-time record, posing new menace to US economy (AFP)'] ['Wall St. Bears Claw Back Into the Black (Reuters)', 'Carlyle Looks Toward Commercial Aerospace (Reuters)', "Oil and Economy Cloud Stocks' Outlook (Reuters)", 'Iraq Halts Oil Exports from Main Southern Pipeline (Reuters)', 'Oil prices soar to all-time record, posing new menace to US economy (AFP)']
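Each row of train.csv holds a label, a title and a description. The labels in the CSV file run from 1 to 4, whereas the tensorflow_datasets version uses 0 to 3. A small sketch of mapping the CSV labels to class names, assuming they follow the class order listed earlier (World, Sports, Business, Sci/Tech); the in-memory sample row and the names CLASS_NAMES and parse_rows are illustrative:

```python
import csv
import io

CLASS_NAMES = ["World", "Sports", "Business", "Sci/Tech"]

def parse_rows(handler):
    """Yield (class_name, title, description) for each CSV row."""
    for label, title, description in csv.reader(handler):
        # Assumption: CSV labels are 1-based, so shift by one to index the list
        yield CLASS_NAMES[int(label) - 1], title, description

sample = io.StringIO(
    '"3","Wall St. Bears Claw Back Into the Black (Reuters)",'
    '"Reuters - Short-sellers are seeing green again."\n'
)
for row in parse_rows(sample):
    print(row)
```

Under this assumption the rows labelled '3' above are Business news, which agrees with their content.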





import re
def purify(string, pattern = r"[^a-z0-9]", replacement = " "):
    string = re.sub(pattern = pattern, repl = replacement, string = string)
    return string


import re

def purify(string: str, pattern: str = r"[^a-z0-9]", replacement: str = " "):
    string = string.lower()
    # Replace every character that is not a lowercase letter or a digit
    string = re.sub(pattern = pattern, repl = replacement, string = string)
    # Replace consecutive spaces with a single space
    string = re.sub(pattern = r" +",  repl = replacement, string = string)
    # Trim the string
    string = string.strip()
    # Separate the string on spaces, which yields a list of words
    strings = string.split(" ")
    return strings
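To see what each step does, the function can be applied to one of the titles printed earlier. The snippet repeats the function in condensed form so that it runs on its own:

```python
import re

def purify(string, pattern=r"[^a-z0-9]", replacement=" "):
    string = string.lower()
    # Punctuation and other symbols become spaces
    string = re.sub(pattern=pattern, repl=replacement, string=string)
    # Runs of spaces collapse into one
    string = re.sub(pattern=r" +", repl=replacement, string=string)
    return string.strip().split(" ")

print(purify("Oil and Economy Cloud Stocks' Outlook (Reuters)"))
# ['oil', 'and', 'economy', 'cloud', 'stocks', 'outlook', 'reuters']
```

One edge case is worth knowing: an input without any letters or digits yields [''] rather than an empty list, because splitting an empty string on " " returns a list with one empty string.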


import csv
import re
import jax

def setup():
    with open("../../Shares/ag_news_csv/train.csv", "r") as handler:
        labels = []
        titles = []
        descriptions = []
        trains = csv.reader(handler)
        for line in trains:
            # Loop body assumed from the output below: labels become float32 scalars, text is tokenized
            labels.append(jax.numpy.float32(float(line[0])))
            titles.append(purify(line[1]))
            descriptions.append(purify(line[2]))
        return labels, titles, descriptions

def purify(string: str, pattern: str = r"[^a-z0-9]", replacement: str = " "):
    string = string.lower()
    string = re.sub(pattern = pattern, repl = replacement, string = string)
    # Replace consecutive spaces with a single space
    string = re.sub(pattern = r" +",  repl = replacement, string = string)
    # Trim the string
    string = string.strip()
    # Separate the string on spaces, which yields a list of words
    strings = string.split(" ")
    return strings

def main():
    labels, titles, descriptions = setup()
    # The titles are printed twice, as in the output shown below
    print(labels[: 5], titles[: 5], titles[: 5])

if __name__ == "__main__":
    main()


[Array(3., dtype=float32), Array(3., dtype=float32), Array(3., dtype=float32), Array(3., dtype=float32), Array(3., dtype=float32)] [['wall', 'st', 'bears', 'claw', 'back', 'into', 'the', 'black', 'reuters'], ['carlyle', 'looks', 'toward', 'commercial', 'aerospace', 'reuters'], ['oil', 'and', 'economy', 'cloud', 'stocks', 'outlook', 'reuters'], ['iraq', 'halts', 'oil', 'exports', 'from', 'main', 'southern', 'pipeline', 'reuters'], ['oil', 'prices', 'soar', 'to', 'all', 'time', 'record', 'posing', 'new', 'menace', 'to', 'us', 'economy', 'afp']] [['wall', 'st', 'bears', 'claw', 'back', 'into', 'the', 'black', 'reuters'], ['carlyle', 'looks', 'toward', 'commercial', 'aerospace', 'reuters'], ['oil', 'and', 'economy', 'cloud', 'stocks', 'outlook', 'reuters'], ['iraq', 'halts', 'oil', 'exports', 'from', 'main', 'southern', 'pipeline', 'reuters'], ['oil', 'prices', 'soar', 'to', 'all', 'time', 'record', 'posing', 'new', 'menace', 'to', 'us', 'economy', 'afp']]


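Looking at the tokenized text, each sample contains, besides the nouns and verbs that carry meaning, many words that have no real meaning of their own and can be removed without changing the sense, such as is, are and the. These words add little to a sentence, and because they occur so often they can distort the later word embedding analysis. To reduce the vocabulary to be processed and lower the complexity of the later programs, the stop words should be removed. This is usually done with the NLTK toolkit, which can be installed with the following command or from within an IDE: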

pip install nltk

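Installing NLTK alone is not enough to remove stop words; the NLTK stop word package must be downloaded as well. It is recommended to install it through the NLTK downloader console from the Python command line: type python in a shell to enter the Python prompt, then enter the following commands in order. The Config> d step is optional.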

Python 3.9.2 (default, Feb 28 2021, 17:03:44) 
[GCC 10.2.1 20210110] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import nltk
>>> nltk.download()
NLTK Downloader
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
Downloader> c

Data Server:
  - URL: <https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml>
  - 7 Package Collections Available
  - 113 Individual Packages Available

Local Machine:
  - Data directory: /home/jinhui/nltk_data

    s) Show Config   u) Set Server URL   d) Set Data Dir   m) Main Menu
Config> d
  New Directory> /tmp/

    s) Show Config   u) Set Server URL   d) Set Data Dir   m) Main Menu
Config> m

    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
Downloader> d

Download which package (l=list; x=cancel)?
  Identifier> stopwords
    Downloading package stopwords to /tmp/...
      Package stopwords is already up-to-date!

    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit


import ssl
import nltk

def purify_stop_words():
    # Fall back to an unverified HTTPS context where the download would otherwise fail on certificates
    try:
        _create_unverified_https_context = ssl._create_unverified_context
    except AttributeError:
        pass
    else:
        ssl._create_default_https_context = _create_unverified_https_context
    nltk.download("stopwords", download_dir = "/tmp/")
    # Let NLTK search the download directory as well
    nltk.data.path.append("/tmp/")
    stops = nltk.corpus.stopwords.words("english")
    return stops

def main():
    # Print the stop list (call assumed; it matches the output shown below)
    print(purify_stop_words())

if __name__ == "__main__":
    main()


[nltk_data] Downloading package stopwords to /tmp/...
[nltk_data]   Package stopwords is already up-to-date!
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
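With the list in hand, removing stop words is a membership test per token. A minimal sketch follows, using a small hand-written subset of the list above so that it runs without downloading anything. The name remove_stops is illustrative, and storing the words in a set is a choice made here for fast lookups:

```python
# A hand-picked subset of the NLTK English stop list, for illustration only
STOPS = {"the", "into", "and", "to", "from", "all"}

def remove_stops(tokens, stops=STOPS):
    """Keep only the tokens that are not stop words."""
    return [token for token in tokens if token not in stops]

tokens = ["wall", "st", "bears", "claw", "back", "into", "the", "black", "reuters"]
print(remove_stops(tokens))
# ['wall', 'st', 'bears', 'claw', 'back', 'black', 'reuters']
```

With the full NLTK list the same call would remove every word that appears in it.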




import csv
import re
import jax
import ssl
import nltk

def stop_words():
    # Fall back to an unverified HTTPS context where the download would otherwise fail on certificates
    try:
        _create_unverified_https_context = ssl._create_unverified_context
    except AttributeError:
        pass
    else:
        ssl._create_default_https_context = _create_unverified_https_context
    nltk.download("stopwords", download_dir = "/tmp/")
    # Let NLTK search the download directory as well
    nltk.data.path.append("/tmp/")
    stops = nltk.corpus.stopwords.words("english")
    return stops

def purify(string: str, pattern: str = r"[^a-z0-9]", replacement: str = " ", stops = stop_words()):
    string = string.lower()
    string = re.sub(pattern = pattern, repl = replacement, string = string)
    # Replace consecutive spaces with a single space
    string = re.sub(pattern = r" +",  repl = replacement, string = string)
    # Trim the string
    string = string.strip()
    # Separate the string on spaces, which yields a list of words
    strings = string.split(" ")
    # Drop the stop words
    strings = [word for word in strings if word not in stops]
    # Reduce each word to its stem
    strings = [nltk.PorterStemmer().stem(word) for word in strings]
    # Mark the start and the end of the sentence (the "eos" part is assumed from the output below)
    strings = ["bos"] + strings + ["eos"]
    return strings

def setup():
    with open("../../Shares/ag_news_csv/train.csv", "r") as handler:
        labels = []
        titles = []
        descriptions = []
        trains = csv.reader(handler)
        for line in trains:
            # Loop body assumed from the output below: labels become float32 scalars, text is cleaned
            labels.append(jax.numpy.float32(float(line[0])))
            titles.append(purify(line[1]))
            descriptions.append(purify(line[2]))
        return labels, titles, descriptions

def main():
    labels, titles, descriptions = setup()
    # The titles are printed twice, as in the output shown below
    print(labels[: 5], titles[: 5], titles[: 5])

if __name__ == "__main__":
    main()


[Array(3., dtype=float32), Array(3., dtype=float32), Array(3., dtype=float32), Array(3., dtype=float32), Array(3., dtype=float32)] [['bos', 'wall', 'st', 'bear', 'claw', 'back', 'black', 'reuter', 'eos'], ['bos', 'carlyl', 'look', 'toward', 'commerci', 'aerospac', 'reuter', 'eos'], ['bos', 'oil', 'economi', 'cloud', 'stock', 'outlook', 'reuter', 'eos'], ['bos', 'iraq', 'halt', 'oil', 'export', 'main', 'southern', 'pipelin', 'reuter', 'eos'], ['bos', 'oil', 'price', 'soar', 'time', 'record', 'pose', 'new', 'menac', 'us', 'economi', 'afp', 'eos']] [['bos', 'wall', 'st', 'bear', 'claw', 'back', 'black', 'reuter', 'eos'], ['bos', 'carlyl', 'look', 'toward', 'commerci', 'aerospac', 'reuter', 'eos'], ['bos', 'oil', 'economi', 'cloud', 'stock', 'outlook', 'reuter', 'eos'], ['bos', 'iraq', 'halt', 'oil', 'export', 'main', 'southern', 'pipelin', 'reuter', 'eos'], ['bos', 'oil', 'price', 'soar', 'time', 'record', 'pose', 'new', 'menac', 'us', 'economi', 'afp', 'eos']]


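  • Tokenization: split the sentence and store it as individual words or characters. In the text cleaning function this is what string.split(" ") does.
  • Normalization: bring words into a normalized form. The lower function and PorterStemmer do this work.
  • Rare Word Replacement: replace words that occur rarely. Words with a frequency below 5 are usually replaced with the special token <UNK>. This reduces noise and shrinks the dictionary. It is not used here because the words in the training and test sets are fairly concentrated.
  • Add <BOS><EOS>: add a start marker and an end marker to every sentence.
  • Long Sentence Cut-Off or Short Sentence Padding: cut sentences that are too long and pad sentences that are too short.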
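Rare word replacement and length fitting are not part of the chapter's own code, so here is a rough sketch of both in plain Python. The default threshold of 5 comes from the list above; the <PAD> token, the target length and the function names replace_rare and fit_length are assumptions made for illustration:

```python
from collections import Counter

def replace_rare(sentences, minimum=5, unknown="<UNK>"):
    """Replace words that occur fewer than `minimum` times in the corpus with `unknown`."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return [
        [word if counts[word] >= minimum else unknown for word in sentence]
        for sentence in sentences
    ]

def fit_length(sentence, length, padding="<PAD>"):
    """Cut a sentence to `length` tokens, or pad it on the right up to `length`."""
    return sentence[:length] + [padding] * max(0, length - len(sentence))

sentences = [["bos", "oil", "price", "soar", "eos"], ["bos", "oil", "economi", "eos"]]
# A threshold of 2 is used only because this toy corpus is tiny
print(replace_rare(sentences, minimum=2))
print([fit_length(sentence, 5) for sentence in sentences])
```

Cutting from the right drops the end marker of a sentence that is too long; whether to put it back after cutting is a design decision that depends on the model.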



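As preparation for text processing, this chapter introduced the AG News dataset, text cleaning and stop word removal, laying the groundwork for the word vector training that follows.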

