Whether in deep learning or natural language processing, a key topic is converting natural language into feature vectors that a computer can work with. Text preprocessing generally follows the pipeline tokenization -> word embedding -> feature extraction, ultimately building a matrix of vectors that represents the text content.
Word Embedding
Clearly, word embedding is an important concept. A word embedding (also called a word vector) is a numeric representation of a word. Before diving into the concept, consider a few examples:
- After buying a product or staying at a hotel, customers are invited to write reviews rating their satisfaction with the service.
- A few words are combined into a query for a search engine.
- Some blog or news sites automatically attach related tags to their articles.
These are all applications of text processing, where the goal is to use the text for sentiment analysis, synonym clustering, article classification, and tagging.
As readers, when we read or comment on an article, we can usually summarize fairly accurately what it is about and which way its opinions lean. But how does a computer do this? In the past, it may have been nothing more than matching database queries. Today, however, if you search for "convolution" in a short-video app and play a few related videos, content about mathematics and deep learning soon shows up in your recommendations. The computer has associated the word "convolution" with mathematics and deep learning.
Word embeddings arose from this need: they are numeric representations of text. With such representations and some computation, a computer can easily arrive at relations like
convolution -> mathematics + deep learning = gradients + differentiation, and so on.
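A relation like this can be sketched with toy vectors. Everything in the snippet below is illustrative only: the 3-dimensional vectors and the word list are hand-made for the classic king/queen analogy, not learned embeddings, which would come from training on a corpus.

```python
import numpy as np

# Hand-made 3-dimensional "embeddings", for illustration only;
# real word vectors would be learned from a corpus.
embeddings = {
    "man":   np.array([1.0, 0.0, 0.0]),
    "woman": np.array([0.0, 1.0, 0.0]),
    "king":  np.array([1.0, 0.0, 1.0]),
    "queen": np.array([0.0, 1.0, 1.0]),
    "apple": np.array([0.5, 0.5, 0.0]),
}

def cosine(u, v):
    # Cosine similarity: 1.0 means the vectors point the same way
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# king - man + woman lands closest to queen
target = embeddings["king"] - embeddings["man"] + embeddings["woman"]
candidates = [w for w in embeddings if w not in ("king", "man", "woman")]
nearest = max(candidates, key = lambda w: cosine(target, embeddings[w]))
print(nearest)  # queen
```

With learned embeddings the arithmetic is rarely this exact, but the nearest-neighbor search works the same way.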
This chapter focuses on word embeddings as a way of processing text. Through several embedding computations, it explains step by step how to obtain the corresponding word vectors and, as in earlier chapters, finishes with a hands-on text classification project to deepen understanding.
Text Processing
The following sections introduce text processing through a news classification dataset.
Dataset Preparation
The AG news classification dataset is provided by the academic community ComeToMyHead. It gathers more than one million news articles from over 2,000 different news sources, for non-commercial research in classification, clustering, ranking, search, and so on. Researchers later extracted 127,600 samples from it: 120,000 as a training set and 7,600 as a test set, with the following four classes:
- World
- Sports
- Business
- Sci/Tech
tensorflow_datasets provides the AG News dataset; the code is as follows:
import tensorflow_datasets as tfds

def setup():
    (trains, tests), metas = tfds.load("ag_news_subset", split = [tfds.Split.TRAIN, tfds.Split.TEST], with_info = True, as_supervised = True, data_dir = "/tmp/")
    class_names = metas.features["label"].names
    number_classes = metas.features["label"].num_classes
    number_trains = metas.splits["train"].num_examples
    number_tests = metas.splits["test"].num_examples
    buffer_size = 1000
    batch_size = 50
    trains = trains.shuffle(buffer_size)
    trains = trains.batch(batch_size).prefetch(1)
    trains = tfds.as_numpy(trains)
    train_news = []
    train_labels = []
    for news, label in trains:
        train_news.append(news)
        train_labels.append(label)
    tests = tests.batch(batch_size).prefetch(1)
    tests = tfds.as_numpy(tests)
    test_news = []
    test_labels = []
    for news, label in tests:
        test_news.append(news)
        test_labels.append(label)
    return train_news, train_labels, test_news, test_labels

def main():
    train_news, train_labels, test_news, test_labels = setup()
    for (label, news) in zip(train_labels[0: 2], train_news[: 2]):
        print(f"label = {label}, news = {news}")

if __name__ == "__main__":
    main()
Running this prints the following:
label = [3 0 0 1 1 3 2 0 1 3 2 0 0 2 1 1 3 1 3 3 2 2 2 0 0 0 2 2 0 1 3 3 2 0 0 1 2
3 1 0 2 0 0 1 1 3 0 0 0 1], news = [b'Sony BMG - aka #39;Bony #39; - the merged music label is in talks with Grokster, the P2P software company has confirmed. Negotiations are believed to be focused on the development of a new, label-friendly P2P network.'
b'AP - U.N. Secretary-General Kofi Annan, fending off Republican demands for his resignation over alleged corruption, said Thursday he will expand U.N. support for Iraqi elections if need be. He said he was not offended that President Bush did not ask to see him during this visit to Washington.'
b'World News: Islamabad, Oct 30 : Pakistan President Pervez Musharraf Saturday said a solution to the dragging Kashmir dispute could be found only if India and Pakistan agreed to move beyond their stated positions on the issue.'
b'If you believe Dale Earnhardt Jr. is hurting now, wait until late next month, should he fail to recoup the 25 points he lost for uttering a naughty word on national television following last week #39;s emotional victory at Talladega (Ala.'
b'Olympic rowing hero James Cracknell has revealed a lack of hunger was behind his decision to take a year away from the sport. The 32-year-old, who was part of the British gold-medal winning coxless fours in '
b'The PC maker is eyeing in-home services to go with its new consumer electronics line.'
b'It #39;s another level of security for America Online - but users will have to pay extra for it. AOL is offering an optional log-on service that will require more than just a password to get onto the service.'
b"NEW YORK - Gary Sheffield hit an RBI single, and the New York Yankees took a 1-0 lead over Pedro Martinez and the Boston Red Sox in Game 2 of the AL championship series Wednesday night. New York starter Jon Lieber allowed only one hit and a walk, stifling Boston's sluggers just as Mike Mussina did the night before..."
b'England captain Michael Vaughan said on Monday that he expected an evenly matched series against South Africa but suggested his bowlers could hold the key.'
b'Hoping to throw some tacks in the road to slow Linux momentum, Microsoft during the next year will redouble its efforts to woo more corporate users migrating from Unix to the open source OS.'
b'Corus, Britain #39;s biggest steelmaker, yesterday reported its first profits since the merger of British Steel and Dutch rival Hoogovens five years ago.'
…
Alternatively, the ag_news dataset files can be downloaded directly from https://huggingface.co/datasets/pietrolesci/ag_news/tree/main, or requested by email.
Opening the downloaded train.csv, you can see three columns: the label (class), the title, and the description (body). The title and body use "," and "." as sentence delimiters.
Because the dataset was collected and stored automatically by the community, it inevitably contains a fair amount of noise, such as special characters, so it needs cleaning before use.
The following code reads the dataset from the downloaded csv file.
import csv

def setup():
    with open("../../Shares/ag_news_csv/train.csv", "r") as handler:
        trains = csv.reader(handler)
        for line in trains:
            print(line)

def main():
    setup()

if __name__ == "__main__":
    main()
Running this prints the following:
['3', 'Wall St. Bears Claw Back Into the Black (Reuters)', "Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again."]
['3', 'Carlyle Looks Toward Commercial Aerospace (Reuters)', 'Reuters - Private investment firm Carlyle Group,\\which has a reputation for making well-timed and occasionally\\controversial plays in the defense industry, has quietly placed\\its bets on another part of the market.']
['3', "Oil and Economy Cloud Stocks' Outlook (Reuters)", 'Reuters - Soaring crude prices plus worries\\about the economy and the outlook for earnings are expected to\\hang over the stock market next week during the depth of the\\summer doldrums.']
['3', 'Iraq Halts Oil Exports from Main Southern Pipeline (Reuters)', 'Reuters - Authorities have halted oil export\\flows from the main pipeline in southern Iraq after\\intelligence showed a rebel militia could strike\\infrastructure, an oil official said on Saturday.']
['3', 'Oil prices soar to all-time record, posing new menace to US economy (AFP)', 'AFP - Tearaway world oil prices, toppling records and straining wallets, present a new economic menace barely three months before the US presidential elections.']
…
Since this is csv data, each row splits on commas by default. For convenient classification, each column can be stored in its own array; pandas could also be used if preferred. For the sake of speed in the subsequent operations, this chapter mainly sticks to native Python and NumPy functions.
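For reference, the pandas route mentioned above might look like the sketch below. The inline CSV snippet stands in for train.csv, and the column names are an assumption, since the file itself has no header row.

```python
import io
import pandas as pd

# Two rows in the same shape as train.csv (label, title, description);
# io.StringIO stands in for the real file path here.
sample = io.StringIO(
    '3,"Wall St. Bears Claw Back Into the Black (Reuters)","Reuters - Short-sellers are seeing green again."\n'
    '3,"Carlyle Looks Toward Commercial Aerospace (Reuters)","Reuters - Private investment firm Carlyle Group."\n'
)

# The file has no header row, so supply (assumed) column names explicitly
frame = pd.read_csv(sample, header = None, names = ["label", "title", "description"])
labels = frame["label"].tolist()
titles = frame["title"].tolist()
print(labels, titles[0])
```

pandas handles the quoted fields and type inference automatically; the native approach below does the same job with only the standard library.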
import csv

def setup():
    with open("../../Shares/ag_news_csv/train.csv", "r") as handler:
        labels = []
        titles = []
        descriptions = []
        trains = csv.reader(handler)
        for line in trains:
            labels.append(line[0])
            titles.append(line[1])
            descriptions.append(line[2])
        return labels, titles, descriptions

def main():
    labels, titles, descriptions = setup()
    print(labels[: 5], titles[: 5], titles[: 5])

if __name__ == "__main__":
    main()
Running this prints the following:
['3', '3', '3', '3', '3'] ['Wall St. Bears Claw Back Into the Black (Reuters)', 'Carlyle Looks Toward Commercial Aerospace (Reuters)', "Oil and Economy Cloud Stocks' Outlook (Reuters)", 'Iraq Halts Oil Exports from Main Southern Pipeline (Reuters)', 'Oil prices soar to all-time record, posing new menace to US economy (AFP)'] ['Wall St. Bears Claw Back Into the Black (Reuters)', 'Carlyle Looks Toward Commercial Aerospace (Reuters)', "Oil and Economy Cloud Stocks' Outlook (Reuters)", 'Iraq Halts Oil Exports from Main Southern Pipeline (Reuters)', 'Oil prices soar to all-time record, posing new menace to US economy (AFP)']
As you can see, each column now goes into its own array. For consistency, all letters will also be converted to lowercase for the later computations.
Data Cleaning
Besides common punctuation, the text contains many special characters, so it still needs to be cleaned.
Cleaning can be done with regular expressions: every character other than the lowercase letters a-z, the uppercase letters A-Z, and the digits 0-9 is treated as a special character and replaced with a space. The code is as follows:
import re

def purify(string, pattern = r"[^a-z0-9]", replacement = " "):
    string = re.sub(pattern = pattern, repl = replacement, string = string)
    return string
In this code, re is Python's regular-expression package, and inside the character class ^ means negation, i.e. the set of characters outside the given ranges. Closer analysis shows that replacing unwanted characters with spaces introduces a new problem: runs of multiple spaces, and leftover spaces at the beginning and end of the text. These also interfere with reading the text, so the replaced text needs a second pass:
def purify(string: str, pattern: str = r"[^a-z0-9]", replacement: str = " "):
    string = string.lower()
    string = re.sub(pattern = pattern, repl = replacement, string = string)
    # Replace consecutive spaces with a single space
    string = re.sub(pattern = r" +", repl = replacement, string = string)
    # Trim the string
    string = string.strip()
    # Separate the string by spaces, yielding an array of words
    strings = string.split(" ")
    return strings
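To see the function in action, the self-contained sketch below repeats the definition and applies it to the first title in train.csv:

```python
import re

def purify(string: str, pattern: str = r"[^a-z0-9]", replacement: str = " "):
    string = string.lower()
    string = re.sub(pattern = pattern, repl = replacement, string = string)
    # Collapse runs of spaces, trim, and split into words
    string = re.sub(pattern = r" +", repl = replacement, string = string)
    string = string.strip()
    return string.split(" ")

print(purify("Wall St. Bears Claw Back Into the Black (Reuters)"))
# ['wall', 'st', 'bears', 'claw', 'back', 'into', 'the', 'black', 'reuters']
```

The period and parentheses become spaces, the run of spaces collapses, and the result is a clean list of lowercase tokens.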
Next, apply the new cleaning function while reading the text data:
import csv
import re
import jax

def setup():
    with open("../../Shares/ag_news_csv/train.csv", "r") as handler:
        labels = []
        titles = []
        descriptions = []
        trains = csv.reader(handler)
        for line in trains:
            labels.append(jax.numpy.float32(line[0]))
            titles.append(purify(line[1]))
            descriptions.append(purify(line[2]))
        return labels, titles, descriptions

def purify(string: str, pattern: str = r"[^a-z0-9]", replacement: str = " "):
    string = string.lower()
    string = re.sub(pattern = pattern, repl = replacement, string = string)
    # Replace consecutive spaces with a single space
    string = re.sub(pattern = r" +", repl = replacement, string = string)
    # Trim the string
    string = string.strip()
    # Separate the string by spaces, yielding an array of words
    strings = string.split(" ")
    return strings

def main():
    labels, titles, descriptions = setup()
    print(labels[: 5], titles[: 5], titles[: 5])

if __name__ == "__main__":
    main()
Running this prints the following:
[Array(3., dtype=float32), Array(3., dtype=float32), Array(3., dtype=float32), Array(3., dtype=float32), Array(3., dtype=float32)] [['wall', 'st', 'bears', 'claw', 'back', 'into', 'the', 'black', 'reuters'], ['carlyle', 'looks', 'toward', 'commercial', 'aerospace', 'reuters'], ['oil', 'and', 'economy', 'cloud', 'stocks', 'outlook', 'reuters'], ['iraq', 'halts', 'oil', 'exports', 'from', 'main', 'southern', 'pipeline', 'reuters'], ['oil', 'prices', 'soar', 'to', 'all', 'time', 'record', 'posing', 'new', 'menace', 'to', 'us', 'economy', 'afp']] [['wall', 'st', 'bears', 'claw', 'back', 'into', 'the', 'black', 'reuters'], ['carlyle', 'looks', 'toward', 'commercial', 'aerospace', 'reuters'], ['oil', 'and', 'economy', 'cloud', 'stocks', 'outlook', 'reuters'], ['iraq', 'halts', 'oil', 'exports', 'from', 'main', 'southern', 'pipeline', 'reuters'], ['oil', 'prices', 'soar', 'to', 'all', 'time', 'record', 'posing', 'new', 'menace', 'to', 'us', 'economy', 'afp']]
Stop Words
Looking at the tokenized text, besides the nouns and verbs that carry meaning, each entry contains many words with no real content whose removal does not change the meaning, such as is, are, and the. These words add little to a sentence, and because they occur so often, they interfere with the later embedding analysis. To reduce the vocabulary and the complexity of the downstream code, these stop words need to be removed. This is usually done with the NLTK toolkit, installed with the following command or from the IDE:
pip install nltk
Installing NLTK alone is not enough to remove stop words: the NLTK stopwords package must also be downloaded. It is easiest to do this from the NLTK console inside a Python shell. Type python on the command line to enter the interpreter, then enter the following commands (the Config> d step is optional):
Python 3.9.2 (default, Feb 28 2021, 17:03:44)
[GCC 10.2.1 20210110] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import nltk
>>> nltk.download()
NLTK Downloader
---------------------------------------------------------------------------
d) Download l) List u) Update c) Config h) Help q) Quit
---------------------------------------------------------------------------
Downloader> c
Data Server:
- URL: <https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml>
- 7 Package Collections Available
- 113 Individual Packages Available
Local Machine:
- Data directory: /home/jinhui/nltk_data
---------------------------------------------------------------------------
s) Show Config u) Set Server URL d) Set Data Dir m) Main Menu
---------------------------------------------------------------------------
Config> d
New Directory> /tmp/
---------------------------------------------------------------------------
s) Show Config u) Set Server URL d) Set Data Dir m) Main Menu
---------------------------------------------------------------------------
Config> m
---------------------------------------------------------------------------
d) Download l) List u) Update c) Config h) Help q) Quit
---------------------------------------------------------------------------
Downloader> d
Download which package (l=list; x=cancel)?
Identifier> stopwords
Downloading package stopwords to /tmp/...
Package stopwords is already up-to-date!
---------------------------------------------------------------------------
d) Download l) List u) Update c) Config h) Help q) Quit
---------------------------------------------------------------------------
Downloader>
Afterwards, the stopwords package will be under /tmp/corpora. The following code reads the downloaded stop words:
import ssl
import nltk

def purify_stop_words():
    try:
        _create_unverified_https_context = ssl._create_unverified_context
    except AttributeError:
        pass
    else:
        ssl._create_default_https_context = _create_unverified_https_context
    nltk.data.path.append("/tmp/")
    nltk.download("stopwords", download_dir = "/tmp/")
    stops = nltk.corpus.stopwords.words("english")
    print(stops)
    return stops

def main():
    purify_stop_words()

if __name__ == "__main__":
    main()
Running this prints the following:
[nltk_data] Downloading package stopwords to /tmp/...
[nltk_data] Package stopwords is already up-to-date!
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
The next step is to load the stop-word data into the text cleaner. In addition, because of how English works, words take different inflected forms; for example, the suffixes ing and ed can be dropped, ies can be replaced with y, and so on. The result may not be a complete word, only a stem, but that is fine as long as every form of a word reduces to the same stem. NLTK performs this stemming with the following function:
nltk.PorterStemmer().stem(word)
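A few quick calls show the effect; note that the stems need not be dictionary words:

```python
import nltk

# The Porter stemmer is pure code; no corpus download is needed for it
stemmer = nltk.PorterStemmer()
print(stemmer.stem("running"))  # run
print(stemmer.stem("economy"))  # economi
print(stemmer.stem("reuters"))  # reuter
```

These are exactly the stems that appear in the cleaned dataset output.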
The complete code is as follows:
import csv
import re
import jax
import ssl
import nltk

def stop_words():
    try:
        _create_unverified_https_context = ssl._create_unverified_context
    except AttributeError:
        pass
    else:
        ssl._create_default_https_context = _create_unverified_https_context
    nltk.data.path.append("/tmp/")
    nltk.download("stopwords", download_dir = "/tmp/")
    stops = nltk.corpus.stopwords.words("english")
    return stops

def purify(string: str, pattern: str = r"[^a-z0-9]", replacement: str = " ", stops = stop_words()):
    string = string.lower()
    string = re.sub(pattern = pattern, repl = replacement, string = string)
    # Replace consecutive spaces with a single space
    string = re.sub(pattern = r" +", repl = replacement, string = string)
    # Trim the string
    string = string.strip()
    # Separate the string by spaces, yielding an array of words
    strings = string.split(" ")
    # Drop stop words, then reduce each remaining word to its stem
    strings = [word for word in strings if word not in stops]
    strings = [nltk.PorterStemmer().stem(word) for word in strings]
    # Mark the beginning and end of each sentence
    strings.append("eos")
    strings = ["bos"] + strings
    return strings

def setup():
    with open("../../Shares/ag_news_csv/train.csv", "r") as handler:
        labels = []
        titles = []
        descriptions = []
        trains = csv.reader(handler)
        for line in trains:
            labels.append(jax.numpy.float32(line[0]))
            titles.append(purify(line[1]))
            descriptions.append(purify(line[2]))
        return labels, titles, descriptions

def main():
    labels, titles, descriptions = setup()
    print(labels[: 5], titles[: 5], titles[: 5])

if __name__ == "__main__":
    main()
Running this prints the following:
[Array(3., dtype=float32), Array(3., dtype=float32), Array(3., dtype=float32), Array(3., dtype=float32), Array(3., dtype=float32)] [['bos', 'wall', 'st', 'bear', 'claw', 'back', 'black', 'reuter', 'eos'], ['bos', 'carlyl', 'look', 'toward', 'commerci', 'aerospac', 'reuter', 'eos'], ['bos', 'oil', 'economi', 'cloud', 'stock', 'outlook', 'reuter', 'eos'], ['bos', 'iraq', 'halt', 'oil', 'export', 'main', 'southern', 'pipelin', 'reuter', 'eos'], ['bos', 'oil', 'price', 'soar', 'time', 'record', 'pose', 'new', 'menac', 'us', 'economi', 'afp', 'eos']] [['bos', 'wall', 'st', 'bear', 'claw', 'back', 'black', 'reuter', 'eos'], ['bos', 'carlyl', 'look', 'toward', 'commerci', 'aerospac', 'reuter', 'eos'], ['bos', 'oil', 'economi', 'cloud', 'stock', 'outlook', 'reuter', 'eos'], ['bos', 'iraq', 'halt', 'oil', 'export', 'main', 'southern', 'pipelin', 'reuter', 'eos'], ['bos', 'oil', 'price', 'soar', 'time', 'record', 'pose', 'new', 'menac', 'us', 'economi', 'afp', 'eos']]
…
Compared with the raw text, this yields relatively clean text data. The cleaning steps can be summarized as follows:
- Tokenization: split each sentence apart and store it as individual words or characters. In the cleaning function, this is what string.split(" ") does.
- Normalization: normalize the words. The lower function and the PorterStemmer do this work.
- Rare Word Replacement: replace low-frequency words, typically those occurring fewer than 5 times, with a special token <UNK>. This reduces noise and shrinks the vocabulary. Because the words in the training and test sets used here are fairly concentrated, this step was skipped.
- Add <BOS><EOS>: add begin- and end-of-sentence markers to every sentence.
- Long Sentence Cut-Off or Short Sentence Padding: truncate sentences that are too long and pad sentences that are too short.
To suit the model, not all of the steps above were applied verbatim here; in projects of a different nature, use them as appropriate.
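Although the rare-word and padding steps were skipped in this chapter, they are easy to add. The sketch below is a minimal, assumed implementation: the threshold of 5, the <unk> and <pad> tokens, and the fixed length of 8 are all illustrative choices, not values from the text.

```python
from collections import Counter

def replace_rare(sentences, threshold = 5):
    # Count word frequencies over the whole corpus
    counts = Counter(word for sentence in sentences for word in sentence)
    # Words seen fewer than `threshold` times become <unk>
    return [[word if counts[word] >= threshold else "<unk>" for word in sentence]
            for sentence in sentences]

def pad_or_cut(sentence, length = 8, pad = "<pad>"):
    # Truncate long sentences, pad short ones out to a fixed length
    return sentence[: length] + [pad] * (length - len(sentence))

# Toy corpus: "tacks" appears only once, so it is replaced by <unk>
corpus = [["bos", "oil", "price", "eos"]] * 5 + [["bos", "tacks", "eos"]]
corpus = [pad_or_cut(s) for s in replace_rare(corpus)]
print(corpus[-1])
# ['bos', '<unk>', 'eos', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>']
```

After these two passes, every sentence has the same length, which is what batch training later requires.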
Conclusion
As preparation for text processing, this chapter introduced the AG News dataset along with text cleaning and stop-word removal, laying the groundwork for training word vectors in the chapters that follow.