Whether in deep learning or natural language processing, a key topic is converting natural language into feature vectors that a computer can work with. Text preprocessing generally follows the pipeline tokenization -> word embedding -> feature extraction, ultimately building a matrix of vectors that represents the text content.
Word Embedding
Clearly, word embedding is an important concept. A word embedding (also called a word vector) is a numeric representation of a word. Before diving into the concept, consider a few examples:
- After buying a product or staying at a hotel, customers are invited to write reviews rating their satisfaction with the service.
- A few words are combined into a query for a search engine.
- Some blog or news sites automatically attach related tags to their articles.
These are all applications of text processing, where the goal is to use the text for sentiment analysis, synonym clustering, article classification, and tagging.
As readers, when we read or comment on an article, we can usually summarize fairly accurately what it is about and which way its opinions lean. But how does a computer do this? In the past, it may have been nothing more than matching database queries. Today, however, if you search for "convolution" in a short-video app and play a few related videos, content about mathematics and deep learning soon shows up in your recommendations. The computer has associated the word "convolution" with mathematics and deep learning.
Word embeddings arose from this need: they are numeric representations of text. With such representations and some computation, a computer can easily arrive at relations like
convolution -> mathematics + deep learning = gradients + differentiation, and so on.
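A relation like this can be sketched with toy vectors. Everything in the snippet below is illustrative only: the 3-dimensional vectors and the word list are hand-made for the classic king/queen analogy, not learned embeddings, which would come from training on a corpus.

```python
import numpy as np

# Hand-made 3-dimensional "embeddings", for illustration only;
# real word vectors would be learned from a corpus.
embeddings = {
    "man":   np.array([1.0, 0.0, 0.0]),
    "woman": np.array([0.0, 1.0, 0.0]),
    "king":  np.array([1.0, 0.0, 1.0]),
    "queen": np.array([0.0, 1.0, 1.0]),
    "apple": np.array([0.5, 0.5, 0.0]),
}

def cosine(u, v):
    # Cosine similarity: 1.0 means the vectors point the same way
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# king - man + woman lands closest to queen
target = embeddings["king"] - embeddings["man"] + embeddings["woman"]
candidates = [w for w in embeddings if w not in ("king", "man", "woman")]
nearest = max(candidates, key = lambda w: cosine(target, embeddings[w]))
print(nearest)  # queen
```

With learned embeddings the arithmetic is rarely this exact, but the nearest-neighbor search works the same way.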
This chapter focuses on word embeddings as a way of processing text. Through several embedding computations, it explains step by step how to obtain the corresponding word vectors and, as in earlier chapters, finishes with a hands-on text classification project to deepen understanding.
Text Processing
The following sections introduce text processing through a news classification dataset.
Dataset Preparation
The AG news classification dataset is provided by the academic community ComeToMyHead. It gathers more than one million news articles from over 2,000 different news sources, for non-commercial research in classification, clustering, ranking, search, and so on. Researchers later extracted 127,600 samples from it: 120,000 as a training set and 7,600 as a test set, with the following four classes:
- World
- Sports
- Business
- Sci/Tech
tensorflow_datasets provides the AG News dataset; the code is as follows:
import tensorflow_datasets as tfds

def setup():
    (trains, tests), metas = tfds.load("ag_news_subset", split = [tfds.Split.TRAIN, tfds.Split.TEST], with_info = True, as_supervised = True, data_dir = "/tmp/")
    class_names = metas.features["label"].names
    number_classes = metas.features["label"].num_classes
    number_trains = metas.splits["train"].num_examples
    number_tests = metas.splits["test"].num_examples
    buffer_size = 1000
    batch_size = 50
    trains = trains.shuffle(buffer_size)
    trains = trains.batch(batch_size).prefetch(1)
    trains = tfds.as_numpy(trains)
    train_news = []
    train_labels = []
    for news, label in trains:
        train_news.append(news)
        train_labels.append(label)
    tests = tests.batch(batch_size).prefetch(1)
    tests = tfds.as_numpy(tests)
    test_news = []
    test_labels = []
    for news, label in tests:
        test_news.append(news)
        test_labels.append(label)
    return train_news, train_labels, test_news, test_labels

def main():
    train_news, train_labels, test_news, test_labels = setup()
    for (label, news) in zip(train_labels[0: 2], train_news[: 2]):
        print(f"label = {label}, news = {news}")

if __name__ == "__main__":
    main()
Running this prints the following:
label = [3 0 0 1 1 3 2 0 1 3 2 0 0 2 1 1 3 1 3 3 2 2 2 0 0 0 2 2 0 1 3 3 2 0 0 1 2
3 1 0 2 0 0 1 1 3 0 0 0 1], news = [b'Sony BMG - aka #39;Bony #39; - the merged music label is in talks with Grokster, the P2P software company has confirmed. Negotiations are believed to be focused on the development of a new, label-friendly P2P network.'
b'AP - U.N. Secretary-General Kofi Annan, fending off Republican demands for his resignation over alleged corruption, said Thursday he will expand U.N. support for Iraqi elections if need be. He said he was not offended that President Bush did not ask to see him during this visit to Washington.'
b'World News: Islamabad, Oct 30 : Pakistan President Pervez Musharraf Saturday said a solution to the dragging Kashmir dispute could be found only if India and Pakistan agreed to move beyond their stated positions on the issue.'
b'If you believe Dale Earnhardt Jr. is hurting now, wait until late next month, should he fail to recoup the 25 points he lost for uttering a naughty word on national television following last week #39;s emotional victory at Talladega (Ala.'
b'Olympic rowing hero James Cracknell has revealed a lack of hunger was behind his decision to take a year away from the sport. The 32-year-old, who was part of the British gold-medal winning coxless fours in '
b'The PC maker is eyeing in-home services to go with its new consumer electronics line.'
b'It #39;s another level of security for America Online - but users will have to pay extra for it. AOL is offering an optional log-on service that will require more than just a password to get onto the service.'
b"NEW YORK - Gary Sheffield hit an RBI single, and the New York Yankees took a 1-0 lead over Pedro Martinez and the Boston Red Sox in Game 2 of the AL championship series Wednesday night. New York starter Jon Lieber allowed only one hit and a walk, stifling Boston's sluggers just as Mike Mussina did the night before..."
b'England captain Michael Vaughan said on Monday that he expected an evenly matched series against South Africa but suggested his bowlers could hold the key.'
b'Hoping to throw some tacks in the road to slow Linux momentum, Microsoft during the next year will redouble its efforts to woo more corporate users migrating from Unix to the open source OS.'
b'Corus, Britain #39;s biggest steelmaker, yesterday reported its first profits since the merger of British Steel and Dutch rival Hoogovens five years ago.'
…
Alternatively, the ag_news dataset files can be downloaded directly from https://huggingface.co/datasets/pietrolesci/ag_news/tree/main, or requested by email.
Opening the downloaded train.csv, you can see three columns: the label (class), the title, and the description (body). The title and body use "," and "." as sentence delimiters.
Because the dataset was collected and stored automatically by the community, it inevitably contains a fair amount of noise, such as special characters, so it needs cleaning before use.
The following code reads the dataset from the downloaded csv file.
import csv

def setup():
    with open("../../Shares/ag_news_csv/train.csv", "r") as handler:
        trains = csv.reader(handler)
        for line in trains:
            print(line)

def main():
    setup()

if __name__ == "__main__":
    main()
Running this prints the following:
['3', 'Wall St. Bears Claw Back Into the Black (Reuters)', "Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again."]
['3', 'Carlyle Looks Toward Commercial Aerospace (Reuters)', 'Reuters - Private investment firm Carlyle Group,\\which has a reputation for making well-timed and occasionally\\controversial plays in the defense industry, has quietly placed\\its bets on another part of the market.']
['3', "Oil and Economy Cloud Stocks' Outlook (Reuters)", 'Reuters - Soaring crude prices plus worries\\about the economy and the outlook for earnings are expected to\\hang over the stock market next week during the depth of the\\summer doldrums.']
['3', 'Iraq Halts Oil Exports from Main Southern Pipeline (Reuters)', 'Reuters - Authorities have halted oil export\\flows from the main pipeline in southern Iraq after\\intelligence showed a rebel militia could strike\\infrastructure, an oil official said on Saturday.']
['3', 'Oil prices soar to all-time record, posing new menace to US economy (AFP)', 'AFP - Tearaway world oil prices, toppling records and straining wallets, present a new economic menace barely three months before the US presidential elections.']
…
Since this is csv data, each row splits on commas by default. For convenient classification, each column can be stored in its own array; pandas could also be used if preferred. For the sake of speed in the subsequent operations, this chapter mainly sticks to native Python and NumPy functions.
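For reference, the pandas route mentioned above might look like the sketch below. The inline CSV snippet stands in for train.csv, and the column names are an assumption, since the file itself has no header row.

```python
import io
import pandas as pd

# Two rows in the same shape as train.csv (label, title, description);
# io.StringIO stands in for the real file path here.
sample = io.StringIO(
    '3,"Wall St. Bears Claw Back Into the Black (Reuters)","Reuters - Short-sellers are seeing green again."\n'
    '3,"Carlyle Looks Toward Commercial Aerospace (Reuters)","Reuters - Private investment firm Carlyle Group."\n'
)

# The file has no header row, so supply (assumed) column names explicitly
frame = pd.read_csv(sample, header = None, names = ["label", "title", "description"])
labels = frame["label"].tolist()
titles = frame["title"].tolist()
print(labels, titles[0])
```

pandas handles the quoted fields and type inference automatically; the native approach below does the same job with only the standard library.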
import csv

def setup():
    with open("../../Shares/ag_news_csv/train.csv", "r") as handler:
        labels = []
        titles = []
        descriptions = []
        trains = csv.reader(handler)
        for line in trains:
            labels.append(line[0])
            titles.append(line[1])
            descriptions.append(line[2])
        return labels, titles, descriptions

def main():
    labels, titles, descriptions = setup()
    print(labels[: 5], titles[: 5], titles[: 5])

if __name__ == "__main__":
    main()
Running this prints the following:
['3', '3', '3', '3', '3'] ['Wall St. Bears Claw Back Into the Black (Reuters)', 'Carlyle Looks Toward Commercial Aerospace (Reuters)', "Oil and Economy Cloud Stocks' Outlook (Reuters)", 'Iraq Halts Oil Exports from Main Southern Pipeline (Reuters)', 'Oil prices soar to all-time record, posing new menace to US economy (AFP)'] ['Wall St. Bears Claw Back Into the Black (Reuters)', 'Carlyle Looks Toward Commercial Aerospace (Reuters)', "Oil and Economy Cloud Stocks' Outlook (Reuters)", 'Iraq Halts Oil Exports from Main Southern Pipeline (Reuters)', 'Oil prices soar to all-time record, posing new menace to US economy (AFP)']
As you can see, each column now goes into its own array. For consistency, all letters will also be converted to lowercase for the later computations.
Data Cleaning
Besides common punctuation, the text contains many special characters, so it still needs to be cleaned.
Cleaning can be done with regular expressions: every character other than the lowercase letters a-z, the uppercase letters A-Z, and the digits 0-9 is treated as a special character and replaced with a space. The code is as follows:
import re

def purify(string, pattern = r"[^a-z0-9]", replacement = " "):
    string = re.sub(pattern = pattern, repl = replacement, string = string)
    return string
In this code, re is Python's regular-expression package, and inside the character class ^ means negation, i.e. the set of characters outside the given ranges. Closer analysis shows that replacing unwanted characters with spaces introduces a new problem: runs of multiple spaces, and leftover spaces at the beginning and end of the text. These also interfere with reading the text, so the replaced text needs a second pass:
def purify(string: str, pattern: str = r"[^a-z0-9]", replacement: str = " "):
    string = string.lower()
    string = re.sub(pattern = pattern, repl = replacement, string = string)
    # Replace consecutive spaces with a single space
    string = re.sub(pattern = r" +", repl = replacement, string = string)
    # Trim the string
    string = string.strip()
    # Separate the string by spaces, yielding an array of words
    strings = string.split(" ")
    return strings
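To see the function in action, the self-contained sketch below repeats the definition and applies it to the first title in train.csv:

```python
import re

def purify(string: str, pattern: str = r"[^a-z0-9]", replacement: str = " "):
    string = string.lower()
    string = re.sub(pattern = pattern, repl = replacement, string = string)
    # Collapse runs of spaces, trim, and split into words
    string = re.sub(pattern = r" +", repl = replacement, string = string)
    string = string.strip()
    return string.split(" ")

print(purify("Wall St. Bears Claw Back Into the Black (Reuters)"))
# ['wall', 'st', 'bears', 'claw', 'back', 'into', 'the', 'black', 'reuters']
```

The period and parentheses become spaces, the run of spaces collapses, and the result is a clean list of lowercase tokens.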
Next, apply the new cleaning function while reading the text data:
import csv
import re
import jax

def setup():
    with open("../../Shares/ag_news_csv/train.csv", "r") as handler:
        labels = []
        titles = []
        descriptions = []
        trains = csv.reader(handler)
        for line in trains:
            labels.append(jax.numpy.float32(line[0]))
            titles.append(purify(line[1]))
            descriptions.append(purify(line[2]))
        return labels, titles, descriptions

def purify(string: str, pattern: str = r"[^a-z0-9]", replacement: str = " "):
    string = string.lower()
    string = re.sub(pattern = pattern, repl = replacement, string = string)
    # Replace consecutive spaces with a single space
    string = re.sub(pattern = r" +", repl = replacement, string = string)
    # Trim the string
    string = string.strip()
    # Separate the string by spaces, yielding an array of words
    strings = string.split(" ")
    return strings

def main():
    labels, titles, descriptions = setup()
    print(labels[: 5], titles[: 5], titles[: 5])

if __name__ == "__main__":
    main()
Running this prints the following:
[Array(3., dtype=float32), Array(3., dtype=float32), Array(3., dtype=float32), Array(3., dtype=float32), Array(3., dtype=float32)] [['wall', 'st', 'bears', 'claw', 'back', 'into', 'the', 'black', 'reuters'], ['carlyle', 'looks', 'toward', 'commercial', 'aerospace', 'reuters'], ['oil', 'and', 'economy', 'cloud', 'stocks', 'outlook', 'reuters'], ['iraq', 'halts', 'oil', 'exports', 'from', 'main', 'southern', 'pipeline', 'reuters'], ['oil', 'prices', 'soar', 'to', 'all', 'time', 'record', 'posing', 'new', 'menace', 'to', 'us', 'economy', 'afp']] [['wall', 'st', 'bears', 'claw', 'back', 'into', 'the', 'black', 'reuters'], ['carlyle', 'looks', 'toward', 'commercial', 'aerospace', 'reuters'], ['oil', 'and', 'economy', 'cloud', 'stocks', 'outlook', 'reuters'], ['iraq', 'halts', 'oil', 'exports', 'from', 'main', 'southern', 'pipeline', 'reuters'], ['oil', 'prices', 'soar', 'to', 'all', 'time', 'record', 'posing', 'new', 'menace', 'to', 'us', 'economy', 'afp']]
Stop Words
Looking at the tokenized text, besides the nouns and verbs that carry meaning, each entry contains many words with no real content whose removal does not change the meaning, such as is, are, and the. These words add little to a sentence, and because they occur so often, they interfere with the later embedding analysis. To reduce the vocabulary and the complexity of the downstream code, these stop words need to be removed. This is usually done with the NLTK toolkit, installed with the following command or from the IDE:
pip install nltk
Installing NLTK alone is not enough to remove stop words: the NLTK stopwords package must also be downloaded. It is easiest to do this from the NLTK console inside a Python shell. Type python on the command line to enter the interpreter, then enter the following commands (the Config> d step is optional):
Python 3.9.2 (default, Feb 28 2021, 17:03:44)
[GCC 10.2.1 20210110] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import nltk
>>> nltk.download()
NLTK Downloader
---------------------------------------------------------------------------
d) Download l) List u) Update c) Config h) Help q) Quit
---------------------------------------------------------------------------
Downloader> c
Data Server:
- URL: <https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml>
- 7 Package Collections Available
- 113 Individual Packages Available
Local Machine:
- Data directory: /home/jinhui/nltk_data
---------------------------------------------------------------------------
s) Show Config u) Set Server URL d) Set Data Dir m) Main Menu
---------------------------------------------------------------------------
Config> d
New Directory> /tmp/
---------------------------------------------------------------------------
s) Show Config u) Set Server URL d) Set Data Dir m) Main Menu
---------------------------------------------------------------------------
Config> m
---------------------------------------------------------------------------
d) Download l) List u) Update c) Config h) Help q) Quit
---------------------------------------------------------------------------
Downloader> d
Download which package (l=list; x=cancel)?
Identifier> stopwords
Downloading package stopwords to /tmp/...
Package stopwords is already up-to-date!
---------------------------------------------------------------------------
d) Download l) List u) Update c) Config h) Help q) Quit
---------------------------------------------------------------------------
Downloader>
Afterwards, the stopwords package will be under /tmp/corpora. The following code reads the downloaded stop words:
import ssl
import nltk

def purify_stop_words():
    try:
        _create_unverified_https_context = ssl._create_unverified_context
    except AttributeError:
        pass
    else:
        ssl._create_default_https_context = _create_unverified_https_context
    nltk.data.path.append("/tmp/")
    nltk.download("stopwords", download_dir = "/tmp/")
    stops = nltk.corpus.stopwords.words("english")
    print(stops)
    return stops

def main():
    purify_stop_words()

if __name__ == "__main__":
    main()
Running this prints the following:
[nltk_data] Downloading package stopwords to /tmp/...
[nltk_data] Package stopwords is already up-to-date!
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
The next step is to load the stop-word data into the text cleaner. In addition, because of how English works, words take different inflected forms; for example, the suffixes ing and ed can be dropped, ies can be replaced with y, and so on. The result may not be a complete word, only a stem, but that is fine as long as every form of a word reduces to the same stem. NLTK performs this stemming with the following function:
nltk.PorterStemmer().stem(word)
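A few quick calls show the effect; note that the stems need not be dictionary words:

```python
import nltk

# The Porter stemmer is pure code; no corpus download is needed for it
stemmer = nltk.PorterStemmer()
print(stemmer.stem("running"))  # run
print(stemmer.stem("economy"))  # economi
print(stemmer.stem("reuters"))  # reuter
```

These are exactly the stems that appear in the cleaned dataset output.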
The complete code is as follows:
import csv
import re
import jax
import ssl
import nltk

def stop_words():
    try:
        _create_unverified_https_context = ssl._create_unverified_context
    except AttributeError:
        pass
    else:
        ssl._create_default_https_context = _create_unverified_https_context
    nltk.data.path.append("/tmp/")
    nltk.download("stopwords", download_dir = "/tmp/")
    stops = nltk.corpus.stopwords.words("english")
    return stops

def purify(string: str, pattern: str = r"[^a-z0-9]", replacement: str = " ", stops = stop_words()):
    string = string.lower()
    string = re.sub(pattern = pattern, repl = replacement, string = string)
    # Replace consecutive spaces with a single space
    string = re.sub(pattern = r" +", repl = replacement, string = string)
    # Trim the string
    string = string.strip()
    # Separate the string by spaces, yielding an array of words
    strings = string.split(" ")
    # Drop stop words, then reduce each remaining word to its stem
    strings = [word for word in strings if word not in stops]
    strings = [nltk.PorterStemmer().stem(word) for word in strings]
    # Mark the beginning and end of each sentence
    strings.append("eos")
    strings = ["bos"] + strings
    return strings

def setup():
    with open("../../Shares/ag_news_csv/train.csv", "r") as handler:
        labels = []
        titles = []
        descriptions = []
        trains = csv.reader(handler)
        for line in trains:
            labels.append(jax.numpy.float32(line[0]))
            titles.append(purify(line[1]))
            descriptions.append(purify(line[2]))
        return labels, titles, descriptions

def main():
    labels, titles, descriptions = setup()
    print(labels[: 5], titles[: 5], titles[: 5])

if __name__ == "__main__":
    main()
Running this prints the following:
[Array(3., dtype=float32), Array(3., dtype=float32), Array(3., dtype=float32), Array(3., dtype=float32), Array(3., dtype=float32)] [['bos', 'wall', 'st', 'bear', 'claw', 'back', 'black', 'reuter', 'eos'], ['bos', 'carlyl', 'look', 'toward', 'commerci', 'aerospac', 'reuter', 'eos'], ['bos', 'oil', 'economi', 'cloud', 'stock', 'outlook', 'reuter', 'eos'], ['bos', 'iraq', 'halt', 'oil', 'export', 'main', 'southern', 'pipelin', 'reuter', 'eos'], ['bos', 'oil', 'price', 'soar', 'time', 'record', 'pose', 'new', 'menac', 'us', 'economi', 'afp', 'eos']] [['bos', 'wall', 'st', 'bear', 'claw', 'back', 'black', 'reuter', 'eos'], ['bos', 'carlyl', 'look', 'toward', 'commerci', 'aerospac', 'reuter', 'eos'], ['bos', 'oil', 'economi', 'cloud', 'stock', 'outlook', 'reuter', 'eos'], ['bos', 'iraq', 'halt', 'oil', 'export', 'main', 'southern', 'pipelin', 'reuter', 'eos'], ['bos', 'oil', 'price', 'soar', 'time', 'record', 'pose', 'new', 'menac', 'us', 'economi', 'afp', 'eos']]
…
Compared with the raw text, this yields relatively clean text data. The cleaning steps can be summarized as follows:
- Tokenization: split each sentence apart and store it as individual words or characters. In the cleaning function, this is what string.split(" ") does.
- Normalization: normalize the words. The lower function and the PorterStemmer do this work.
- Rare Word Replacement: replace low-frequency words, typically those occurring fewer than 5 times, with a special token <UNK>. This reduces noise and shrinks the vocabulary. Because the words in the training and test sets used here are fairly concentrated, this step was skipped.
- Add <BOS><EOS>: add begin- and end-of-sentence markers to every sentence.
- Long Sentence Cut-Off or Short Sentence Padding: truncate sentences that are too long and pad sentences that are too short.
To suit the model, not all of the steps above were applied verbatim here; in projects of a different nature, use them as appropriate.
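Although the rare-word and padding steps were skipped in this chapter, they are easy to add. The sketch below is a minimal, assumed implementation: the threshold of 5, the <unk> and <pad> tokens, and the fixed length of 8 are all illustrative choices, not values from the text.

```python
from collections import Counter

def replace_rare(sentences, threshold = 5):
    # Count word frequencies over the whole corpus
    counts = Counter(word for sentence in sentences for word in sentence)
    # Words seen fewer than `threshold` times become <unk>
    return [[word if counts[word] >= threshold else "<unk>" for word in sentence]
            for sentence in sentences]

def pad_or_cut(sentence, length = 8, pad = "<pad>"):
    # Truncate long sentences, pad short ones out to a fixed length
    return sentence[: length] + [pad] * (length - len(sentence))

# Toy corpus: "tacks" appears only once, so it is replaced by <unk>
corpus = [["bos", "oil", "price", "eos"]] * 5 + [["bos", "tacks", "eos"]]
corpus = [pad_or_cut(s) for s in replace_rare(corpus)]
print(corpus[-1])
# ['bos', '<unk>', 'eos', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>']
```

After these two passes, every sentence has the same length, which is what batch training later requires.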
Conclusion
As preparation for text processing, this chapter introduced the AG News dataset along with text cleaning and stop-word removal, laying the groundwork for training word vectors in the chapters that follow.