2. Text vectorization

In scikit-learn, extracting features from text means converting the text into a numeric form that a computer can process. scikit-learn provides three vectorizers:

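CountVectorizer: converts text into vectors of term counts
HashingVectorizer: converts text into vectors whose positions are determined by hashing each term
TfidfVectorizer: converts text into vectors of TF-IDF values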

All three live in sklearn.feature_extraction.text.

CountVectorizer

First, look at the CountVectorizer constructor:

class sklearn.feature_extraction.text.CountVectorizer(
            input=u'content', 
            encoding=u'utf-8', 
            decode_error=u'strict', 
            strip_accents=None, 
            lowercase=True, 
            preprocessor=None, 
            tokenizer=None, 
            stop_words=None, 
            token_pattern=u'(?u)\b\w\w+\b', 
            ngram_range=(1, 1), 
            analyzer=u'word', 
            max_df=1.0, 
            min_df=1, 
            max_features=None, 
            vocabulary=None, 
            binary=False, 
            dtype=<type 'numpy.int64'>)

This article focuses on the following parameters:

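input: how the argument passed to fit/transform is interpreted. It takes one of three values: 'filename' (a sequence of file paths), 'file' (file-like objects with a read method), or 'content' (the default: the strings or bytes themselves). The corpus itself is supplied later, when calling fit or transform
lowercase: whether to convert the input to lowercase before tokenizing; defaults to True
analyzer: one of four options: 'word', 'char', 'char_wb', or a callable (a user-defined function)
ngram_range: the range of n-gram sizes used to build the vocabulary
max_features: the number of features to keep, chosen by term frequency across the corpus; the default None keeps all features
preprocessor: None, or a user-defined callable that preprocesses each document before tokenization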
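To make a few of these parameters concrete, here is a minimal sketch on a toy corpus. The corpus and the helper name drop_digits are made up for illustration and are not part of the article. Note that a custom preprocessor replaces the built-in lowercasing step, so the helper lowercases the text itself.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (an assumption for illustration): apple x4, banana x3, cherry x1, date x1
docs = ["apple apple apple banana", "banana cherry", "apple banana date"]

# max_features=2 keeps only the two most frequent terms across the corpus.
top2 = CountVectorizer(max_features=2).fit(docs)
top2_terms = sorted(top2.vocabulary_)

# lowercase=False treats "Apple" and "apple" as two different terms.
cased = CountVectorizer(lowercase=False).fit(["Apple apple"])
cased_terms = sorted(cased.vocabulary_)

# Hypothetical preprocessor: lowercase the text and replace digit runs with a space.
def drop_digits(text):
    return re.sub(r"\d+", " ", text.lower())

cleaned = CountVectorizer(preprocessor=drop_digits).fit(["Apple 123 banana42"])
cleaned_terms = sorted(cleaned.vocabulary_)

print(top2_terms)
print(cased_terms)
print(cleaned_terms)
```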

First, let's see how setting analyzer to word, char, and char_wb differs on English text.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, HashingVectorizer

corps = [
    "When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.", 
    "Either a Mapping (e.g., a dict) where keys are terms and values are indices in the feature matrix, or an iterable over terms. If not given, a vocabulary is determined from the input documents. Indices in the mapping should not be repeated and should not have any gap between 0 and the largest index."
]


vectorizer = CountVectorizer(analyzer='word')
vectorizer.fit(corps)
vector = vectorizer.transform(corps)
print(vectorizer.vocabulary_)
print(vector.shape)
print(vector.toarray())

{'when': 58, 'building': 7, 'the': 53, 'vocabulary': 57, 'ignore': 24, 'terms': 50, 'that': 52, 'have': 21, 'document': 12, 'frequency': 17, 'strictly': 49, 'higher': 22, 'than': 51, 'given': 20, 'threshold': 55, 'corpus': 8, 'specific': 47, 'stop': 48, 'words': 60, 'if': 23, 'float': 16, 'parameter': 42, 'represents': 45, 'proportion': 43, 'of': 39, 'documents': 13, 'integer': 30, 'absolute': 0, 'counts': 9, 'this': 54, 'is': 31, 'ignored': 25, 'not': 38, 'none': 37, 'either': 14, 'mapping': 35, 'dict': 11, 'where': 59, 'keys': 33, 'are': 4, 'and': 2, 'values': 56, 'indices': 28, 'in': 26, 'feature': 15, 'matrix': 36, 'or': 40, 'an': 1, 'iterable': 32, 'over': 41, 'determined': 10, 'from': 18, 'input': 29, 'should': 46, 'be': 5, 'repeated': 44, 'any': 3, 'gap': 19, 'between': 6, 'largest': 34, 'index': 27}
(2, 61)
[[1 0 0 0 0 0 0 1 1 1 0 0 1 1 0 0 1 1 0 0 1 1 1 2 1 1 0 0 0 0 1 2 0 0 0 0
  0 1 1 1 0 0 2 1 0 1 0 1 1 1 1 1 1 3 1 1 0 2 1 0 1]
 [0 1 3 1 2 1 1 0 0 0 1 1 0 1 1 1 0 0 1 1 1 1 0 1 0 0 2 1 2 1 0 1 1 1 1 2
  1 0 3 0 1 1 0 0 1 0 2 0 0 0 2 0 0 4 0 0 1 1 0 1 0]]

vectorizer = CountVectorizer(analyzer = 'char')
vectorizer.fit(corps)
vector = vectorizer.transform(corps)
print(vectorizer.vocabulary_)
print(vector.shape)
print(vector.toarray())

{'w': 28, 'h': 14, 'e': 11, 'n': 19, ' ': 0, 'b': 8, 'u': 26, 'i': 15, 'l': 17, 'd': 10, 'g': 13, 't': 25, 'v': 27, 'o': 20, 'c': 9, 'a': 7, 'r': 23, 'y': 30, 'm': 18, 's': 24, 'f': 12, 'q': 22, '(': 1, 'p': 21, '-': 4, ')': 2, '.': 5, ',': 3, 'k': 16, 'x': 29, '0': 6}
(2, 31)
[[40  1  1  2  1  3  0 15  4 10  6 27  6  6 12 16  0  7  5 16 19  8  1 20
  15 23  9  4  2  0  4]
 [53  1  1  3  0  5  1 22  4  5 13 37  3  6  9 18  1  6  8 20 10  7  0 16
  11 20  7  5  2  2  3]]

vectorizer = CountVectorizer(analyzer='char_wb')
vectorizer.fit(corps)
vector = vectorizer.transform(corps)
print(vectorizer.vocabulary_)
print(vector.shape)
print(vector.toarray())

{' ': 0, 'w': 28, 'h': 14, 'e': 11, 'n': 19, 'b': 8, 'u': 26, 'i': 15, 'l': 17, 'd': 10, 'g': 13, 't': 25, 'v': 27, 'o': 20, 'c': 9, 'a': 7, 'r': 23, 'y': 30, 'm': 18, 's': 24, 'f': 12, 'q': 22, '(': 1, 'p': 21, '-': 4, ')': 2, '.': 5, ',': 3, 'k': 16, 'x': 29, '0': 6}
(2, 31)
[[ 82   1   1   2   1   3   0  15   4  10   6  27   6   6  12  16   0   7
    5  16  19   8   1  20  15  23   9   4   2   0   4]
 [108   1   1   3   0   5   1  22   4   5  13  37   3   6   9  18   1   6
    8  20  10   7   0  16  11  20   7   5   2   2   3]]

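As shown, with analyzer set to word, CountVectorizer counts by word, so the vocabulary size is 61 (after lowercasing, the corpus contains 61 distinct tokens of two or more characters). With analyzer set to char, CountVectorizer counts by character, and the vocabulary size is 31. With analyzer set to char_wb, the vocabulary is identical and only the count of the space character changes (82 and 108 instead of 40 and 53), so the difference from char is hard to see from this output. The real difference is that char_wb builds character n-grams only inside a single word (delimited by whitespace), padding each word with a space on both sides. Consider the following example: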

vectorizer = CountVectorizer(analyzer='char', ngram_range=(5, 5))
vectorizer.fit(['Hello word'])
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out().tolist())


vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(5, 5))
vectorizer.fit(['Hello word'])
print(vectorizer.get_feature_names_out().tolist())

[' word', 'ello ', 'hello', 'llo w', 'lo wo', 'o wor']
[' hell', ' word', 'ello ', 'hello', 'word ']

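The example above also introduces another CountVectorizer parameter, ngram_range. Its meaning is straightforward. When we set

ngram_range = (a, b)

a is the smallest n and b is the largest n: every n-gram with a <= n <= b is extracted.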
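As a small illustration of the range, the sketch below uses a made-up sentence (not from the article) with ngram_range=(1, 2), so the vocabulary holds both single words and adjacent word pairs.

```python
from sklearn.feature_extraction.text import CountVectorizer

# (1, 2) extracts unigrams and bigrams; the sentence is a toy example.
bigram_vectorizer = CountVectorizer(analyzer="word", ngram_range=(1, 2))
bigram_vectorizer.fit(["the quick brown fox"])

# Sorting the vocabulary keys gives the terms in feature-index order.
ngram_terms = sorted(bigram_vectorizer.vocabulary_)
print(ngram_terms)
```

With four words, this produces four unigrams and three bigrams, seven features in total.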

TfidfVectorizer

As before, start with the class definition:

class sklearn.feature_extraction.text.TfidfVectorizer(input=u'content', 
    encoding=u'utf-8', 
    decode_error=u'strict', 
    strip_accents=None, 
    lowercase=True, 
    preprocessor=None, 
    tokenizer=None, 
    analyzer=u'word', 
    stop_words=None, 
    token_pattern=u'(?u)\b\w\w+\b', 
    ngram_range=(1, 1), 
    max_df=1.0, 
    min_df=1, 
    max_features=None, 
    vocabulary=None, 
    binary=False, 
    dtype=<type 'numpy.int64'>, 
    norm=u'l2', 
    use_idf=True, 
    smooth_idf=True, 
    sublinear_tf=False)

The definition is very similar to CountVectorizer, so only a few parameters are covered here:

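analyzer: 'word', 'char', 'char_wb', or a callable
norm: the normalization applied to each output row; can be 'l1', 'l2', or None, and defaults to 'l2'
use_idf: whether to apply IDF weighting; enabled by default
smooth_idf: whether to smooth the IDF by adding one to every document frequency, which prevents division by zero; enabled by default
sublinear_tf: whether to rescale tf, replacing tf with 1 + log(tf); disabled by default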
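To show what smooth_idf and norm do in practice, the following sketch checks the fitted weights against the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 documented by scikit-learn, where n is the number of documents and df(t) the number of documents containing t. The three-document corpus is a toy assumption, not taken from the article.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: apple appears in 3 documents, banana in 2, cherry in 1.
docs = ["apple banana", "apple cherry", "apple banana banana"]

tfidf = TfidfVectorizer()  # defaults: norm='l2', use_idf=True, smooth_idf=True
X = tfidf.fit_transform(docs)
dense = X.toarray()

n_docs = len(docs)
doc_freq = (dense > 0).sum(axis=0)  # document frequency of each term

# Smoothed IDF: add one to the document count and to every document frequency.
expected_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1
idf_matches = np.allclose(tfidf.idf_, expected_idf)

# With norm='l2', every row is scaled to unit Euclidean length.
row_norms = np.linalg.norm(dense, axis=1)

print(idf_matches)
print(row_norms)
```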

The TF-IDF model is one of the most commonly used vector space models, and its example code is nearly identical to the CountVectorizer one.

vectorizer = TfidfVectorizer(norm='l1')
vectorizer.fit(corps)
vector = vectorizer.transform(corps)
# with norm='l1', the weights in each row sum to 1
print(vector.shape)

(2, 61)

Next, the last of the three text vectorizers.

HashingVectorizer

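Any language has tens of thousands of terms, while a single document uses only a small fraction of them, so feature vectors built over an explicit vocabulary tend to waste memory. To avoid this, another text featurization method, feature hashing, was devised; its basic principle is described at https://www.cnblogs.com/pinard/p/6688348.html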
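The idea can be sketched in a few lines of plain Python. This is an illustration only, not scikit-learn's implementation: scikit-learn uses a signed 32-bit MurmurHash3, whereas this sketch substitutes MD5 from the standard library, and the function name hashed_counts is hypothetical.

```python
import hashlib
import re

def hashed_counts(text, n_features=8):
    """Map each token to a bucket by hashing it, then count tokens per bucket."""
    vec = [0] * n_features
    # Same token rule as the default token_pattern: words of 2+ characters.
    for token in re.findall(r"\b\w\w+\b", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        # Use the first 4 bytes of the digest as an integer, folded into n_features buckets.
        index = int.from_bytes(digest[:4], "little") % n_features
        vec[index] += 1
    return vec

vec = hashed_counts("hello world hello")
print(vec)
```

No vocabulary is stored: the vector length is fixed up front, and different tokens may collide in the same bucket.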

scikit-learn provides an implementation of this as well. Its constructor is:

class sklearn.feature_extraction.text.HashingVectorizer(input=u'content', 
    encoding=u'utf-8', 
    decode_error=u'strict', 
    strip_accents=None, 
    lowercase=True, 
    preprocessor=None, 
    tokenizer=None, 
    stop_words=None, 
    token_pattern=u'(?u)\b\w\w+\b', 
    ngram_range=(1, 1), 
    analyzer=u'word', 
    n_features=1048576, 
    binary=False, 
    norm=u'l2', 
    alternate_sign=True, 
    non_negative=False, 
    dtype=<type 'numpy.float64'>)

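The most important parameter is:

n_features: the number of features (columns) in the output after hashing

This class is used differently from the previous two: it is stateless, so no fit step is needed and you can call transform directly.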

vectorizer = HashingVectorizer(n_features = 20)
vector = vectorizer.transform(corps)
print(vector.shape)
print(vector.toarray())

(2, 20)
[[-0.26726124 -0.13363062  0.          0.          0.          0.13363062
   0.13363062  0.          0.40089186 -0.13363062  0.53452248  0.26726124
  -0.13363062  0.          0.          0.          0.          0.
  -0.40089186  0.40089186]
 [ 0.          0.12403473  0.          0.          0.24806947 -0.24806947
  -0.24806947 -0.24806947  0.49613894  0.3721042   0.12403473  0.12403473
   0.12403473  0.          0.12403473  0.12403473  0.24806947  0.12403473
  -0.3721042   0.24806947]]
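The negative values above come from the default alternate_sign=True and the fractional values from norm='l2'. As a hedged sketch on a made-up document, turning both off leaves plain non-negative counts, and the vectorizer never learns a vocabulary:

```python
from sklearn.feature_extraction.text import HashingVectorizer

# alternate_sign=False and norm=None leave raw, non-negative counts in each bucket.
hv = HashingVectorizer(n_features=16, alternate_sign=False, norm=None)

# No fit() call: the vectorizer is stateless.
X = hv.transform(["apple banana apple"])
counts = X.toarray()

has_vocabulary = hasattr(hv, "vocabulary_")
print(counts.shape)
print(counts.sum())
print(has_vocabulary)
```

Because there is no stored vocabulary, a column index cannot be mapped back to the term that produced it.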

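Finally, note that scikit-learn's feature extraction module does not currently provide stemming; it has to be combined with a stemmer from NLTK's stem package. This has already been raised in the community, so a stemming option may be added before long.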
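One way to plug a stemmer in is through a custom analyzer. The sketch below is an assumption-laden illustration: toy_stem is a hypothetical suffix stripper used only so the example runs without extra dependencies, and in practice it could be swapped for something like NLTK's PorterStemmer().stem.

```python
from sklearn.feature_extraction.text import CountVectorizer

def toy_stem(token):
    # Hypothetical stand-in for a real stemmer: strips a few common suffixes.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

# Reuse the default preprocessing and tokenization, then stem every token.
base_analyzer = CountVectorizer().build_analyzer()

def stemmed_analyzer(doc):
    return [toy_stem(tok) for tok in base_analyzer(doc)]

stem_vectorizer = CountVectorizer(analyzer=stemmed_analyzer)
stem_vectorizer.fit(["jumps jumped jumping", "walks walking"])
stemmed_terms = sorted(stem_vectorizer.vocabulary_)
print(stemmed_terms)
```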

Last edited: 2018.08.08 22:32:59
