Classification of text documents using sparse features
This is an example showing how scikit-learn can be used to classify documents by topics using a bag-of-words approach. This example uses a scipy.sparse matrix to store the features and demonstrates various classifiers that can efficiently handle sparse matrices.
The dataset used in this example is the 20 newsgroups dataset. It will be automatically downloaded, then cached.
The bar plot indicates the accuracy, training time (normalized) and test time (normalized) of each classifier.
Analysis:
(1) The bag-of-words model
Bag of words: in information retrieval, the bag-of-words model assumes that for a given text, word order, grammar, and syntax are ignored, and the text is treated simply as a collection (or combination) of words. Each word's occurrence is independent of whether any other word occurs; equivalently, the author is assumed to pick each word at any position independently, without being influenced by the preceding sentences.
The bag-of-words model is used in several approaches to text classification. When the classical naive Bayes classifier is applied to text, its conditional-independence assumption leads directly to the bag-of-words model. Other text-classification methods, such as LDA and LSA, also use this model.
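As a quick illustration (not part of the original example), here is a minimal bag-of-words sketch using scikit-learn's CountVectorizer; the two toy documents are invented:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat",
        "the dog sat on the log"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)       # scipy.sparse matrix of raw word counts

print(sorted(vectorizer.vocabulary_))    # the vocabulary; word order and grammar are ignored
print(X.toarray())                       # one row of counts per document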
(2) Drawbacks of the bag-of-words model
BOW has been enormously successful in traditional NLP and has also begun to make its mark in computer vision, but in practice it has some unavoidable shortcomings, for example:
Sparseness: with a large vocabulary, especially one that includes rare words, sparse document vectors are unavoidable;
Polysemy: words with several meanings are common in documents, yet BOW only counts how often each word occurs and ignores the distinctions between those meanings;
Synonymy: likewise, across different documents, or even within the same document, several different words may express the same meaning.
The problems of synonymy and polysemy suggest that words may not be the most basic building blocks of a document: between words and documents there is an implicit layer that we call topics. When writing an article, we first think of its topic and only then choose suitable words to express our point of view. Introducing topics into the BOW model became an active research direction, which leads to Latent Semantic Analysis (LSA) and Probabilistic Latent Semantic Analysis (PLSA); the more elaborate Latent Dirichlet Allocation (LDA) and the many other topic models are not discussed further here.
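As a rough sketch of the LSA idea (again not part of the original example), a TF-IDF matrix can be factored with scikit-learn's TruncatedSVD; the tiny corpus below is invented for illustration:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["space shuttle launch orbit",
        "graphics rendering image pixels",
        "rocket orbit satellite launch",
        "image color pixels rendering"]

X = TfidfVectorizer().fit_transform(docs)     # sparse term-document matrix
lsa = TruncatedSVD(n_components=2, random_state=0)
X_topics = lsa.fit_transform(X)               # each document projected onto 2 latent "topics"

print(X_topics)                               # similar documents land close together in topic space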
(3) Topic models
LFM (Latent Factor Model)
LSI (Latent Semantic Indexing)
LSA (Latent Semantic Analysis)
TM (Topic Model)
EM (Expectation Maximization)
PLSA (Probabilistic Latent Semantic Analysis)
LDA (Latent Dirichlet Allocation)
LCM (Latent Class Model)
LTM (Latent Topic Model)
MF (Matrix Factorization)
PCA (Principal Component Analysis)
FA (Factor Analysis)
ICA (Independent Component Analysis)
LFM, LSI, LDA, and topic models in general all belong to the family of latent semantic analysis techniques; they are conceptually related and, at heart, all try to uncover latent topics or categories. These techniques were first proposed in text mining, and in recent years they have been applied to other fields with good results. Topic models, for example, originated in natural language processing but have since spread to areas such as bioinformatics. In machine learning and natural language processing, a topic model is a statistical model for discovering the abstract topics that occur in a collection of documents.
Python source code: document_classification_20newsgroups.py, shown below:
# Author: Peter Prettenhofer
#         Olivier Grisel
#         Mathieu Blondel
#         Lars Buitinck
# License: BSD 3 clause

from __future__ import print_function

import logging
import numpy as np
from optparse import OptionParser
import sys
from time import time
import pylab as pl

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import RidgeClassifier
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import Perceptron
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import NearestCentroid
from sklearn.utils.extmath import density
from sklearn import metrics


# Display progress logs on stdout
logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s %(levelname)s %(message)s')


# parse commandline arguments
op = OptionParser()
op.add_option("--report",
              action="store_true", dest="print_report",
              help="Print a detailed classification report.")
op.add_option("--chi2_select",
              action="store", type="int", dest="select_chi2",
              help="Select some number of features using a chi-squared test")
op.add_option("--confusion_matrix",
              action="store_true", dest="print_cm",
              help="Print the confusion matrix.")
op.add_option("--top10",
              action="store_true", dest="print_top10",
              help="Print ten most discriminative terms per class"
                   " for every classifier.")
op.add_option("--all_categories",
              action="store_true", dest="all_categories",
              help="Whether to use all categories or not.")
op.add_option("--use_hashing",
              action="store_true",
              help="Use a hashing vectorizer.")
op.add_option("--n_features",
              action="store", type=int, default=2 ** 16,
              help="n_features when using the hashing vectorizer.")
op.add_option("--filtered",
              action="store_true",
              help="Remove newsgroup information that is easily overfit: "
                   "headers, signatures, and quoting.")

(opts, args) = op.parse_args()
if len(args) > 0:
    op.error("this script takes no arguments.")
    sys.exit(1)

print(__doc__)
op.print_help()
print()


###############################################################################
# Load some categories from the training set
if opts.all_categories:
    categories = None
else:
    categories = [
        'alt.atheism',
        'talk.religion.misc',
        'comp.graphics',
        'sci.space',
    ]

if opts.filtered:
    remove = ('headers', 'footers', 'quotes')
else:
    remove = ()

print("Loading 20 newsgroups dataset for categories:")
print(categories if categories else "all")

data_train = fetch_20newsgroups(subset='train', categories=categories,
                                shuffle=True, random_state=42,
                                remove=remove)

data_test = fetch_20newsgroups(subset='test', categories=categories,
                               shuffle=True, random_state=42,
                               remove=remove)
print('data loaded')

categories = data_train.target_names    # for case categories == None


def size_mb(docs):
    return sum(len(s.encode('utf-8')) for s in docs) / 1e6

data_train_size_mb = size_mb(data_train.data)
data_test_size_mb = size_mb(data_test.data)

print("%d documents - %0.3fMB (training set)" % (
    len(data_train.data), data_train_size_mb))
print("%d documents - %0.3fMB (test set)" % (
    len(data_test.data), data_test_size_mb))
print("%d categories" % len(categories))
print()

# split a training set and a test set
y_train, y_test = data_train.target, data_test.target

print("Extracting features from the training dataset using a sparse vectorizer")
t0 = time()
if opts.use_hashing:
    vectorizer = HashingVectorizer(stop_words='english', non_negative=True,
                                   n_features=opts.n_features)
    X_train = vectorizer.transform(data_train.data)
else:
    vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                                 stop_words='english')
    X_train = vectorizer.fit_transform(data_train.data)
duration = time() - t0
print("done in %fs at %0.3fMB/s" % (duration, data_train_size_mb / duration))
print("n_samples: %d, n_features: %d" % X_train.shape)
print()

print("Extracting features from the test dataset using the same vectorizer")
t0 = time()
X_test = vectorizer.transform(data_test.data)
duration = time() - t0
print("done in %fs at %0.3fMB/s" % (duration, data_test_size_mb / duration))
print("n_samples: %d, n_features: %d" % X_test.shape)
print()

if opts.select_chi2:
    print("Extracting %d best features by a chi-squared test" %
          opts.select_chi2)
    t0 = time()
    ch2 = SelectKBest(chi2, k=opts.select_chi2)
    X_train = ch2.fit_transform(X_train, y_train)
    X_test = ch2.transform(X_test)
    print("done in %fs" % (time() - t0))
    print()


def trim(s):
    """Trim string to fit on terminal (assuming 80-column display)"""
    return s if len(s) <= 80 else s[:77] + "..."


# mapping from integer feature name to original token string
if opts.use_hashing:
    feature_names = None
else:
    feature_names = np.asarray(vectorizer.get_feature_names())


###############################################################################
# Benchmark classifiers
def benchmark(clf):
    print('_' * 80)
    print("Training: ")
    print(clf)
    t0 = time()
    clf.fit(X_train, y_train)
    train_time = time() - t0
    print("train time: %0.3fs" % train_time)

    t0 = time()
    pred = clf.predict(X_test)
    test_time = time() - t0
    print("test time: %0.3fs" % test_time)

    score = metrics.f1_score(y_test, pred)
    print("f1-score: %0.3f" % score)

    if hasattr(clf, 'coef_'):
        print("dimensionality: %d" % clf.coef_.shape[1])
        print("density: %f" % density(clf.coef_))

        if opts.print_top10 and feature_names is not None:
            print("top 10 keywords per class:")
            for i, category in enumerate(categories):
                top10 = np.argsort(clf.coef_[i])[-10:]
                print(trim("%s: %s"
                           % (category, " ".join(feature_names[top10]))))
        print()

    if opts.print_report:
        print("classification report:")
        print(metrics.classification_report(y_test, pred,
                                            target_names=categories))

    if opts.print_cm:
        print("confusion matrix:")
        print(metrics.confusion_matrix(y_test, pred))

    print()
    clf_descr = str(clf).split('(')[0]
    return clf_descr, score, train_time, test_time


results = []
for clf, name in (
        (RidgeClassifier(tol=1e-2, solver="lsqr"), "Ridge Classifier"),
        (Perceptron(n_iter=50), "Perceptron"),
        (PassiveAggressiveClassifier(n_iter=50), "Passive-Aggressive"),
        (KNeighborsClassifier(n_neighbors=10), "kNN")):
    print('=' * 80)
    print(name)
    results.append(benchmark(clf))

for penalty in ["l2", "l1"]:
    print('=' * 80)
    print("%s penalty" % penalty.upper())
    # Train Liblinear model
    results.append(benchmark(LinearSVC(loss='l2', penalty=penalty,
                                       dual=False, tol=1e-3)))

    # Train SGD model
    results.append(benchmark(SGDClassifier(alpha=.0001, n_iter=50,
                                           penalty=penalty)))

# Train SGD with Elastic Net penalty
print('=' * 80)
print("Elastic-Net penalty")
results.append(benchmark(SGDClassifier(alpha=.0001, n_iter=50,
                                       penalty="elasticnet")))

# Train NearestCentroid without threshold
print('=' * 80)
print("NearestCentroid (aka Rocchio classifier)")
results.append(benchmark(NearestCentroid()))

# Train sparse Naive Bayes classifiers
print('=' * 80)
print("Naive Bayes")
results.append(benchmark(MultinomialNB(alpha=.01)))
results.append(benchmark(BernoulliNB(alpha=.01)))


class L1LinearSVC(LinearSVC):

    def fit(self, X, y):
        # The smaller C, the stronger the regularization.
        # The more regularization, the more sparsity.
        self.transformer_ = LinearSVC(penalty="l1",
                                      dual=False, tol=1e-3)
        X = self.transformer_.fit_transform(X, y)
        return LinearSVC.fit(self, X, y)

    def predict(self, X):
        X = self.transformer_.transform(X)
        return LinearSVC.predict(self, X)

print('=' * 80)
print("LinearSVC with L1-based feature selection")
results.append(benchmark(L1LinearSVC()))


# make some plots
indices = np.arange(len(results))

results = [[x[i] for x in results] for i in range(4)]

clf_names, score, training_time, test_time = results
training_time = np.array(training_time) / np.max(training_time)
test_time = np.array(test_time) / np.max(test_time)

pl.figure(figsize=(12, 8))
pl.title("Score")
pl.barh(indices, score, .2, label="score", color='r')
pl.barh(indices + .3, training_time, .2, label="training time", color='g')
pl.barh(indices + .6, test_time, .2, label="test time", color='b')
pl.yticks(())
pl.legend(loc='best')
pl.subplots_adjust(left=.25)
pl.subplots_adjust(top=.95)
pl.subplots_adjust(bottom=.05)

for i, c in zip(indices, clf_names):
    pl.text(-.3, i, c)

pl.show()
Analysis:
(1) The __future__ module
Starting with Python 2.1, when a new language feature that is incompatible with older versions of Python first appears in a release, it is disabled by default. To enable such a feature, use a statement of the form from __future__ import feature, for example from __future__ import print_function.
Python 2.6 already supports the new print() syntax, as shown below:
from __future__ import print_function
print("fish", "panda", sep=', ')
(2) The logging module
This is Python's standard logging module; the example script configures it with logging.basicConfig so that progress messages are displayed on stdout.
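A minimal sketch of that pattern (the basicConfig call mirrors the script; the messages themselves are just examples):

import logging

logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s %(levelname)s %(message)s')

logging.info("loading the 20 newsgroups dataset ...")   # printed: INFO meets the threshold
logging.debug("this message is filtered out")           # below the INFO threshold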
(3) The optparse module
optparse is a module dedicated to declaring and parsing command-line options.
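A minimal sketch of how the script declares and parses options (the explicit argument list passed to parse_args here is only for illustration; normally sys.argv is used):

from optparse import OptionParser

op = OptionParser()
op.add_option("--report", action="store_true", dest="print_report",
              help="Print a detailed classification report.")
op.add_option("--n_features", action="store", type=int, default=2 ** 16,
              help="n_features when using the hashing vectorizer.")

(opts, args) = op.parse_args(["--report"])   # parse a sample argument list
print(opts.print_report, opts.n_features)    # True 65536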
(4) The pylab module
Loosely speaking, pylab is a convenience module shipped with matplotlib that pulls numpy's array functions and matplotlib's plotting functions into a single namespace (it is often described as "numpy, scipy, and matplotlib combined").
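A minimal sketch (assuming matplotlib is installed; pylab simply re-exports numpy functions next to the plotting commands):

import pylab as pl

x = pl.linspace(0, 2 * pl.pi, 100)    # numpy's linspace and pi, re-exported by pylab
pl.plot(x, pl.sin(x), label="sin(x)")
pl.legend(loc='best')
pl.show()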
(5) The sklearn module

In [5]: import sklearn
In [6]: sklearn.
sklearn.base                 sklearn.linear_model
sklearn.clone                sklearn.manifold
sklearn.cluster              sklearn.metrics
sklearn.covariance           sklearn.mixture
sklearn.cross_decomposition  sklearn.naive_bayes
sklearn.cross_validation     sklearn.neighbors
sklearn.datasets             sklearn.pipeline
sklearn.decomposition        sklearn.pls
sklearn.externals            sklearn.preprocessing
sklearn.feature_extraction   sklearn.qda
sklearn.feature_selection    sklearn.semi_supervised
sklearn.gaussian_process     sklearn.setup_module
sklearn.grid_search          sklearn.svm
sklearn.hmm                  sklearn.sys
sklearn.isotonic             sklearn.test
sklearn.lda                  sklearn.utils

The above are all the submodules of the sklearn module (as listed by IPython tab completion).
(6) The __name__ and __doc__ attributes
The __name__ attribute is used to determine whether the current module is the program's entry point: when a module is executed directly, its __name__ is '__main__'. When writing programs, it is common to add such a conditional to each module so that the module's functionality can be tested on its own.
A module is itself an object, and every object has a __doc__ attribute that describes what the object does.
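A minimal sketch of both attributes in a hypothetical module (save it as a file, then compare running it directly with importing it):

"""This docstring becomes the module's __doc__ attribute."""


def greet():
    """Return a greeting string."""
    return "hello"

print(__doc__)          # the module docstring
print(greet.__doc__)    # the function docstring

if __name__ == '__main__':
    # Executed only when the file is run directly, not when it is imported,
    # so it can serve as a self-test for the module.
    print(greet())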
(7) Programming strategy
Proceed step by step, and build on solid ground.
1.1.2.1. Ridge Complexity
This method has the same order of complexity as Ordinary Least Squares.
1.1.2.2. Setting the regularization parameter: generalized Cross-Validation
RidgeCV implements ridge regression with built-in cross-validation of the alpha parameter. The object works in the same way as GridSearchCV except that it defaults to Generalized Cross-Validation (GCV), an efficient form of leave-one-out cross-validation:
Analysis:
(1) Generalized linear models refer to linear models and their simple extensions, including ridge regression, the Lasso, LARS, logistic regression, the perceptron, and so on.
(2) Common forms of cross-validation are as follows:
Holdout validation: strictly speaking, holdout validation is not cross-validation, because the data are never used crosswise. A subset of the initial sample is selected at random as the validation data, and the remainder is used as training data; typically, less than one third of the original sample is set aside for validation.
K-fold cross-validation: the initial sample is partitioned into K subsamples; a single subsample is retained as the validation data for the model, and the other K-1 subsamples are used for training. Cross-validation is repeated K times, with each subsample used exactly once as validation data; the K results are averaged (or combined in some other way) to produce a single estimate. The advantage of this method is that all of the randomly generated subsamples are used for both training and validation, each one serving as validation data exactly once; 10-fold cross-validation is the most commonly used (a minimal sketch follows this list).
Leave-one-out validation: as the name suggests, leave-one-out cross-validation (LOOCV) uses a single observation from the original sample as the validation data and the remaining observations as the training data. This is repeated until every observation has served once as validation data. It is equivalent to K-fold cross-validation with K equal to the number of observations in the original sample. In some cases efficient algorithms exist, for example with kernel regression and Tikhonov regularization.
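A minimal K-fold sketch. Note that it uses the sklearn.model_selection API of newer scikit-learn releases; the version used in this article exposed KFold under sklearn.cross_validation with a slightly different signature:

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)    # 10 toy samples with 2 features each
y = np.arange(10)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    # every sample appears in exactly one validation fold
    print("fold %d: train=%s test=%s" % (fold, train_idx, test_idx))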
>>> from sklearn import linear_model
>>> clf = linear_model.RidgeCV(alphas=[0.1, 1.0, 10.0])
>>> clf.fit([[0, 0], [0, 0], [1, 1]], [0, .1, 1])
RidgeCV(alphas=[0.1, 1.0, 10.0], cv=None, fit_intercept=True, scoring=None,
    normalize=False)
>>> clf.alpha_
0.1
Also useful for lower-level tasks is the function lasso_path that computes the coefficients along the full path of possible values.
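A minimal lasso_path sketch (the synthetic data below are invented for illustration):

import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.RandomState(0)
X = rng.randn(50, 3)
y = 3.0 * X[:, 0] + 0.1 * rng.randn(50)   # only the first feature is informative

# coefficients computed along a whole path of alpha values
alphas, coefs, _ = lasso_path(X, y)
print(alphas.shape, coefs.shape)          # (n_alphas,) and (n_features, n_alphas)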
Note:
Regularized Least Squares: another name for the ridge objective above (least squares with a regularization term).
1.1.3. Lasso
The Lasso is a linear model that estimates sparse coefficients. It is useful in some contexts due to its tendency to prefer solutions with fewer parameter values, effectively reducing the number of variables upon which the given solution is dependent. For this reason, the Lasso and its variants are fundamental to the field of compressed sensing. Under certain conditions, it can recover the exact set of non-zero weights (see Compressive sensing: tomography reconstruction with L1 prior (Lasso)).
Mathematically, it consists of a linear model trained with an \ell_1 prior as regularizer. The objective function to minimize is:

\min_w \frac{1}{2 n_{\mathrm{samples}}} \|Xw - y\|_2^2 + \alpha \|w\|_1

The lasso estimate thus solves the minimization of the least-squares penalty with \alpha \|w\|_1 added, where \alpha is a constant and \|w\|_1 is the \ell_1-norm of the coefficient vector.
The implementation in the class Lasso uses coordinate descent as the algorithm to fit the coefficients. See Least Angle Regression for another implementation:
>>> clf = linear_model.Lasso(alpha=0.1)
>>> clf.fit([[0, 0], [1, 1]], [0, 1])
Lasso(alpha=0.1, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute='auto', tol=0.0001,
   warm_start=False)
>>> clf.predict([[1, 1]])
array([ 0.8])
Note:
Regularization and normalization (also called standardization) are often mentioned together as ways of preparing a problem: normalization rescales the data before fitting, while regularization constrains the model's coefficients during fitting. Both aim to make the computation better behaved or the result more generalizable, without changing the nature of the problem.
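A minimal sketch of the distinction (the toy data and the choice of StandardScaler and Ridge are illustrative, not taken from the example above):

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 800.0]])
y = np.array([1.0, 2.0, 3.0])

# Normalization / standardization rescales the data before fitting.
X_scaled = StandardScaler().fit_transform(X)

# Regularization penalizes the coefficients during fitting (the alpha term).
clf = Ridge(alpha=1.0).fit(X_scaled, y)
print(clf.coef_)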
References:
[1] bag of words model: http://zhidao.baidu.com/link?url=Zmq-EkUIF782nYYWmQ8jeCJaP7FLQljK0gqb0L-IHX8baGpxtd4liWFsTkdcY3YzAhbMtMSRuKfapEGtGnA3j_
[2] Bag-of-words model: http://blog.csdn.net/wanwenweifly4/article/details/6575905
[3] Topic models: http://blog.csdn.net/huagong_adu/article/details/7937616
[4] LSA and PLSA: http://www.douban.com/note/63275934/
[5] Topic model (Wikipedia): http://zh.wikipedia.org/zh-cn/%E4%B8%BB%E9%A2%98%E6%A8%A1%E5%9E%8B
[6] Python (Wikipedia): http://zh.wikipedia.org/zh-cn/Python
[7] Python logging module tutorial: http://blog.csdn.net/yatere/article/details/6655445
[8] optparse OptionParser: http://blog.csdn.net/yyt8yyt8/article/details/6999565
[9] Python __doc__: http://blog.csdn.net/pi9nc/article/details/9397059
[10] Cross-validation (Wikipedia): http://zh.wikipedia.org/zh-cn/%E4%BA%A4%E5%8F%89%E9%A9%97%E8%AD%89