Classification of text documents using sparse features
This is an example showing how scikit-learn can be used to classify documents by topics using a bag-of-words approach. This example uses a scipy.sparse matrix to store the features and demonstrates various classifiers that can efficiently handle sparse matrices.
The dataset used in this example is the 20 newsgroups dataset. It will be automatically downloaded, then cached.
The bar plot indicates the accuracy, training time (normalized) and test time (normalized) of each classifier.
Analysis:
(1) The bag-of-words model
Bag of words: in information retrieval, the bag-of-words model assumes that for a given text, word order, grammar, and syntax are ignored, and the text is treated simply as a collection (or combination) of words. Each word's occurrence is independent of whether any other word occurs; equivalently, the author is assumed to pick each word at any position independently, without being influenced by the preceding sentences.
The bag-of-words model is used in several approaches to text classification. When the classical naive Bayes classifier is applied to text, its conditional-independence assumption leads directly to the bag-of-words model. Other text-classification methods, such as LDA and LSA, also use this model.
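As a quick illustration (not part of the original example), here is a minimal bag-of-words sketch using scikit-learn's CountVectorizer; the two toy documents are invented:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat",
        "the dog sat on the log"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)       # scipy.sparse matrix of raw word counts

print(sorted(vectorizer.vocabulary_))    # the vocabulary; word order and grammar are ignored
print(X.toarray())                       # one row of counts per document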
(2) Drawbacks of the bag-of-words model
BOW has been enormously successful in traditional NLP and has also begun to make its mark in computer vision, but in practice it has some unavoidable shortcomings, for example:
Sparseness: with a large vocabulary, especially one that includes rare words, sparse document vectors are unavoidable;
Polysemy: words with several meanings are common in documents, yet BOW only counts how often each word occurs and ignores the distinctions between those meanings;
Synonymy: likewise, across different documents, or even within the same document, several different words may express the same meaning.
The problems of synonymy and polysemy suggest that words may not be the most basic building blocks of a document: between words and documents there is an implicit layer that we call topics. When writing an article, we first think of its topic and only then choose suitable words to express our point of view. Introducing topics into the BOW model became an active research direction, which leads to Latent Semantic Analysis (LSA) and Probabilistic Latent Semantic Analysis (PLSA); the more elaborate Latent Dirichlet Allocation (LDA) and the many other topic models are not discussed further here.
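As a rough sketch of the LSA idea (again not part of the original example), a TF-IDF matrix can be factored with scikit-learn's TruncatedSVD; the tiny corpus below is invented for illustration:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["space shuttle launch orbit",
        "graphics rendering image pixels",
        "rocket orbit satellite launch",
        "image color pixels rendering"]

X = TfidfVectorizer().fit_transform(docs)     # sparse term-document matrix
lsa = TruncatedSVD(n_components=2, random_state=0)
X_topics = lsa.fit_transform(X)               # each document projected onto 2 latent "topics"

print(X_topics)                               # similar documents land close together in topic space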
(3) Topic models
LFM (Latent Factor Model)
LSI (Latent Semantic Indexing)
LSA (Latent Semantic Analysis)
TM (Topic Model)
EM (Expectation Maximization)
PLSA (Probabilistic Latent Semantic Analysis)
LDA (Latent Dirichlet Allocation)
LCM (Latent Class Model)
LTM (Latent Topic Model)
MF (Matrix Factorization)
PCA (Principal Component Analysis)
FA (Factor Analysis)
ICA (Independent Component Analysis)
LFM, LSI, LDA, and topic models in general all belong to the family of latent semantic analysis techniques; they are conceptually related and, at heart, all try to uncover latent topics or categories. These techniques were first proposed in text mining, and in recent years they have been applied to other fields with good results. Topic models, for example, originated in natural language processing but have since spread to areas such as bioinformatics. In machine learning and natural language processing, a topic model is a statistical model for discovering the abstract topics that occur in a collection of documents.
Python source code: document_classification_20newsgroups.py, shown below:
# Author: Peter Prettenhofer
#         Olivier Grisel
#         Mathieu Blondel
#         Lars Buitinck
# License: BSD 3 clause

from __future__ import print_function

import logging
import numpy as np
from optparse import OptionParser
import sys
from time import time
import pylab as pl

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import RidgeClassifier
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import Perceptron
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import NearestCentroid
from sklearn.utils.extmath import density
from sklearn import metrics


# Display progress logs on stdout
logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s %(levelname)s %(message)s')


# parse commandline arguments
op = OptionParser()
op.add_option("--report",
              action="store_true", dest="print_report",
              help="Print a detailed classification report.")
op.add_option("--chi2_select",
              action="store", type="int", dest="select_chi2",
              help="Select some number of features using a chi-squared test")
op.add_option("--confusion_matrix",
              action="store_true", dest="print_cm",
              help="Print the confusion matrix.")
op.add_option("--top10",
              action="store_true", dest="print_top10",
              help="Print ten most discriminative terms per class"
                   " for every classifier.")
op.add_option("--all_categories",
              action="store_true", dest="all_categories",
              help="Whether to use all categories or not.")
op.add_option("--use_hashing",
              action="store_true",
              help="Use a hashing vectorizer.")
op.add_option("--n_features",
              action="store", type=int, default=2 ** 16,
              help="n_features when using the hashing vectorizer.")
op.add_option("--filtered",
              action="store_true",
              help="Remove newsgroup information that is easily overfit: "
                   "headers, signatures, and quoting.")

(opts, args) = op.parse_args()
if len(args) > 0:
    op.error("this script takes no arguments.")
    sys.exit(1)

print(__doc__)
op.print_help()
print()


###############################################################################
# Load some categories from the training set
if opts.all_categories:
    categories = None
else:
    categories = [
        'alt.atheism',
        'talk.religion.misc',
        'comp.graphics',
        'sci.space',
    ]

if opts.filtered:
    remove = ('headers', 'footers', 'quotes')
else:
    remove = ()

print("Loading 20 newsgroups dataset for categories:")
print(categories if categories else "all")

data_train = fetch_20newsgroups(subset='train', categories=categories,
                                shuffle=True, random_state=42,
                                remove=remove)

data_test = fetch_20newsgroups(subset='test', categories=categories,
                               shuffle=True, random_state=42,
                               remove=remove)
print('data loaded')

categories = data_train.target_names    # for case categories == None


def size_mb(docs):
    return sum(len(s.encode('utf-8')) for s in docs) / 1e6

data_train_size_mb = size_mb(data_train.data)
data_test_size_mb = size_mb(data_test.data)

print("%d documents - %0.3fMB (training set)" % (
    len(data_train.data), data_train_size_mb))
print("%d documents - %0.3fMB (test set)" % (
    len(data_test.data), data_test_size_mb))
print("%d categories" % len(categories))
print()

# split a training set and a test set
y_train, y_test = data_train.target, data_test.target

print("Extracting features from the training dataset using a sparse vectorizer")
t0 = time()
if opts.use_hashing:
    vectorizer = HashingVectorizer(stop_words='english', non_negative=True,
                                   n_features=opts.n_features)
    X_train = vectorizer.transform(data_train.data)
else:
    vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                                 stop_words='english')
    X_train = vectorizer.fit_transform(data_train.data)
duration = time() - t0
print("done in %fs at %0.3fMB/s" % (duration, data_train_size_mb / duration))
print("n_samples: %d, n_features: %d" % X_train.shape)
print()

print("Extracting features from the test dataset using the same vectorizer")
t0 = time()
X_test = vectorizer.transform(data_test.data)
duration = time() - t0
print("done in %fs at %0.3fMB/s" % (duration, data_test_size_mb / duration))
print("n_samples: %d, n_features: %d" % X_test.shape)
print()

if opts.select_chi2:
    print("Extracting %d best features by a chi-squared test" %
          opts.select_chi2)
    t0 = time()
    ch2 = SelectKBest(chi2, k=opts.select_chi2)
    X_train = ch2.fit_transform(X_train, y_train)
    X_test = ch2.transform(X_test)
    print("done in %fs" % (time() - t0))
    print()


def trim(s):
    """Trim string to fit on terminal (assuming 80-column display)"""
    return s if len(s) <= 80 else s[:77] + "..."


# mapping from integer feature name to original token string
if opts.use_hashing:
    feature_names = None
else:
    feature_names = np.asarray(vectorizer.get_feature_names())


###############################################################################
# Benchmark classifiers
def benchmark(clf):
    print('_' * 80)
    print("Training: ")
    print(clf)
    t0 = time()
    clf.fit(X_train, y_train)
    train_time = time() - t0
    print("train time: %0.3fs" % train_time)

    t0 = time()
    pred = clf.predict(X_test)
    test_time = time() - t0
    print("test time: %0.3fs" % test_time)

    score = metrics.f1_score(y_test, pred)
    print("f1-score: %0.3f" % score)

    if hasattr(clf, 'coef_'):
        print("dimensionality: %d" % clf.coef_.shape[1])
        print("density: %f" % density(clf.coef_))

        if opts.print_top10 and feature_names is not None:
            print("top 10 keywords per class:")
            for i, category in enumerate(categories):
                top10 = np.argsort(clf.coef_[i])[-10:]
                print(trim("%s: %s"
                           % (category, " ".join(feature_names[top10]))))
        print()

    if opts.print_report:
        print("classification report:")
        print(metrics.classification_report(y_test, pred,
                                            target_names=categories))

    if opts.print_cm:
        print("confusion matrix:")
        print(metrics.confusion_matrix(y_test, pred))

    print()
    clf_descr = str(clf).split('(')[0]
    return clf_descr, score, train_time, test_time


results = []
for clf, name in (
        (RidgeClassifier(tol=1e-2, solver="lsqr"), "Ridge Classifier"),
        (Perceptron(n_iter=50), "Perceptron"),
        (PassiveAggressiveClassifier(n_iter=50), "Passive-Aggressive"),
        (KNeighborsClassifier(n_neighbors=10), "kNN")):
    print('=' * 80)
    print(name)
    results.append(benchmark(clf))

for penalty in ["l2", "l1"]:
    print('=' * 80)
    print("%s penalty" % penalty.upper())
    # Train Liblinear model
    results.append(benchmark(LinearSVC(loss='l2', penalty=penalty,
                                       dual=False, tol=1e-3)))

    # Train SGD model
    results.append(benchmark(SGDClassifier(alpha=.0001, n_iter=50,
                                           penalty=penalty)))

# Train SGD with Elastic Net penalty
print('=' * 80)
print("Elastic-Net penalty")
results.append(benchmark(SGDClassifier(alpha=.0001, n_iter=50,
                                       penalty="elasticnet")))

# Train NearestCentroid without threshold
print('=' * 80)
print("NearestCentroid (aka Rocchio classifier)")
results.append(benchmark(NearestCentroid()))

# Train sparse Naive Bayes classifiers
print('=' * 80)
print("Naive Bayes")
results.append(benchmark(MultinomialNB(alpha=.01)))
results.append(benchmark(BernoulliNB(alpha=.01)))


class L1LinearSVC(LinearSVC):

    def fit(self, X, y):
        # The smaller C, the stronger the regularization.
        # The more regularization, the more sparsity.
        self.transformer_ = LinearSVC(penalty="l1",
                                      dual=False, tol=1e-3)
        X = self.transformer_.fit_transform(X, y)
        return LinearSVC.fit(self, X, y)

    def predict(self, X):
        X = self.transformer_.transform(X)
        return LinearSVC.predict(self, X)

print('=' * 80)
print("LinearSVC with L1-based feature selection")
results.append(benchmark(L1LinearSVC()))


# make some plots
indices = np.arange(len(results))

results = [[x[i] for x in results] for i in range(4)]

clf_names, score, training_time, test_time = results
training_time = np.array(training_time) / np.max(training_time)
test_time = np.array(test_time) / np.max(test_time)

pl.figure(figsize=(12, 8))
pl.title("Score")
pl.barh(indices, score, .2, label="score", color='r')
pl.barh(indices + .3, training_time, .2, label="training time", color='g')
pl.barh(indices + .6, test_time, .2, label="test time", color='b')
pl.yticks(())
pl.legend(loc='best')
pl.subplots_adjust(left=.25)
pl.subplots_adjust(top=.95)
pl.subplots_adjust(bottom=.05)

for i, c in zip(indices, clf_names):
    pl.text(-.3, i, c)

pl.show()
Analysis:
(1) The __future__ module
Starting with Python 2.1, when a new language feature that is incompatible with older versions of Python first appears in a release, it is disabled by default. To enable such a feature, use a statement of the form from __future__ import feature, for example from __future__ import print_function.
Python 2.6 already supports the new print() syntax, as shown below:
from __future__ import print_function
print("fish", "panda", sep=', ')
(2) The logging module
This is Python's standard logging module; the example script configures it with logging.basicConfig so that progress messages are displayed on stdout.
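A minimal sketch of that pattern (the basicConfig call mirrors the script; the messages themselves are just examples):

import logging

logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s %(levelname)s %(message)s')

logging.info("loading the 20 newsgroups dataset ...")   # printed: INFO meets the threshold
logging.debug("this message is filtered out")           # below the INFO threshold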
(3) The optparse module
optparse is a module dedicated to declaring and parsing command-line options.
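A minimal sketch of how the script declares and parses options (the explicit argument list passed to parse_args here is only for illustration; normally sys.argv is used):

from optparse import OptionParser

op = OptionParser()
op.add_option("--report", action="store_true", dest="print_report",
              help="Print a detailed classification report.")
op.add_option("--n_features", action="store", type=int, default=2 ** 16,
              help="n_features when using the hashing vectorizer.")

(opts, args) = op.parse_args(["--report"])   # parse a sample argument list
print(opts.print_report, opts.n_features)    # True 65536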
(4) The pylab module
Loosely speaking, pylab is a convenience module shipped with matplotlib that pulls numpy's array functions and matplotlib's plotting functions into a single namespace (it is often described as "numpy, scipy, and matplotlib combined").
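A minimal sketch (assuming matplotlib is installed; pylab simply re-exports numpy functions next to the plotting commands):

import pylab as pl

x = pl.linspace(0, 2 * pl.pi, 100)    # numpy's linspace and pi, re-exported by pylab
pl.plot(x, pl.sin(x), label="sin(x)")
pl.legend(loc='best')
pl.show()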
(5) The sklearn module

In [5]: import sklearn
In [6]: sklearn.
sklearn.base                 sklearn.linear_model
sklearn.clone                sklearn.manifold
sklearn.cluster              sklearn.metrics
sklearn.covariance           sklearn.mixture
sklearn.cross_decomposition  sklearn.naive_bayes
sklearn.cross_validation     sklearn.neighbors
sklearn.datasets             sklearn.pipeline
sklearn.decomposition        sklearn.pls
sklearn.externals            sklearn.preprocessing
sklearn.feature_extraction   sklearn.qda
sklearn.feature_selection    sklearn.semi_supervised
sklearn.gaussian_process     sklearn.setup_module
sklearn.grid_search          sklearn.svm
sklearn.hmm                  sklearn.sys
sklearn.isotonic             sklearn.test
sklearn.lda                  sklearn.utils

The above are all the submodules of the sklearn module (as listed by IPython tab completion).
(6) The __name__ and __doc__ attributes
The __name__ attribute is used to determine whether the current module is the program's entry point: when a module is executed directly, its __name__ is '__main__'. When writing programs, it is common to add such a conditional to each module so that the module's functionality can be tested on its own.
A module is itself an object, and every object has a __doc__ attribute that describes what the object does.
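A minimal sketch of both attributes in a hypothetical module (save it as a file, then compare running it directly with importing it):

"""This docstring becomes the module's __doc__ attribute."""


def greet():
    """Return a greeting string."""
    return "hello"

print(__doc__)          # the module docstring
print(greet.__doc__)    # the function docstring

if __name__ == '__main__':
    # Executed only when the file is run directly, not when it is imported,
    # so it can serve as a self-test for the module.
    print(greet())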
(7) Programming strategy
Proceed step by step, and build on solid ground.
1.1.2.1. Ridge Complexity
This method has the same order of complexity as Ordinary Least Squares.
1.1.2.2. Setting the regularization parameter: generalized Cross-Validation
RidgeCV implements ridge regression with built-in cross-validation of the alpha parameter. The object works in the same way as GridSearchCV except that it defaults to Generalized Cross-Validation (GCV), an efficient form of leave-one-out cross-validation:
Analysis:
(1) Generalized linear models refer to linear models and their simple extensions, including ridge regression, the Lasso, LARS, logistic regression, the perceptron, and so on.
(2) Common forms of cross-validation are as follows:
Holdout validation: strictly speaking, holdout validation is not cross-validation, because the data are never used crosswise. A subset of the initial sample is selected at random as the validation data, and the remainder is used as training data; typically, less than one third of the original sample is set aside for validation.
K-fold cross-validation: the initial sample is partitioned into K subsamples; a single subsample is retained as the validation data for the model, and the other K-1 subsamples are used for training. Cross-validation is repeated K times, with each subsample used exactly once as validation data; the K results are averaged (or combined in some other way) to produce a single estimate. The advantage of this method is that all of the randomly generated subsamples are used for both training and validation, each one serving as validation data exactly once; 10-fold cross-validation is the most commonly used (a minimal sketch follows this list).
Leave-one-out validation: as the name suggests, leave-one-out cross-validation (LOOCV) uses a single observation from the original sample as the validation data and the remaining observations as the training data. This is repeated until every observation has served once as validation data. It is equivalent to K-fold cross-validation with K equal to the number of observations in the original sample. In some cases efficient algorithms exist, for example with kernel regression and Tikhonov regularization.
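A minimal K-fold sketch. Note that it uses the sklearn.model_selection API of newer scikit-learn releases; the version used in this article exposed KFold under sklearn.cross_validation with a slightly different signature:

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)    # 10 toy samples with 2 features each
y = np.arange(10)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    # every sample appears in exactly one validation fold
    print("fold %d: train=%s test=%s" % (fold, train_idx, test_idx))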
>>> from sklearn import linear_model
>>> clf = linear_model.RidgeCV(alphas=[0.1, 1.0, 10.0])
>>> clf.fit([[0, 0], [0, 0], [1, 1]], [0, .1, 1])
RidgeCV(alphas=[0.1, 1.0, 10.0], cv=None, fit_intercept=True, scoring=None,
    normalize=False)
>>> clf.alpha_
0.1
Also useful for lower-level tasks is the function lasso_path that computes the coefficients along the full path of possible values.
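A minimal lasso_path sketch (the synthetic data below are invented for illustration):

import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.RandomState(0)
X = rng.randn(50, 3)
y = 3.0 * X[:, 0] + 0.1 * rng.randn(50)   # only the first feature is informative

# coefficients computed along a whole path of alpha values
alphas, coefs, _ = lasso_path(X, y)
print(alphas.shape, coefs.shape)          # (n_alphas,) and (n_features, n_alphas)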
Note:
Regularized Least Squares: another name for the ridge objective above (least squares with a regularization term).
1.1.3. Lasso
The Lasso is a linear model that estimates sparse coefficients. It is useful in some contexts due to its tendency to prefer solutions with fewer parameter values, effectively reducing the number of variables upon which the given solution is dependent. For this reason, the Lasso and its variants are fundamental to the field of compressed sensing. Under certain conditions, it can recover the exact set of non-zero weights (see Compressive sensing: tomography reconstruction with L1 prior (Lasso)).
Mathematically, it consists of a linear model trained with an \ell_1 prior as regularizer. The objective function to minimize is:

\min_w \frac{1}{2 n_{\mathrm{samples}}} \|Xw - y\|_2^2 + \alpha \|w\|_1

The lasso estimate thus solves the minimization of the least-squares penalty with \alpha \|w\|_1 added, where \alpha is a constant and \|w\|_1 is the \ell_1-norm of the coefficient vector.
The implementation in the class Lasso uses coordinate descent as the algorithm to fit the coefficients. See Least Angle Regression for another implementation:
>>> clf = linear_model.Lasso(alpha=0.1)
>>> clf.fit([[0, 0], [1, 1]], [0, 1])
Lasso(alpha=0.1, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute='auto', tol=0.0001,
   warm_start=False)
>>> clf.predict([[1, 1]])
array([ 0.8])
Note:
Regularization and normalization (also called standardization) are often mentioned together as ways of preparing a problem: normalization rescales the data before fitting, while regularization constrains the model's coefficients during fitting. Both aim to make the computation better behaved or the result more generalizable, without changing the nature of the problem.
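A minimal sketch of the distinction (the toy data and the choice of StandardScaler and Ridge are illustrative, not taken from the example above):

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 800.0]])
y = np.array([1.0, 2.0, 3.0])

# Normalization / standardization rescales the data before fitting.
X_scaled = StandardScaler().fit_transform(X)

# Regularization penalizes the coefficients during fitting (the alpha term).
clf = Ridge(alpha=1.0).fit(X_scaled, y)
print(clf.coef_)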
References:
[1] bag of words model: http://zhidao.baidu.com/link?url=Zmq-EkUIF782nYYWmQ8jeCJaP7FLQljK0gqb0L-IHX8baGpxtd4liWFsTkdcY3YzAhbMtMSRuKfapEGtGnA3j_
[2] Bag-of-words model: http://blog.csdn.net/wanwenweifly4/article/details/6575905
[3] Topic models: http://blog.csdn.net/huagong_adu/article/details/7937616
[4] LSA and PLSA: http://www.douban.com/note/63275934/
[5] Topic model (Wikipedia): http://zh.wikipedia.org/zh-cn/%E4%B8%BB%E9%A2%98%E6%A8%A1%E5%9E%8B
[6] Python (Wikipedia): http://zh.wikipedia.org/zh-cn/Python
[7] Python logging module tutorial: http://blog.csdn.net/yatere/article/details/6655445
[8] optparse OptionParser: http://blog.csdn.net/yyt8yyt8/article/details/6999565
[9] Python __doc__: http://blog.csdn.net/pi9nc/article/details/9397059
[10] Cross-validation (Wikipedia): http://zh.wikipedia.org/zh-cn/%E4%BA%A4%E5%8F%89%E9%A9%97%E8%AD%89