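Preface
This program is based on a news text classification experiment
and uses a naive Bayes classifier (Naive Bayes Classifier) model for the classification task.
It runs smoothly under Python 3.6; the changes needed for Python 2.x are noted in the comments.
Requirements: pandas, numpy, scikit-learn
For implementations of other classic algorithms, see my other collections.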
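Analysis of experimental results
The naive Bayes model is widely used for classifying large volumes of internet text. Its strong assumption that features are conditionally independent given the class reduces the number of parameters to estimate from exponential to linear in the number of features, which greatly saves memory and computation time. That same assumption, however, means training cannot take the relationships between features into account, so the model tends to perform poorly on classification tasks where the features are strongly correlated.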
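The two claims above can be illustrated with a small sketch. It assumes binary features and counts only the likelihood parameters (class priors are left out); the function names and the probabilities 0.8 and 0.4 are made up for illustration and do not come from the experiment. The first part compares parameter counts with and without the independence assumption, and the second shows how a perfectly correlated (duplicated) feature makes naive Bayes count the same evidence twice.

```python
def full_joint_param_count(n_features, n_classes):
    # Without any independence assumption, each class needs a full table
    # over all 2**n combinations of binary features (minus 1, because the
    # probabilities must sum to 1).
    return n_classes * (2 ** n_features - 1)


def naive_bayes_param_count(n_features, n_classes):
    # With conditional independence, each class needs one Bernoulli
    # parameter per feature.
    return n_classes * n_features


def posterior_with_copies(copies, p_given_pos=0.8, p_given_neg=0.4, prior_pos=0.5):
    # Posterior of the positive class when the same observed feature is
    # fed to naive Bayes `copies` times as if the copies were independent.
    pos = prior_pos * p_given_pos ** copies
    neg = (1 - prior_pos) * p_given_neg ** copies
    return pos / (pos + neg)


if __name__ == "__main__":
    for n in (10, 20, 30):
        print(n, full_joint_param_count(n, 2), naive_bayes_param_count(n, 2))
    for k in (1, 2, 3):
        print(k, round(posterior_with_copies(k), 4))
```

With one copy of the feature the posterior is about 0.67; adding a duplicate pushes it to 0.8 even though no new information was observed. This is the kind of overconfidence that can hurt the model when features are strongly correlated.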
Source code
#import the 20 newsgroups loader from sklearn.datasets
from sklearn.datasets import fetch_20newsgroups
#this call needs internet access: it downloads the news data on first use
news=fetch_20newsgroups(subset='all')
#check the details and scale of the news data
# print(len(news.data))
# print(news.data[0])
#data preprocessing
#note: older scikit-learn releases (before 0.18, e.g. on Python 2.7) provide train_test_split in sklearn.cross_validation instead of sklearn.model_selection
#from sklearn.cross_validation import train_test_split #DeprecationWarning
from sklearn.model_selection import train_test_split #use train_test_split from sklearn.model_selection to split the data
#randomly take 25 percent of the data for testing and the rest for training
X_train,X_test,y_train,y_test = train_test_split(news.data,news.target,test_size=0.25,random_state=33)
#import the text feature extraction module to turn raw text into word-count vectors
from sklearn.feature_extraction.text import CountVectorizer
vec=CountVectorizer()
#learn the vocabulary on the training set only, then reuse it for the test set
X_train=vec.fit_transform(X_train)
X_test=vec.transform(X_test)
#import and initialize the naive bayes model with default settings
from sklearn.naive_bayes import MultinomialNB
mnb=MultinomialNB()
#train the model on the training set
mnb.fit(X_train,y_train)
#predict the target labels of the test set
y_predict=mnb.predict(X_test)
#import classification_report to evaluate model performance
from sklearn.metrics import classification_report
#get accuracy from the score method of the mnb model
print('The accuracy of Naive Bayes Classifier is',mnb.score(X_test,y_test))
#get precision, recall and f1-score from classification_report
print(classification_report(y_test,y_predict,target_names=news.target_names))
Program output on Ubuntu 16.04 with Python 3.6:
The accuracy of Naive Bayes Classifier is 0.8397707979626485
                         precision    recall  f1-score   support
alt.atheism                   0.86      0.86      0.86       201
comp.graphics                 0.59      0.86      0.70       250
comp.os.ms-windows.misc       0.89      0.10      0.17       248
comp.sys.ibm.pc.hardware      0.60      0.88      0.72       240
comp.sys.mac.hardware         0.93      0.78      0.85       242
comp.windows.x                0.82      0.84      0.83       263
misc.forsale                  0.91      0.70      0.79       257
rec.autos                     0.89      0.89      0.89       238
rec.motorcycles               0.98      0.92      0.95       276
rec.sport.baseball            0.98      0.91      0.95       251
rec.sport.hockey              0.93      0.99      0.96       233
sci.crypt                     0.86      0.98      0.91       238
sci.electronics               0.85      0.88      0.86       249
sci.med                       0.92      0.94      0.93       245
sci.space                     0.89      0.96      0.92       221
soc.religion.christian        0.78      0.96      0.86       232
talk.politics.guns            0.88      0.96      0.92       251
talk.politics.mideast         0.90      0.98      0.94       231
talk.politics.misc            0.79      0.89      0.84       188
talk.religion.misc            0.93      0.44      0.60       158
avg / total                   0.86      0.84      0.82      4712
[Finished in 4.6s]
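The f1-score column in the report is the harmonic mean of precision and recall, so a class with high precision but very low recall still gets a low f1-score. A minimal sketch of that calculation follows; the helper name `f1_score_from` is made up for illustration, and the inputs are the rounded values copied from the report, not the unrounded numbers scikit-learn works with internally.

```python
def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded (precision, recall) pairs copied from the report above.
rows = {
    "comp.os.ms-windows.misc": (0.89, 0.10),
    "rec.sport.hockey": (0.93, 0.99),
}

if __name__ == "__main__":
    for name, (p, r) in rows.items():
        print(name, round(f1_score_from(p, r), 2))
```

For rec.sport.hockey this reproduces the reported 0.96. For comp.os.ms-windows.misc the rounded inputs give about 0.18 rather than the reported 0.17, presumably because the report computes f1 from unrounded precision and recall.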
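Corrections are welcome, for both the English and the code. Questions are welcome too; let's keep learning and improving together.
I typed every character of this program myself; please credit the source when reposting.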