【机器学习实验】scikit-learn的主要模块和基本使用

引言

对于一些开始搞机器学习算法有害怕下手的小朋友，该如何快速入门，这让人挺挣扎的。

在从事数据科学的人中，最常用的工具就是R和Python了，每个工具都有其利弊，但是Python在各方面都相对胜出一些，这是因为scikit-learn库实现了很多机器学习算法。

我们假设输入时一个特征矩阵或者csv文件。

首先，数据应该被载入内存中。

scikit-learn的实现使用了NumPy中的arrays，所以，我们要使用NumPy来载入csv文件。

以下是从UCI机器学习数据仓库中下载的数据。

importnumpyasnpimporturllib# url with dataseturl ="http://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"# download the fileraw_data = urllib.urlopen(url)# load the CSV file as a numpy matrixdataset = np.loadtxt(raw_data, delimiter=",")# separate the data from the target attributesX = dataset[:,0:7]y = dataset[:,8]

我们要使用该数据集作为例子，将特征矩阵作为X，目标变量作为y。

数据归一化(Data Normalization)

大多数机器学习算法中的梯度方法对于数据的缩放和尺度都是很敏感的，在开始跑算法之前，我们应该进行归一化或者标准化的过程，这使得特征数据缩放到0-1范围中。scikit-learn提供了归一化的方法：

fromsklearnimportpreprocessing# normalize the data attributesnormalized_X = preprocessing.normalize(X)# standardize the data attributesstandardized_X = preprocessing.scale(X)

特征选择(Feature Selection)

在解决一个实际问题的过程中，选择合适的特征或者构建特征的能力特别重要。这成为特征选择或者特征工程。

特征选择时一个很需要创造力的过程，更多的依赖于直觉和专业知识，并且有很多现成的算法来进行特征的选择。

下面的树算法(Tree algorithms)计算特征的信息量：

fromsklearnimportmetricsfromsklearn.ensembleimportExtraTreesClassifiermodel = ExtraTreesClassifier()model.fit(X, y)# display the relative importance of each attributeprint(model.feature_importances_)

算法的使用

scikit-learn实现了机器学习的大部分基础算法，让我们快速了解一下。

逻辑回归

大多数问题都可以归结为二元分类问题。这个算法的优点是可以给出数据所在类别的概率。

fromsklearnimportmetricsfromsklearn.linear_modelimportLogisticRegressionmodel = LogisticRegression()model.fit(X, y)print(model)# make predictionsexpected = ypredicted = model.predict(X)# summarize the fit of the modelprint(metrics.classification_report(expected, predicted))print(metrics.confusion_matrix(expected, predicted))

结果：

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,

intercept_scaling=1, penalty=l2, random_state=None, tol=0.0001)

precision recall f1-score support

0.0 0.79 0.89 0.84 500

1.0 0.74 0.55 0.63 268

avg / total 0.77 0.77 0.77 768

[[447 53]

[120 148]]

朴素贝叶斯

这也是著名的机器学习算法，该方法的任务是还原训练样本数据的分布密度，其在多类别分类中有很好的效果。

fromsklearnimportmetricsfromsklearn.naive_bayesimportGaussianNBmodel = GaussianNB()model.fit(X, y)print(model)# make predictionsexpected = ypredicted = model.predict(X)# summarize the fit of the modelprint(metrics.classification_report(expected, predicted))print(metrics.confusion_matrix(expected, predicted))

结果：

GaussianNB()

precision recall f1-score support

0.0 0.80 0.86 0.83 500

1.0 0.69 0.60 0.64 268

avg / total 0.76 0.77 0.76 768

[[429 71]

[108 160]]

k近邻

k近邻算法常常被用作是分类算法一部分，比如可以用它来评估特征，在特征选择上我们可以用到它。

fromsklearnimportmetricsfromsklearn.neighborsimportKNeighborsClassifier# fit a k-nearest neighbor model to the datamodel = KNeighborsClassifier()model.fit(X, y)print(model)# make predictionsexpected = ypredicted = model.predict(X)# summarize the fit of the modelprint(metrics.classification_report(expected, predicted))print(metrics.confusion_matrix(expected, predicted))

结果：

KNeighborsClassifier(algorithm=auto, leaf_size=30, metric=minkowski,

n_neighbors=5, p=2, weights=uniform)

precision recall f1-score support

0.0 0.82 0.90 0.86 500

1.0 0.77 0.63 0.69 268

avg / total 0.80 0.80 0.80 768

[[448 52]

[ 98 170]]

决策树

分类与回归树(Classification and Regression Trees ,CART)算法常用于特征含有类别信息的分类或者回归问题，这种方法非常适用于多分类情况。

fromsklearnimportmetricsfromsklearn.treeimportDecisionTreeClassifier# fit a CART model to the datamodel = DecisionTreeClassifier()model.fit(X, y)print(model)# make predictionsexpected = ypredicted = model.predict(X)# summarize the fit of the modelprint(metrics.classification_report(expected, predicted))print(metrics.confusion_matrix(expected, predicted))

结果：

DecisionTreeClassifier(compute_importances=None, criterion=gini,

max_depth=None, max_features=None, min_density=None,

min_samples_leaf=1, min_samples_split=2, random_state=None,

splitter=best)

precision recall f1-score support

0.0 1.00 1.00 1.00 500

1.0 1.00 1.00 1.00 268

avg / total 1.00 1.00 1.00 768

[[500 0]

[ 0 268]]

支持向量机

SVM是非常流行的机器学习算法，主要用于分类问题，如同逻辑回归问题，它可以使用一对多的方法进行多类别的分类。

fromsklearnimportmetricsfromsklearn.svmimportSVC# fit a SVM model to the datamodel = SVC()model.fit(X, y)print(model)# make predictionsexpected = ypredicted = model.predict(X)# summarize the fit of the modelprint(metrics.classification_report(expected, predicted))print(metrics.confusion_matrix(expected, predicted))

结果：

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,

kernel=rbf, max_iter=-1, probability=False, random_state=None,

shrinking=True, tol=0.001, verbose=False)

precision recall f1-score support

0.0 1.00 1.00 1.00 500

1.0 1.00 1.00 1.00 268

avg / total 1.00 1.00 1.00 768

[[500 0]

[ 0 268]]

除了分类和回归算法外，scikit-learn提供了更加复杂的算法，比如聚类算法，还实现了算法组合的技术，如Bagging和Boosting算法。

如何优化算法参数

一项更加困难的任务是构建一个有效的方法用于选择正确的参数，我们需要用搜索的方法来确定参数。scikit-learn提供了实现这一目标的函数。

下面的例子是一个进行正则参数选择的程序：

importnumpyasnpfromsklearn.linear_modelimportRidgefromsklearn.grid_searchimportGridSearchCV# prepare a range of alpha values to testalphas = np.array([1,0.1,0.01,0.001,0.0001,0])# create and fit a ridge regression model, testing each alphamodel = Ridge()grid = GridSearchCV(estimator=model, param_grid=dict(alpha=alphas))grid.fit(X, y)print(grid)# summarize the results of the grid searchprint(grid.best_score_)print(grid.best_estimator_.alpha)

结果：

GridSearchCV(cv=None,

estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,

normalize=False, solver=auto, tol=0.001),

estimator__alpha=1.0, estimator__copy_X=True,

estimator__fit_intercept=True, estimator__max_iter=None,

estimator__normalize=False, estimator__solver=auto,

estimator__tol=0.001, fit_params={}, iid=True, loss_func=None,

n_jobs=1,

param_grid={‘alpha’: array([ 1.00000e+00, 1.00000e-01, 1.00000e-02, 1.00000e-03,

1.00000e-04, 0.00000e+00])},

pre_dispatch=2*n_jobs, refit=True, score_func=None, scoring=None,

verbose=0)

0.282118955686

1.0

有时随机从给定区间中选择参数是很有效的方法，然后根据这些参数来评估算法的效果进而选择最佳的那个。

importnumpyasnpfromscipy.statsimportuniformassp_randfromsklearn.linear_modelimportRidgefromsklearn.grid_searchimportRandomizedSearchCV# prepare a uniform distribution to sample for the alpha parameterparam_grid = {'alpha': sp_rand()}# create and fit a ridge regression model, testing random alpha valuesmodel = Ridge()rsearch = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=100)rsearch.fit(X, y)print(rsearch)# summarize the results of the random parameter searchprint(rsearch.best_score_)print(rsearch.best_estimator_.alpha)

结果：

RandomizedSearchCV(cv=None,

estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,

normalize=False, solver=auto, tol=0.001),

estimator__alpha=1.0, estimator__copy_X=True,

estimator__fit_intercept=True, estimator__max_iter=None,

estimator__normalize=False, estimator__solver=auto,

estimator__tol=0.001, fit_params={}, iid=True, n_iter=100,

n_jobs=1,

param_distributions={‘alpha’:

小结

我们总体了解了使用scikit-learn库的大致流程，希望这些总结能让初学者沉下心来，一步一步尽快的学习如何去解决具体的机器学习问题。

转载请注明作者Jason Ding及其出处

GitCafe博客主页(http://jasonding1354.gitcafe.io/)

Github博客主页(http://jasonding1354.github.io/)

CSDN博客(http://blog.csdn.net/jasonding1354)

简书主页(http://www.jianshu.com/users/2bd9b48f6ea8/latest_articles)

最后编辑于：2017.11.27 03:11:57

人面猴
序言：七十年代末，一起剥皮案震惊了整个滨河市，随后出现的几起案子，更是在滨河造成了极大的恐慌，老刑警刘岩，带你破解...
沈念sama阅读 202,009评论 5赞 474
死咒
序言：滨河连续发生了三起死亡事件，死亡现场离奇诡异，居然都是意外死亡，警方通过查阅死者的电脑和手机，发现死者居然都...
沈念sama阅读 84,808评论 2赞 378
救了他两次的神仙让他今天三更去死
文/潘晓璐我一进店门，熙熙楼的掌柜王于贵愁眉苦脸地迎上来，“玉大人，你说我怎么就摊上这事。” “怎么了？”我有些...
开封第一讲书人阅读 148,891评论 0赞 335
道士缉凶录：失踪的卖姜人
文/不坏的土叔我叫张陵，是天一观的道长。经常有香客问我，道长，这世上最难降的妖魔是什么？我笑而不...
开封第一讲书人阅读 54,283评论 1赞 272
港岛之恋（遗憾婚礼）
正文为了忘掉前任，我火速办了婚礼，结果婚礼上，老公的妹妹穿的比我还像新娘。我一直安慰自己，他们只是感情好，可当我...
茶点故事阅读 63,285评论 5赞 363
恶毒庶女顶嫁案：这布局不是一般人想出来的
文/花漫我一把揭开白布。她就那样静静地躺着，像睡着了一般。火红的嫁衣衬着肌肤如雪。梳的纹丝不乱的头发上，一...
开封第一讲书人阅读 48,409评论 1赞 281
城市分裂传说
那天，我揣着相机与录音，去河边找鬼。笑死，一个胖子当着我的面吹牛，可吹牛的内容都是我干的。我是一名探鬼主播，决...
沈念sama阅读 37,809评论 3赞 393
双鸳鸯连环套：你想象不到人心有多黑
文/苍兰香墨我猛地睁开眼，长吁一口气：“原来是场噩梦啊……” “哼！你这毒妇竟也来了？” 一声冷哼从身侧响起，我...
开封第一讲书人阅读 36,487评论 0赞 256
万荣杀人案实录
序言：老挝万荣一对情侣失踪，失踪者是张志新（化名）和其女友刘颖，没想到半个月后，有当地人在树林里发现了一具尸体，经...
沈念sama阅读 40,680评论 1赞 295
护林员之死
正文独居荒郊野岭守林人离奇死亡，尸身上长有42处带血的脓包…… 初始之章·张勋以下内容为张勋视角年9月15日...
茶点故事阅读 35,499评论 2赞 318
白月光启示录
正文我和宋清朗相恋三年，在试婚纱的时候发现自己被绿了。大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
茶点故事阅读 37,548评论 1赞 329
活死人
序言：一个原本活蹦乱跳的男人离奇死亡，死状恐怖，灵堂内的尸体忽然破棺而出，到底是诈尸还是另有隐情，我是刑警宁泽，带...
沈念sama阅读 33,268评论 4赞 318
日本核电站爆炸内幕
正文年R本政府宣布，位于F岛的核电站，受9级特大地震影响，放射性物质发生泄漏。R本人自食恶果不足惜，却给世界环境...
茶点故事阅读 38,815评论 3赞 304
男人毒药：我在死后第九天来索命
文/蒙蒙一、第九天我趴在偏房一处隐蔽的房顶上张望。院中可真热闹，春花似锦、人声如沸。这庄子的主人今日做“春日...
开封第一讲书人阅读 29,872评论 0赞 19
一桩弑父案，背后竟有这般阴谋
文/苍兰香墨我抬头看了看天上的太阳。三九已至，却和暖如春，着一层夹袄步出监牢的瞬间，已是汗流浃背。一阵脚步声响...
开封第一讲书人阅读 31,102评论 1赞 258
情欲美人皮
我被黑心中介骗来泰国打工，没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留，地道东北人。一个月前我还...
沈念sama阅读 42,683评论 2赞 348
代替公主和亲
正文我出身青楼，却偏偏与公主长得像，于是被迫代替她去往敌国和亲。传闻我的和亲对象是个残疾皇子，可洞房花烛夜当晚...
茶点故事阅读 42,253评论 2赞 341

【机器学习实验】scikit-learn的主要模块和基本使用

推荐阅读更多精彩内容