sklearn.model_selection.learning_curve

This article is a translation of the function reference for learning_curve on scikit-learn.org,
together with the User Guide section it references.

Signature:

learning_curve(
    estimator,
    X,
    y,
    *,
    groups=None,
    train_sizes=array([0.1  , 0.325, 0.55 , 0.775, 1.   ]),
    cv=None,
    scoring=None,
    exploit_incremental_learning=False,
    n_jobs=None,
    pre_dispatch='all',
    verbose=0,
    shuffle=False,
    random_state=None,
    error_score=nan,
    return_times=False,
    fit_params=None,
)

Overview

Learning curve.

Determines cross-validated training and test scores for different training set sizes.

A cross-validation generator splits the whole dataset k times into training and test data. Subsets of the training set with varying sizes will be used to train the estimator, and a score for each training subset size and for the test set will be computed. Afterwards, the scores will be averaged over all k runs for each training subset size.

Read more in the User Guide.
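To make the shape of the returned arrays concrete, here is a minimal sketch (the dataset and estimator below are arbitrary illustrative choices, not part of the original documentation):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=200, random_state=0)

# One row per training-set size, one column per CV fold (k=5 here).
sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

print(sizes)                      # absolute training-set sizes
print(train_scores.mean(axis=1))  # scores averaged over the k runs
print(test_scores.mean(axis=1))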

The referenced User Guide section is translated below as well:

3.4. Validation curves: plotting scores to evaluate models

Every estimator has its advantages and drawbacks. Its generalization error can be decomposed in terms of bias, variance and noise. The bias of an estimator is its average error for different training sets. The variance of an estimator indicates how sensitive it is to varying training sets. Noise is a property of the data itself.
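For squared loss this decomposition can be written out explicitly; the following standard identity is spelled out here for reference (it is not part of the original page). Writing $\hat{f}$ for the estimator trained on a random training set and $y = f(x) + \varepsilon$ with noise variance $\sigma^2$:

$$\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2} + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}} + \underbrace{\sigma^2}_{\text{noise}}$$

where the expectations are taken over training sets (and over the noise).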

In the plot below we see a function $f(x) = \cos(\frac{3}{2}\pi x)$ and some noisy samples from that function. Three different estimators are used to fit the function: linear regression with polynomial features of degree 1, 4 and 15. The first estimator can at best provide only a poor fit to the samples and the true function because it is too simple (high bias); the second estimator approximates it almost perfectly; and the last one fits the training data perfectly but does not fit the true function very well, i.e. it is very sensitive to varying training data (high variance).


[Figure: degree-1, degree-4 and degree-15 polynomial fits to noisy samples of the true function]
import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score


def true_fun(X):
    return np.cos(1.5 * np.pi * X)


np.random.seed(0)

n_samples = 30
degrees = [1, 4, 15]

X = np.sort(np.random.rand(n_samples))
y = true_fun(X) + np.random.randn(n_samples) * 0.1

plt.figure(figsize=(14, 5))
for i in range(len(degrees)):
    ax = plt.subplot(1, len(degrees), i + 1)
    plt.setp(ax, xticks=(), yticks=())

    polynomial_features = PolynomialFeatures(degree=degrees[i], include_bias=False)
    linear_regression = LinearRegression()
    pipeline = Pipeline(
        [
            ("polynomial_features", polynomial_features),
            ("linear_regression", linear_regression),
        ]
    )
    pipeline.fit(X[:, np.newaxis], y)

    # Evaluate the models using cross-validation
    scores = cross_val_score(
        pipeline, X[:, np.newaxis], y, scoring="neg_mean_squared_error", cv=10
    )

    X_test = np.linspace(0, 1, 100)
    plt.plot(X_test, pipeline.predict(X_test[:, np.newaxis]), label="Model")
    plt.plot(X_test, true_fun(X_test), label="True function")
    plt.scatter(X, y, edgecolor="b", s=20, label="Samples")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.xlim((0, 1))
    plt.ylim((-2, 2))
    plt.legend(loc="best")
    plt.title(
        "Degree {}\nMSE = {:.2e}(+/- {:.2e})".format(
            degrees[i], -scores.mean(), scores.std()
        )
    )
plt.show()

Bias and variance are inherent properties of estimators, and we usually have to select learning algorithms and hyperparameters so that both bias and variance are as low as possible (see Bias-variance dilemma). Another way to reduce the variance of a model is to use more training data. However, you should only collect more training data if the true function is too complex to be approximated well by an estimator with lower variance.

In the simple one-dimensional problem above it is easy to see whether the estimator suffers from bias or variance. In high-dimensional spaces, however, models can become very hard to visualize. For this reason, it is often helpful to use the tools described below.

3.4.1. Validation curve

To validate a model we need a scoring function (see Metrics and scoring: quantifying the quality of predictions), for example accuracy for classifiers. The proper way of choosing multiple hyperparameters of an estimator is of course grid search or similar methods (see Tuning the hyper-parameters of an estimator) that select the hyperparameters with the maximum score on one or several validation sets.
Note that if we optimize the hyperparameters based on a validation score, the validation score is biased and no longer a good estimate of the generalization. To get a proper estimate of the generalization we have to compute the score on another test set.

However, it is sometimes helpful to plot the influence of a single hyperparameter on the training score and the validation score to find out whether the estimator is overfitting or underfitting for some of its values.

The function validation_curve can help in this case:

>>> import numpy as np
>>> from sklearn.model_selection import validation_curve
>>> from sklearn.datasets import load_iris
>>> from sklearn.linear_model import Ridge

>>> np.random.seed(0)
>>> X, y = load_iris(return_X_y=True)
>>> indices = np.arange(y.shape[0])
>>> np.random.shuffle(indices)
>>> X, y = X[indices], y[indices]

>>> train_scores, valid_scores = validation_curve(
...     Ridge(), X, y, param_name="alpha", param_range=np.logspace(-7, 3, 3),
...     cv=5)
>>> train_scores
array([[0.93..., 0.94..., 0.92..., 0.91..., 0.92...],
       [0.93..., 0.94..., 0.92..., 0.91..., 0.92...],
       [0.51..., 0.52..., 0.49..., 0.47..., 0.49...]])
>>> valid_scores
array([[0.90..., 0.84..., 0.94..., 0.96..., 0.93...],
       [0.90..., 0.84..., 0.94..., 0.96..., 0.93...],
       [0.46..., 0.25..., 0.50..., 0.49..., 0.52...]])

If the training score and the validation score are both low, the estimator is underfitting. If the training score is high and the validation score is low, the estimator is overfitting; otherwise it is working very well. A low training score together with a high validation score is usually not possible. Underfitting, overfitting and a well-working model are shown in the plot below, where we vary the parameter $\gamma$ of an SVM on the digits dataset.


[Figure: validation curve for an SVM on the digits dataset, training and cross-validation accuracy versus $\gamma$]
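The rules of thumb above can be restated compactly as code (a hypothetical helper for illustration only; the 0.9 cutoff is an arbitrary assumption, not from scikit-learn):

def diagnose(train_score, valid_score, high=0.9):
    # `high` is an illustrative cutoff for "high score"; pick a value
    # that makes sense for your metric and problem.
    if train_score < high and valid_score < high:
        return "underfitting: both scores are low"
    if train_score >= high and valid_score < high:
        return "overfitting: high training score, low validation score"
    return "working well"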

Code for the figure above:

import matplotlib.pyplot as plt
import numpy as np

from sklearn.datasets import load_digits
from sklearn.svm import SVC
from sklearn.model_selection import validation_curve

X, y = load_digits(return_X_y=True)
subset_mask = np.isin(y, [1, 2])  # binary classification: 1 vs 2
X, y = X[subset_mask], y[subset_mask]

param_range = np.logspace(-6, -1, 5)
train_scores, test_scores = validation_curve(
    SVC(),
    X,
    y,
    param_name="gamma",
    param_range=param_range,
    scoring="accuracy",
    n_jobs=2,
)
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)

plt.title("Validation Curve with SVM")
plt.xlabel(r"$\gamma$")
plt.ylabel("Score")
plt.ylim(0.0, 1.1)
lw = 2
plt.semilogx(
    param_range, train_scores_mean, label="Training score", color="darkorange", lw=lw
)
plt.fill_between(
    param_range,
    train_scores_mean - train_scores_std,
    train_scores_mean + train_scores_std,
    alpha=0.2,
    color="darkorange",
    lw=lw,
)
plt.semilogx(
    param_range, test_scores_mean, label="Cross-validation score", color="navy", lw=lw
)
plt.fill_between(
    param_range,
    test_scores_mean - test_scores_std,
    test_scores_mean + test_scores_std,
    alpha=0.2,
    color="navy",
    lw=lw,
)
plt.legend(loc="best")
plt.show()

3.4.2. Learning curve

A learning curve shows the validation and training score of an estimator for varying numbers of training samples. It is a tool to find out how much we benefit from adding more training data and whether the estimator suffers more from a variance error or a bias error. Consider the following example, where we plot the learning curves of a naive Bayes classifier and an SVM.

For naive Bayes, both the validation score and the training score converge to a value that is quite low as the size of the training set increases. Thus, we will probably not benefit much from more training data.

In contrast, for the same amount of data, the training score of the SVM is much greater than the validation score. Adding more training samples will most likely increase generalization.

[Figure: learning curves of naive Bayes and an SVM (RBF kernel) on the digits dataset]

We can use the function learning_curve to generate the values that are required to plot such a learning curve (the numbers of samples that have been used, the average scores on the training sets and the average scores on the validation sets):

>>> from sklearn.model_selection import learning_curve
>>> from sklearn.svm import SVC

>>> train_sizes, train_scores, valid_scores = learning_curve(
...     SVC(kernel='linear'), X, y, train_sizes=[50, 80, 110], cv=5)
>>> train_sizes
array([ 50, 80, 110])
>>> train_scores
array([[0.98..., 0.98 , 0.98..., 0.98..., 0.98...],
       [0.98..., 1.   , 0.98..., 0.98..., 0.98...],
       [0.98..., 1.   , 0.98..., 0.98..., 0.99...]])
>>> valid_scores
array([[1. ,  0.93...,  1. ,  1. ,  0.96...],
       [1. ,  0.96...,  1. ,  1. ,  0.96...],
       [1. ,  0.96...,  1. ,  1. ,  0.96...]])
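Continuing from the snippet above, these arrays can be turned into a plot like the learning-curve figure by averaging over folds and drawing score bands, mirroring the validation-curve code earlier (a sketch, not part of the original page):

import matplotlib.pyplot as plt

train_mean, train_std = train_scores.mean(axis=1), train_scores.std(axis=1)
valid_mean, valid_std = valid_scores.mean(axis=1), valid_scores.std(axis=1)

plt.plot(train_sizes, train_mean, "o-", color="darkorange", label="Training score")
plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std,
                 alpha=0.2, color="darkorange")
plt.plot(train_sizes, valid_mean, "o-", color="navy", label="Cross-validation score")
plt.fill_between(train_sizes, valid_mean - valid_std, valid_mean + valid_std,
                 alpha=0.2, color="navy")
plt.xlabel("Number of training samples")
plt.ylabel("Score")
plt.legend(loc="best")
plt.show()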

Parameters

estimator : object type that implements the "fit" and "predict" methods
An object of that type which is cloned for each validation.

X : array-like of shape (n_samples, n_features)
Training vector, where n_samples is the number of samples and
n_features is the number of features.

y : array-like of shape (n_samples,) or (n_samples, n_outputs)
Target relative to X for classification or regression;
None for unsupervised learning.

groups : array-like of shape (n_samples,), default=None
Group labels for the samples used while splitting the dataset into
train/test set. Only used in conjunction with a "Group" :term:`cv`
instance (e.g., :class:`GroupKFold`).
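For instance, with a group-aware splitter (a sketch; the data and group labels below are made up for illustration):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GroupKFold, learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=100, random_state=0)
est = DecisionTreeClassifier(random_state=0)
groups = np.repeat(np.arange(10), 10)  # ten groups of ten samples each

# Samples from the same group never appear in both train and test.
learning_curve(est, X, y, groups=groups, cv=GroupKFold(n_splits=5))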

train_sizes : array-like of shape (n_ticks,), default=np.linspace(0.1, 1.0, 5)
Relative or absolute numbers of training examples that will be used to
generate the learning curve. If the dtype is float, it is regarded as a
fraction of the maximum size of the training set (that is determined
by the selected validation method), i.e. it has to be within (0, 1].
Otherwise it is interpreted as absolute sizes of the training sets.
Note that for classification the number of samples usually has to
be big enough to contain at least one sample from each class.
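Reusing X, y and est from the sketch above, both spellings select the same training-set sizes (with cv=5 on 100 samples the maximum training size is 80):

# Floats in (0, 1] are fractions of the maximum training-set size:
sizes, _, _ = learning_curve(est, X, y, train_sizes=[0.25, 0.5, 1.0], cv=5)
print(sizes)  # [20 40 80]

# Integers are absolute numbers of training samples:
sizes, _, _ = learning_curve(est, X, y, train_sizes=[20, 40, 80], cv=5)
print(sizes)  # [20 40 80]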

cv : int, cross-validation generator or an iterable, default=None
Determines the cross-validation splitting strategy.
Possible inputs for cv are:

- None, to use the default 5-fold cross validation,
- int, to specify the number of folds in a `(Stratified)KFold`,
- :term:`CV splitter`,
- An iterable yielding (train, test) splits as arrays of indices.

For int/None inputs, if the estimator is a classifier and ``y`` is
either binary or multiclass, :class:`StratifiedKFold` is used. In all
other cases, :class:`KFold` is used. These splitters are instantiated
with `shuffle=False` so the splits will be the same across calls.

Refer :ref:`User Guide <cross_validation>` for the various
cross-validation strategies that can be used here.

.. versionchanged:: 0.22
    ``cv`` default value if None changed from 3-fold to 5-fold.
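As an example of passing an explicit CV splitter instead of an integer (again reusing est, X and y from above):

from sklearn.model_selection import ShuffleSplit

# 10 random 80%/20% splits; fixing random_state makes the splits, and
# hence the curve, reproducible across calls.
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
learning_curve(est, X, y, cv=cv)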

scoring : str or callable, default=None
A str (see model evaluation documentation) or
a scorer callable object / function with signature
scorer(estimator, X, y).

exploit_incremental_learning : bool, default=False
If the estimator supports incremental learning, this will be
used to speed up fitting for different training set sizes.
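Only estimators exposing partial_fit qualify; a sketch with SGDClassifier (the choice of estimator is an assumption of this example, any incremental learner would do):

from sklearn.linear_model import SGDClassifier

# SGDClassifier implements partial_fit, so each larger training subset
# can extend the already-fitted model instead of refitting from scratch.
learning_curve(SGDClassifier(random_state=0), X, y,
               exploit_incremental_learning=True, cv=5)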

n_jobs : int, default=None
Number of jobs to run in parallel. Training the estimator and computing
the score are parallelized over the different training and test sets.
None means 1 unless in a :obj:`joblib.parallel_backend` context.
-1 means using all processors. See :term:`Glossary <n_jobs>`
for more details.

pre_dispatch : int or str, default='all'
Number of predispatched jobs for parallel execution (default is
all). The option can reduce the allocated memory. The str can
be an expression like '2*n_jobs'.

verbose : int, default=0
Controls the verbosity: the higher, the more messages.

shuffle : bool, default=False
Whether to shuffle training data before taking prefixes of it
based on ``train_sizes``.

random_state : int, RandomState instance or None, default=None
Used when shuffle is True. Pass an int for reproducible
output across multiple function calls.
See :term:`Glossary <random_state>`.
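By default the training subsets are prefixes of each CV training split; shuffling first avoids any ordering effects (a sketch, reusing est, X and y from above):

# Shuffle before taking prefixes; the seed makes the shuffling, and
# hence the reported scores, reproducible.
learning_curve(est, X, y, shuffle=True, random_state=0)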

error_score : 'raise' or numeric, default=np.nan
Value to assign to the score if an error occurs in estimator fitting.
If set to 'raise', the error is raised.
If a numeric value is given, FitFailedWarning is raised.

.. versionadded:: 0.20

return_times : bool, default=False
Whether to return the fit and score times.
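With return_times=True the function returns five arrays instead of three, which is useful for plotting scalability (a sketch, reusing est, X and y from above):

sizes, train_scores, test_scores, fit_times, score_times = learning_curve(
    est, X, y, return_times=True)

# Both timing arrays have shape (n_ticks, n_cv_folds); averaging over
# folds gives fit time as a function of training-set size.
print(fit_times.mean(axis=1))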

fit_params : dict, default=None
Parameters to pass to the fit method of the estimator.

.. versionadded:: 0.24
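For example, passing sample weights through to est.fit (a sketch; it assumes scikit-learn >= 0.24, where array-like fit params of length n_samples are indexed to match each training subset):

import numpy as np

w = np.ones(y.shape[0])  # illustrative weights, one per sample
learning_curve(est, X, y, fit_params={"sample_weight": w})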

Returns

train_sizes_abs : array of shape (n_unique_ticks,)
Numbers of training examples that have been used to generate the
learning curve. Note that the number of ticks might be less
than n_ticks because duplicate entries will be removed.

train_scores : array of shape (n_ticks, n_cv_folds)
Scores on training sets.

test_scores : array of shape (n_ticks, n_cv_folds)
Scores on test set.

fit_times : array of shape (n_ticks, n_cv_folds)
Times spent for fitting in seconds. Only present if return_times
is True.

score_times : array of shape (n_ticks, n_cv_folds)
Times spent for scoring in seconds. Only present if return_times
is True.
