
1. Data dimensionality

PCA: principal component analysis
PCA is a general-purpose technique applied across many kinds of data analysis, including feature set compression.









6. PCA for data transformation

PCA finds a new coordinate system that is obtained from the old one by translation and rotation only.

PCA moves the center of the coordinate system to the center of the data.

PCA moves the x-axis onto the principal axis of variation, the direction along which the data points vary the most.

PCA moves the y-axis onto an orthogonal, less important direction of variation.
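The two steps above (shift the origin to the data center, then rotate the axes) can be sketched with plain NumPy. This is only an illustration under assumptions: the toy data set is made up, and the new axes are taken from an SVD of the centered data, which is one common way to compute them. The signs of the axes are arbitrary.

```python
import numpy as np

# Made-up 2-D data: points scattered around the line y = x + 1
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t, t + 1.0 + 0.1 * rng.normal(size=200)])

# Step 1 (translation): move the origin to the center of the data
center = X.mean(axis=0)
X_centered = X - center

# Step 2 (rotation): the right singular vectors of the centered data
# are the new axes, ordered from most to least variation
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
new_x_axis, new_y_axis = Vt[0], Vt[1]

# Coordinates of every point in the new coordinate system
X_new = X_centered @ Vt.T

print("center of new system:", center)
print("new x-axis (principal axis):", new_x_axis)
print("new y-axis (orthogonal axis):", new_y_axis)
```

In the new coordinates the data has zero mean, and the first coordinate carries more variance than the second.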


7. Center of the new coordinate system



8. Principal axes of the new coordinate system







After normalizing the PCA component vectors:
Δx (black) = 1/√2
Δy (black) = 1/√2   # new x-axis
Δx (red) = -1/√2
Δy (red) = 1/√2   # new y-axis
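As a rough check of the normalized axis vectors, the sketch below fits sklearn's PCA to a handful of made-up points lying close to the line y = x. The data values are assumptions chosen for illustration. The fitted component vectors come back as unit vectors, and their signs may be flipped relative to the drawing.

```python
import numpy as np
from sklearn.decomposition import PCA

# Made-up points lying close to the line y = x
X = np.array([[1.0, 1.1],
              [2.0, 1.9],
              [3.0, 3.1],
              [4.0, 3.9],
              [5.0, 5.0]])

pca = PCA(n_components=2).fit(X)
first_pc, second_pc = pca.components_

# Each component is a unit vector; for this data both have
# entries close to 1/sqrt(2) in absolute value
print("new x-axis:", first_pc)
print("new y-axis:", second_pc)
print("lengths:", np.linalg.norm(first_pc), np.linalg.norm(second_pc))
```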

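11. Practice finding the new axes

Besides the axes themselves, PCA also yields another important quantity: the spread value of each axis.

12. Which data can be used for PCA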


Part of the beauty of PCA is that the data doesn't have to be perfectly 1D in order to find the principal axis!

13. When does an axis dominate

The long axis dominates when its importance value, i.e. its eigenvalue, is larger than that of the short axis.
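The importance (spread) value of each axis can be read off as an eigenvalue of the covariance matrix. The following is a minimal sketch on a made-up elongated point cloud; the data and the variable names are assumptions for illustration only.

```python
import numpy as np

# Made-up elongated cloud: large spread along one direction, small along the other
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2)) * np.array([3.0, 0.5])

# Eigen-decomposition of the covariance matrix
cov = np.cov(X, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigh returns ascending order

# Sort from largest to smallest so the long axis comes first
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Share of the total variance carried by each axis
ratio = eigenvalues / eigenvalues.sum()
print("eigenvalues:", eigenvalues)
print("variance share per axis:", ratio)
```

When the first share is much larger than the second, the long axis dominates.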


□ decision tree classifier
□ √ linear regression

15. From four features to two


square footage
no. of rooms
school ranking
neighborhood safety


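16. Compression while preserving information

What we actually want to probe are two features: size and neighborhood.
□ SelectKBest (K is the number of features to keep)
□ √ SelectPercentile (specify the percentage of features you want to keep)

If we know how many candidate features there are to begin with and how many we want in the end, SelectPercentile can also be used.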
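A small sketch of the two selectors, assuming four made-up features and a made-up target that depends on only two of them. With four candidates and two wanted, `k=2` and `percentile=50` describe the same request. The scoring function `f_regression` is just one possible choice.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, SelectPercentile, f_regression

# Four made-up features; the target depends only on features 0 and 3
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 4))
y = 3.0 * X[:, 0] + 2.0 * X[:, 3] + 0.1 * rng.normal(size=n)

# Keep the 2 best features by count
k_best = SelectKBest(score_func=f_regression, k=2).fit(X, y)

# Keep the best 50% of features, which is also 2 out of 4
percentile = SelectPercentile(score_func=f_regression, percentile=50).fit(X, y)

kept_by_k = k_best.get_support(indices=True)
kept_by_percentile = percentile.get_support(indices=True)
print("SelectKBest keeps:", kept_by_k)
print("SelectPercentile keeps:", kept_by_percentile)
```

Note that both selectors discard whole features, whereas PCA builds new composite features out of the existing ones.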


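The composite (combined) feature here is also called a principal component. PCA is a very powerful algorithm; in this lesson we mainly discuss it in the context of dimensionality reduction, i.e. shrinking a large set of features down to just a few.


Example: turn square footage and no. of rooms into size.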
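A hedged sketch of this example: the house data below is invented, and standardizing the two features first is an extra step assumed here so that square footage does not dominate simply because of its units.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical houses: [square footage, number of rooms]
houses = np.array([[800, 2],
                   [1200, 3],
                   [1500, 3],
                   [2000, 4],
                   [2600, 5],
                   [3200, 6]], dtype=float)

# Put both features on a comparable scale (an assumption of this sketch)
scaled = StandardScaler().fit_transform(houses)

# Collapse the two features into one composite "size" feature
pca = PCA(n_components=1)
size = pca.fit_transform(scaled)

print("size feature:", size.ravel())
print("fraction of variance kept:", pca.explained_variance_ratio_[0])
```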


18. Maximal variance


  • the willingness/flexibility of an algorithm to learn
  • technical term in statistics -- roughly the 'spread' of a data distribution(similar to standard deviation)




19. Advantages of maximal variance

The principal component of a data set is the direction that has the largest variance. Why?


Why do you think we define the principal component this way?
What is the advantage of looking for the direction that has the largest variance?
When we project this two-dimensional feature space down onto one dimension, why do we project all the data points onto the heavy red line instead of onto the shorter line?
□ low computational complexity
□ √ retains the maximum amount of information from the original data
□ it is just a convention; there is no real reason


20. Maximal variance and information loss

safety problems + school ranking →(PCA) neighborhood quality
find the direction of maximal variance



21. Information loss and principal components


projection onto direction of maximal variance minimizes distance from old(higher-dimensional) data point to its new transformed value
→ minimizes information loss
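The claim above can be checked numerically. The sketch below uses made-up 2-D data and a hypothetical helper, `projection_loss`, that sums the squared distances between each point and its projection onto a given direction. It then compares the first principal direction against the second one and against a brute-force sweep over directions.

```python
import numpy as np

# Made-up correlated 2-D data
rng = np.random.default_rng(2)
t = rng.normal(size=300)
X = np.column_stack([2.0 * t, t]) + 0.2 * rng.normal(size=(300, 2))
Xc = X - X.mean(axis=0)

def projection_loss(Xc, direction):
    """Sum of squared distances from each point to its projection on `direction`."""
    d = direction / np.linalg.norm(direction)
    projected = np.outer(Xc @ d, d)
    return float(np.sum((Xc - projected) ** 2))

# Principal directions from the SVD of the centered data
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
loss_first = projection_loss(Xc, Vt[0])
loss_second = projection_loss(Xc, Vt[1])

# Brute-force sweep over directions between 0 and 180 degrees
angles = np.linspace(0.0, np.pi, 181)
losses = [projection_loss(Xc, np.array([np.cos(a), np.sin(a)])) for a in angles]

print("loss along first PC: ", loss_first)
print("loss along second PC:", loss_second)
print("best loss in sweep:  ", min(losses))
```

No direction in the sweep loses less information than the direction of maximal variance.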

23. PCA for feature transformation

PCA as a general algorithm for feature transformation

25. Review/definition of PCA

  • a systematized way to transform input features into principal components
  • use the principal components as new features in regression/classification
  • you can also rank the principal components: the more variance the data has along a given principal component, the higher that component is ranked. The one with the most variance is the first principal component, the one with the second-most variance is the second, and so on.
  • the principal components are all perpendicular to each other, so the second principal component is mathematically guaranteed not to overlap at all with the first, the third will not overlap with the first or the second, and so on. You can therefore treat them as independent features in a sense.
  • there is a maximum number of principal components you can find: it equals the number of input features in your data set. Usually you will use only the first handful of principal components, but you could go all the way and use the maximum number. In that case, though, you are not really gaining anything; you are just representing your features in a different way. PCA won't give you the wrong answer, but it gives no advantage over the original input features if you use all of the principal components together in a regression or classification task.
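Several of the points above can be verified in a few lines. This is a sketch on made-up data with four features: it checks that the components are mutually perpendicular unit vectors, that they are ranked by variance, and that keeping all of them is just a change of representation that can be undone exactly.

```python
import numpy as np
from sklearn.decomposition import PCA

# Made-up data with 4 correlated features
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))

# With n_components left at its default, PCA keeps every component it can find
pca = PCA().fit(X)
print("number of components:", pca.n_components_)

# Perpendicular unit vectors: components times their transpose is the identity
gram = pca.components_ @ pca.components_.T
print("orthonormal:", np.allclose(gram, np.eye(4)))

# Ranked by variance, largest first
print("variance ratios:", pca.explained_variance_ratio_)

# Using all components loses nothing: the transform can be inverted exactly
X_back = pca.inverse_transform(pca.transform(X))
print("lossless round trip:", np.allclose(X_back, X))
```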

26. Applying PCA to real data

In the next few videos, Katie and Sebastian look at some Enron financial data with an eye toward applying PCA.




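28. PCA in sklearn

def doPCA():
    from sklearn.decomposition import PCA
    pca = PCA(n_components=2)
    pca.fit(data)  # `data` is assumed to be defined earlier (n_samples x n_features)
    return pca

pca = doPCA()
# Variance ratio: the eigenvalues expressed as fractions, i.e. the share of the
# variation in the data accounted for by the first and second principal components
print(pca.explained_variance_ratio_)
first_pc = pca.components_[0]
second_pc = pca.components_[1]

transformed_data = pca.transform(data)
for ii, jj in zip(transformed_data, data):
    pass  # loop body (per-point work, e.g. plotting) is not recorded in these notes
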
29. When to use PCA

  • latent features driving the patterns in data (big shots at Enron)
    Use PCA if you want access to latent features that you think might be showing up in the patterns in your data. Maybe the entire point of what you are doing is to figure out whether there is a latent feature; in other words, you just want to know the size of the first principal component, for example to measure who the big shots are at Enron.
  • dimensionality reduction
    -- visualize high-dimensional data
    Sometimes you have more than two features, so you would have to represent three, four, or many numbers per data point while having only two dimensions in which to draw. What you can do is project the data down onto the first two principal components and draw that scatter plot.
    -- reduce noise
    The hope is that the first or second, i.e. your strongest, principal components capture the actual patterns in the data, while the smaller principal components just represent noisy variations about those patterns. By throwing away the less important principal components, you get rid of that noise.
    -- make other algorithms (regression, classification) work better with fewer inputs (eigenfaces)
    Use PCA as pre-processing before another algorithm, such as a regression or classification task. With very high input dimensionality and a complex classification algorithm, the algorithm can have very high variance, can end up fitting noise in the data, or can end up running really slowly. Of course, the algorithm might work really well for the problem at hand, but one thing you can do is use PCA to reduce the dimensionality of your input features so that the classification algorithm works better.
    In the eigenfaces example, a method of applying PCA to pictures of people, the input space has very high dimensionality because each picture has many, many pixels. Say you want to identify who is pictured in the image, i.e. run some kind of facial identification. With PCA you can reduce the very high input dimensionality to something perhaps a factor of ten lower and feed this into an SVM, which then does the actual classification. The inputs are now the principal components instead of the original pixels of the images.
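The pre-processing idea can be sketched as a two-step pipeline: PCA first, then an SVM on the principal components. To keep the example small and offline, it swaps in sklearn's bundled 8x8 digits images instead of face pictures, and it uses the current `sklearn.model_selection` import path. The choice of 16 components is an arbitrary assumption, not a recommendation.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# 8x8 digit images flattened to 64 pixel features (stand-in for face images)
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# PCA reduces 64 pixel features to 16 principal components,
# and the SVM is trained on those components instead of raw pixels
model = make_pipeline(PCA(n_components=16), SVC(kernel="rbf", gamma="scale"))
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print("test accuracy with 16 components:", accuracy)
```

Wrapping both steps in one pipeline ensures the PCA is fit on the training data only and then reused unchanged on the test data.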

30. PCA for facial recognition

What makes facial recognition in pictures good for PCA?
□ √ pictures of faces generally have high input dimensionality (many pixels)
□ √ faces have general patterns that could be captured in a smaller number of dimensions (two eyes on top, mouth/chin on bottom, etc.)
□ × facial recognition is simple using machine learning (humans do it easily)

31. Eigenfaces code


"""
Faces recognition example using eigenfaces and SVMs

The dataset used in this example is a preprocessed excerpt of the
"Labeled Faces in the Wild", aka LFW_:

  http://vis-www.cs.umass.edu/lfw/lfw-funneled.tgz (233MB)

  .. _LFW: http://vis-www.cs.umass.edu/lfw/

  original source: http://scikit-learn.org/stable/auto_examples/applications/face_recognition.html
"""

print __doc__

from time import time
import logging
import pylab as pl
import numpy as np

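# NOTE: these are the legacy import paths used by the course code. sklearn.cross_validation,
# sklearn.grid_search, and RandomizedPCA were removed in scikit-learn 0.20; on newer versions
# use sklearn.model_selection (train_test_split, GridSearchCV) and PCA(svd_solver='randomized').
from sklearn.cross_validation import train_test_split
from sklearn.datasets import fetch_lfw_people
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.decomposition import RandomizedPCA
from sklearn.svm import SVC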

# Display progress logs on stdout
logging.basicConfig(level=logging.INFO, format='%(asctime)s %(message)s')
# Download the data, if not already on disk and load it as numpy arrays
lfw_people = fetch_lfw_people(min_faces_per_person=70, resize=0.4)

# introspect the images arrays to find the shapes (for plotting)
n_samples, h, w = lfw_people.images.shape

# for machine learning we use the data directly (as relative pixel
# position info is ignored by this model)
X = lfw_people.data
n_features = X.shape[1]

# the label to predict is the id of the person
y = lfw_people.target
target_names = lfw_people.target_names
n_classes = target_names.shape[0]

print "Total dataset size:"
print "n_samples: %d" % n_samples
print "n_features: %d" % n_features
print "n_classes: %d" % n_classes

# Split into a training and testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Compute a PCA (eigenfaces) on the face dataset (treated as unlabeled
# dataset): unsupervised feature extraction / dimensionality reduction
n_components = 150

print "Extracting the top %d eigenfaces from %d faces" % (n_components, X_train.shape[0])
t0 = time()
pca = RandomizedPCA(n_components=n_components, whiten=True).fit(X_train)  # figure out what the principal components are
print "the ratio is ", pca.explained_variance_ratio_  # explained variance ratio of each principal component: 0.19346534, 0.15116844, ...
print "done in %0.3fs" % (time() - t0)

eigenfaces = pca.components_.reshape((n_components, h, w))  # the eigenfaces are the principal components reshaped into images

print "Projecting the input data on the eigenfaces orthonormal basis"
t0 = time()
X_train_pca = pca.transform(X_train)  # transform the data into the principal components representation
X_test_pca = pca.transform(X_test)
print "done in %0.3fs" % (time() - t0)

# Train a SVM classification model

print "Fitting the classifier to the training set"
t0 = time()
param_grid = {
    'C': [1e3, 5e3, 1e4, 5e4, 1e5],
    'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1],
}
# for sklearn version 0.16 or prior, the class_weight parameter value is 'auto'
clf = GridSearchCV(SVC(kernel='rbf', class_weight='balanced'), param_grid)
clf = clf.fit(X_train_pca, y_train)  # SVC using the principal components as the features
print "done in %0.3fs" % (time() - t0)
print "Best estimator found by grid search:"
print clf.best_estimator_

# Quantitative evaluation of the model quality on the test set

print "Predicting the people names on the testing set"
t0 = time()
y_pred = clf.predict(X_test_pca)  # the SVC tries to identify who appears in each test-set picture
print "done in %0.3fs" % (time() - t0)

print classification_report(y_test, y_pred, target_names=target_names)
print confusion_matrix(y_test, y_pred, labels=range(n_classes))

# Qualitative evaluation of the predictions using matplotlib

def plot_gallery(images, titles, h, w, n_row=3, n_col=4):
    """Helper function to plot a gallery of portraits"""
    pl.figure(figsize=(1.8 * n_col, 2.4 * n_row))
    pl.subplots_adjust(bottom=0, left=.01, right=.99, top=.90, hspace=.35)
    for i in range(n_row * n_col):
        pl.subplot(n_row, n_col, i + 1)
        pl.imshow(images[i].reshape((h, w)), cmap=pl.cm.gray)
        pl.title(titles[i], size=12)

# plot the result of the prediction on a portion of the test set

def title(y_pred, y_test, target_names, i):
    pred_name = target_names[y_pred[i]].rsplit(' ', 1)[-1]
    true_name = target_names[y_test[i]].rsplit(' ', 1)[-1]
    return 'predicted: %s\ntrue:      %s' % (pred_name, true_name)

prediction_titles = [title(y_pred, y_test, target_names, i)
                         for i in range(y_pred.shape[0])]

plot_gallery(X_test, prediction_titles, h, w)

# plot the gallery of the most significative eigenfaces

eigenface_titles = ["eigenface %d" % i for i in range(eigenfaces.shape[0])]
plot_gallery(eigenfaces, eigenface_titles, h, w)


The eigenfaces are basically the principal components of the face data.


At the end, the script displays the eigenfaces.

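33. PCA mini-project

We spent a lot of time discussing theory in the PCA lesson, so in this mini-project we will ask you to play with some sklearn code. The eigenfaces code is interesting and rich enough to serve as the testbed for the entire mini-project.

The starter code can be found in pca/eigenfaces.py. It was mostly taken from the example in the sklearn documentation.

Note when running the code that one parameter has changed for the SVC function called on line 94 of pca/eigenfaces.py. For the "class_weight" parameter, the string "auto" is a valid value for sklearn version 0.16 and earlier, but will be removed in 0.19. If you are running sklearn version 0.17 or later, the expected string is "balanced". If you get an error or warning when running pca/eigenfaces.py, make sure that line 98 contains the correct parameter for your installed version of sklearn.

sklearn 0.16 or earlier: class_weight='auto'
sklearn 0.17 or later: class_weight='balanced'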


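We mentioned that PCA orders the principal components: the first has the largest variance, the second has the second-largest, and so on. How much of the variance is explained by the first principal component? By the second?

print "the ratio is ", pca.explained_variance_ratio_  # explained variance ratio of each principal component: 0.19346534, 0.15116844, ...

How much variance does the first principal component explain? 0.19346534
And the second? 0.15116844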
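To see how the explained variance accumulates over the ranked components, a short sketch follows. It uses sklearn's bundled digits data rather than the LFW faces (an assumption made so that no download is needed), so the numbers it prints will differ from the ones quoted above.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Stand-in data set: 8x8 digit images flattened to 64 features
X, _ = load_digits(return_X_y=True)

pca = PCA(n_components=10).fit(X)
ratios = pca.explained_variance_ratio_
cumulative = np.cumsum(ratios)

# One line per principal component: its own share and the running total
for i, (r, c) in enumerate(zip(ratios, cumulative), start=1):
    print("PC%d: %.3f (cumulative %.3f)" % (i, r, c))
```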

We have found that the Pillow module (used in this example) can sometimes cause trouble. If you get an error related to the fetch_lfw_people() command, try the following:

pip install --upgrade PILLOW


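Now you will experiment with keeping different numbers of principal components. In a multiclass classification problem like this one (more than two labels to apply), accuracy is a less intuitive metric than in the two-class case. A more common metric is the F1 score (f1-score).
We will learn about the F1 score in the evaluation metrics lesson, but you should figure out for yourself whether a good classifier is characterized by a high or a low F1 score. You will do this by varying the number of principal components and watching how the F1 score changes in response.
As you add more principal components as features for training your classifier, do you expect it to get better or worse performance?
□ √ could go either way
While ideally adding components should provide additional signal and improve performance, it is possible to end up at a complexity where we overfit.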

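36. F1 score vs. number of principal components used

Change n_components to the following values: [10, 15, 25, 50, 100, 250]. For each number of principal components, note the F1 score for Ariel Sharon. (For 10 principal components, the plotting functions in the code will break, but you should still be able to see the F1 scores.)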

Ariel Sharon f-score
n_components = 150 f-score=0.65
n_components = 10 f-score=0.11
n_components = 15 f-score=0.33
n_components = 50 f-score=0.67
n_components = 100 f-score=0.67
n_components = 250 f-score=0.62

If you see a higher F1 score, does it mean the classifier is doing better, or worse?
□ √ better
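The experiment can be sketched as a loop over n_components that records the per-class F1 score each time. The sketch below runs on sklearn's bundled digits data instead of the LFW faces and tracks the digit 3 instead of Ariel Sharon; both substitutions and the list of component counts are assumptions made to keep the example small.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

scores = {}
for n_components in [2, 5, 10, 20, 40]:
    # Fit PCA on the training data only, then project both sets
    pca = PCA(n_components=n_components).fit(X_train)
    clf = SVC(kernel="rbf", gamma="scale").fit(pca.transform(X_train), y_train)
    y_pred = clf.predict(pca.transform(X_test))

    # average=None returns one F1 score per class; index 3 is the digit "3"
    per_class = f1_score(y_test, y_pred, average=None)
    scores[n_components] = per_class[3]
    print("n_components = %d  f1(digit 3) = %.2f" % (n_components, per_class[3]))
```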

37. Dimensionality reduction and overfitting

Did you see any evidence of overfitting when using a large number of PCs? Does the dimensionality reduction of PCA seem to help performance?
□ √ yes, performance starts to drop with many PCs.

38. Selecting principal components

Think about how many principal components you should look at.
There is no cut-and-dried answer for how many principal components to use; you have to figure it out.

What's a good way to figure out how many PCs to use?
□ × just take the top 10%
□ √ train on different numbers of PCs and see how accuracy responds; cut off when it becomes apparent that adding more PCs doesn't buy you much more discrimination
□ × perform feature selection on the input features before putting them into PCA, then use as many PCs as you have input features
PCA finds a way to combine information from potentially many different input features, so if you throw out input features before doing PCA, you are throwing away information that PCA might be able to rescue. It is fine to do feature selection on the principal components after you have made them, but be very careful about throwing out information before performing PCA.
PCA can be fairly computationally expensive, so if you have a very large input feature space and you know that many of the features are potentially completely irrelevant, go ahead and try tossing them out, but proceed with caution.
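One way to automate "train on different numbers of PCs and see how accuracy responds" is a cross-validated grid search over the PCA step of a pipeline. This is a sketch under assumptions: it uses the bundled digits data rather than the faces, the current `sklearn.model_selection` API, and an arbitrary list of candidate component counts.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# PCA followed by an SVM; the step names are used to address parameters below
pipe = Pipeline([("pca", PCA()), ("svc", SVC(kernel="rbf", gamma="scale"))])

# Try several numbers of principal components with 3-fold cross-validation
candidates = [5, 10, 20, 40]
search = GridSearchCV(pipe, {"pca__n_components": candidates}, cv=3)
search.fit(X, y)

# Mean cross-validated accuracy for each candidate
results = zip(search.cv_results_["param_pca__n_components"],
              search.cv_results_["mean_test_score"])
for n, score in results:
    print("n_components = %s  mean accuracy = %.3f" % (n, score))
print("best:", search.best_params_)
```

Reading the printed table is usually more informative than the single best value: the useful cut-off is where the accuracy stops improving noticeably.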

