Linear Regression with Multiple Variables (Part 1)

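Setting Up the Programming Environment

Octave is recommended here; MATLAB works too if you already have it installed. Installation is not covered in these notes, so please consult the relevant documentation.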

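Multiple Features

We previously studied linear regression with one variable. We now continue with the house-price example to study linear regression with multiple variables.

As shown in the figure above, we add more features to the house-price model, such as the number of rooms, the number of floors, and the age of the house. We let x₁, x₂, x₃ and x₄ denote the floor area, the number of rooms, the number of floors, and the age of the house, respectively.

With these extra features we also need some new notation:

n: the number of features
x⁽ⁱ⁾: the i-th training example, i.e. the i-th row of the feature matrix
x_j⁽ⁱ⁾: the value of the j-th feature in the i-th row of the feature matrix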

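The hypothesis for linear regression with multiple variables is therefore:
　　h_θ(x) = θ₀ + θ₁x₁ + θ₂x₂ + ··· + θ_nx_n

This formula has n+1 parameters but only n variables. To simplify it, we introduce x₀ = 1 (that is, x₀⁽ⁱ⁾ = 1 for every training example), so that it becomes:
　　h_θ(x) = θ₀x₀ + θ₁x₁ + θ₂x₂ + ··· + θ_nx_n

Now there are n+1 parameters and n+1 variables, so both can be treated as (n+1)-dimensional vectors: θ is the parameter vector and x is the feature vector. The formula then simplifies to:
　　h_θ(x) = θᵀx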
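The vectorized hypothesis can be sketched in a few lines of Python (used here only for illustration; the notes themselves recommend Octave). The function name `hypothesis` and the parameter values are assumptions made up for this example, not values from the course.

```python
def hypothesis(theta, features):
    """Return h_theta(x) = theta^T x for one training example.

    `features` holds x_1..x_n; the constant x_0 = 1 is prepended here.
    """
    x = [1.0] + list(features)
    if len(theta) != len(x):
        raise ValueError("theta must have n + 1 entries")
    return sum(t * xi for t, xi in zip(theta, x))


# Hypothetical parameters theta_0..theta_4 and one house described by
# floor area, number of rooms, number of floors, and age.
theta = [80.0, 0.1, 10.0, 5.0, -2.0]
house = [2104, 5, 1, 45]
print(hypothesis(theta, house))
```

Prepending x₀ = 1 inside the function keeps the caller's feature list free of the artificial constant while still letting θ₀ take part in the same dot product.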

Supplementary Notes

Multiple Features

Linear regression with multiple variables is also known as "multivariate linear regression".

We now introduce notation for equations where we can have any number of input variables.

x_j⁽ⁱ⁾ = value of feature j in the i-th training example
x⁽ⁱ⁾ = the input (features) of the i-th training example

Note:

m = the number of training examples
n = the number of features

The multivariable form of the hypothesis function accommodating these multiple features is as follows:
　　h_θ(x) = θ₀+θ₁x₁+θ₂x₂+···+θ_nx_n

In order to develop intuition about this function, we can think about θ₀ as the basic price of a house, θ₁ as the price per square meter, θ₂ as the price per floor, etc. x₁ will be the number of square meters in the house, x₂ the number of floors, etc.

Using the definition of matrix multiplication, our multivariable hypothesis function can be concisely represented as:
　　h_θ(x) = [θ₀ θ₁ ··· θ_n] · [x₀ x₁ ··· x_n]ᵀ = θᵀx

This is a vectorization of our hypothesis function for one training example.

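Gradient Descent for Multiple Variables

As in linear regression with one variable, we define a cost function J:
　　J(θ₀, θ₁, ···, θ_n) = (1/(2m)) · Σᵢ₌₁ᵐ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾)²

The goal is the same as in the one-variable case: find the parameters that minimize the cost function. There we used gradient descent to find them, and we use gradient descent again here.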
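As a rough illustration, the cost function can be computed as below. This is a sketch under assumptions: the name `compute_cost` and the toy data are invented for the example, and each row of `X` is assumed to already start with x₀ = 1.

```python
def compute_cost(X, y, theta):
    """J(theta) = 1/(2m) * sum over i of (h_theta(x_i) - y_i)^2.

    Each row of X is one training example that already starts with x_0 = 1.
    """
    m = len(y)
    total = 0.0
    for row, target in zip(X, y):
        prediction = sum(t * xj for t, xj in zip(theta, row))
        total += (prediction - target) ** 2
    return total / (2 * m)


# Hypothetical toy data generated from y = 0 + 1*x1 + 2*x2.
X = [[1.0, 1.0, 2.0], [1.0, 2.0, 1.0], [1.0, 3.0, 3.0]]
y = [5.0, 4.0, 9.0]
print(compute_cost(X, y, [0.0, 0.0, 0.0]))  # cost of the all-zero guess
print(compute_cost(X, y, [0.0, 1.0, 2.0]))  # cost of the generating parameters
```

The second call uses the parameters the data was generated from, so every prediction matches its target and the cost is zero; any other θ gives a larger value.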

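That is:
　　repeat until convergence {
　　　　θ_j := θ_j − α · ∂/∂θ_j J(θ₀, θ₁, ···, θ_n)
　　}　(update all θ_j simultaneously, j = 0, 1, ···, n)

Working out the partial derivative gives:
　　repeat until convergence {
　　　　θ_j := θ_j − α · (1/m) · Σᵢ₌₁ᵐ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾) · x_j⁽ⁱ⁾
　　}　(update all θ_j simultaneously, j = 0, 1, ···, n)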
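The update rule can be sketched as a short batch gradient descent loop. The function names, the toy data, the learning rate of 0.1 and the iteration count are all assumptions chosen so that this small example converges; they are not recommendations from the course.

```python
def predict(theta, row):
    """h_theta(x) for one row that already starts with x_0 = 1."""
    return sum(t * xj for t, xj in zip(theta, row))


def gradient_descent(X, y, theta, alpha, iterations):
    """Run batch gradient descent and return the final parameters."""
    m = len(y)
    theta = list(theta)
    for _ in range(iterations):
        # Prediction errors h_theta(x_i) - y_i, all computed from the old theta.
        errors = [predict(theta, row) - target for row, target in zip(X, y)]
        # Partial derivative of J with respect to each theta_j.
        gradient = [
            sum(e * row[j] for e, row in zip(errors, X)) / m
            for j in range(len(theta))
        ]
        # Simultaneous update: every theta_j uses the same old errors.
        theta = [t - alpha * g for t, g in zip(theta, gradient)]
    return theta


# Hypothetical toy data generated from y = 0 + 1*x1 + 2*x2.
X = [[1.0, 1.0, 2.0], [1.0, 2.0, 1.0], [1.0, 3.0, 3.0]]
y = [5.0, 4.0, 9.0]
theta = gradient_descent(X, y, [0.0, 0.0, 0.0], alpha=0.1, iterations=5000)
print(theta)  # should end up close to the generating parameters [0, 1, 2]
```

Building the whole gradient list before touching `theta` is what makes the update simultaneous; updating θ₀ first and then reusing it for θ₁ would be a different (and incorrect) algorithm.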

Supplementary Notes

Gradient Descent for Multiple Variables

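The gradient descent equation itself is generally the same form; we just have to repeat it for our n features:
　　repeat until convergence {
　　　　θ₀ := θ₀ − α · (1/m) · Σᵢ₌₁ᵐ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾) · x₀⁽ⁱ⁾
　　　　θ₁ := θ₁ − α · (1/m) · Σᵢ₌₁ᵐ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾) · x₁⁽ⁱ⁾
　　　　θ₂ := θ₂ − α · (1/m) · Σᵢ₌₁ᵐ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾) · x₂⁽ⁱ⁾
　　　　···
　　}

In other words:
　　repeat until convergence {
　　　　θ_j := θ_j − α · (1/m) · Σᵢ₌₁ᵐ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾) · x_j⁽ⁱ⁾　for j := 0, 1, ···, n
　　}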

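For comparison, gradient descent with one variable (n = 1) is the same rule restricted to two updates, where x₀⁽ⁱ⁾ = 1 drops out of the first one:
　　θ₀ := θ₀ − α · (1/m) · Σᵢ₌₁ᵐ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾)
　　θ₁ := θ₁ − α · (1/m) · Σᵢ₌₁ᵐ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾) · x₁⁽ⁱ⁾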

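Feature Scaling

With multiple features, gradient descent converges faster when all features are on a similar scale.

Take house-price prediction again and suppose we use only two features, the floor area and the number of rooms, where the area ranges from 0 to 2000 square feet and the number of rooms from 0 to 5. We then draw a contour plot of the cost function with the two parameters on the horizontal and vertical axes.

The plot shows that the contours are very elongated ellipses, and the red path in the figure shows that gradient descent needs many iterations to converge.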

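To make gradient descent converge faster, we therefore apply feature scaling and mean normalization. Feature scaling divides each feature by its range (the maximum value minus the minimum value), so that the new range has a width of just 1. Mean normalization subtracts the feature's mean from each value, so that the new mean is 0. Applied together, they bring every feature roughly into −1 ≤ x_i ≤ 1. Both are usually applied with the formula:
　　x_i := (x_i − μ_i) / s_i

where μ_i is the mean of feature i and s_i is its standard deviation (or its range, max − min).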
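The scaling formula can be sketched as a small helper that handles one feature column at a time. The name `scale_feature`, the `use_std` switch and the sample values are assumptions for illustration; the standard-deviation variant below uses the population standard deviation, which is one of several reasonable conventions.

```python
def scale_feature(values, use_std=False):
    """Return (scaled values, mu, s) for one feature column.

    s is the range (max - min) by default, or the population
    standard deviation when use_std is True.
    """
    m = len(values)
    mu = sum(values) / m
    if use_std:
        s = (sum((v - mu) ** 2 for v in values) / m) ** 0.5
    else:
        s = max(values) - min(values)
    if s == 0:
        raise ValueError("a constant feature cannot be scaled")
    return [(v - mu) / s for v in values], mu, s


# Hypothetical floor areas (square feet) and room counts.
areas = [2104.0, 1416.0, 1534.0, 852.0]
rooms = [5.0, 3.0, 3.0, 2.0]

scaled_areas, mu_area, s_area = scale_feature(areas)
scaled_rooms, mu_rooms, s_rooms = scale_feature(rooms)
print(mu_area, s_area)
print(scaled_areas)
print(scaled_rooms)

# Dividing by the standard deviation gives different numbers.
std_scaled_areas, _, std_area = scale_feature(areas, use_std=True)
print(std_scaled_areas)
```

The helper returns μ and s alongside the scaled column because any new input must be scaled with the same two numbers before it is passed to the trained hypothesis.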

Supplementary Notes

Feature Scaling

We can speed up gradient descent by having each of our input values in roughly the same range. This is because θ will descend quickly on small ranges and slowly on large ranges, and so will oscillate inefficiently down to the optimum when the variables are very uneven.

The way to prevent this is to modify the ranges of our input variables so that they are all roughly the same. Ideally:
　　-1 ≤ x_(i) ≤1
or
　　-0.5 ≤ x_(i) ≤0.5

These aren't exact requirements; we are only trying to speed things up. The goal is to get all input variables into roughly one of these ranges, give or take a few.

Two techniques to help with this are feature scaling and mean normalization. Feature scaling involves dividing the input values by the range (i.e. the maximum value minus the minimum value) of the input variable, resulting in a new range of just 1. Mean normalization involves subtracting the average value for an input variable from the values for that input variable, resulting in a new average value for the input variable of just zero. To implement both of these techniques, adjust your input values as shown in this formula:
　　x_i := (x_i − μ_i) / s_i

Where μ_i is the average of all the values for feature (i) and s_i is the range of values (max - min), or s_i is the standard deviation.

Note that dividing by the range, or dividing by the standard deviation, give different results.

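Learning Rate α

The number of iterations gradient descent needs to converge differs from model to model, and it is hard to tell in advance how many it will take. In practice we plot the cost function against the number of iterations and use the curve to judge whether gradient descent has converged.

An automatic convergence test can also be used. (Note: an automatic convergence test is an algorithm that decides whether gradient descent has converged, usually by comparing the decrease in the cost function J(θ) in one iteration with a suitable threshold ε: if the decrease is smaller than ε, gradient descent is declared converged. Choosing ε is very difficult, however, so in practice we still judge convergence by inspecting the plot.)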
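A minimal sketch of such an automatic convergence test is shown below. The function names, the status strings, the toy data and the default ε = 10⁻³ are assumptions for this example. The recorded `history` list is exactly the data one would plot against the iteration number.

```python
def cost(X, y, theta):
    """J(theta) for rows of X that already start with x_0 = 1."""
    m = len(y)
    return sum(
        (sum(t * xj for t, xj in zip(theta, row)) - target) ** 2
        for row, target in zip(X, y)
    ) / (2 * m)


def gradient_step(X, y, theta, alpha):
    """One simultaneous gradient descent update of all parameters."""
    m = len(y)
    errors = [
        sum(t * xj for t, xj in zip(theta, row)) - target
        for row, target in zip(X, y)
    ]
    return [
        t - alpha * sum(e * row[j] for e, row in zip(errors, X)) / m
        for j, t in enumerate(theta)
    ]


def run_with_convergence_test(X, y, theta, alpha, epsilon=1e-3, max_iterations=10000):
    """Iterate until J decreases by less than epsilon in one iteration."""
    history = [cost(X, y, theta)]
    for _ in range(max_iterations):
        theta = gradient_step(X, y, theta, alpha)
        history.append(cost(X, y, theta))
        decrease = history[-2] - history[-1]
        if decrease < 0:
            # J went up: alpha is probably too large.
            return theta, history, "cost increased"
        if decrease < epsilon:
            return theta, history, "converged"
    return theta, history, "max iterations reached"


# Hypothetical toy data generated from y = 0 + 1*x1 + 2*x2.
X = [[1.0, 1.0, 2.0], [1.0, 2.0, 1.0], [1.0, 3.0, 3.0]]
y = [5.0, 4.0, 9.0]
theta, history, status = run_with_convergence_test(X, y, [0.0, 0.0, 0.0], alpha=0.1)
print(status, len(history) - 1, history[-1])
```

On this data the test stops while θ is still some distance from the generating parameters, which illustrates why the choice of ε is delicate: a threshold that looks small can still end training early.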

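Every iteration of gradient descent is affected by the learning rate α. If α is too small, gradient descent needs many iterations to converge. If α is too large, gradient descent may fail: the cost function may not decrease on every iteration, and the updates may overshoot the minimum so that the algorithm never converges.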

Supplementary Notes

Learning Rate

Debugging gradient descent. Make a plot with the number of iterations on the x-axis and the cost function J(θ) on the y-axis. If J(θ) ever increases, then you probably need to decrease α.

Automatic convergence test. Declare convergence if J(θ) decreases by less than ε in one iteration, where ε is some small value such as 10⁻³. However, in practice it is difficult to choose this threshold value.

It has been proven that if learning rate α is sufficiently small, then J(θ) will decrease on every iteration.

To summarize:

if α is too small: slow convergence.
if α is too large: may not decrease on every iteration and thus may not converge.
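The two cases in the summary can be observed on a toy problem. The sketch below records J(θ) for three learning rates; the data and the specific values 0.001, 0.1 and 0.5 are assumptions picked so that this particular example shows slow convergence, reasonable convergence and divergence respectively.

```python
def cost_history(X, y, alpha, iterations):
    """Return J(theta) before training and after each iteration, from theta = 0."""
    m = len(y)
    theta = [0.0] * len(X[0])
    history = []
    for _ in range(iterations + 1):
        errors = [
            sum(t * xj for t, xj in zip(theta, row)) - target
            for row, target in zip(X, y)
        ]
        history.append(sum(e * e for e in errors) / (2 * m))
        # Simultaneous update using the errors computed above.
        theta = [
            t - alpha * sum(e * row[j] for e, row in zip(errors, X)) / m
            for j, t in enumerate(theta)
        ]
    return history


# Hypothetical toy data generated from y = 0 + 1*x1 + 2*x2.
X = [[1.0, 1.0, 2.0], [1.0, 2.0, 1.0], [1.0, 3.0, 3.0]]
y = [5.0, 4.0, 9.0]

too_small = cost_history(X, y, alpha=0.001, iterations=20)
reasonable = cost_history(X, y, alpha=0.1, iterations=20)
too_large = cost_history(X, y, alpha=0.5, iterations=20)

for name, history in [("0.001", too_small), ("0.1", reasonable), ("0.5", too_large)]:
    print(name, history[0], history[-1])
```

Plotting each history against the iteration index gives the debugging plot described above: a nearly flat curve for the smallest α, a fast drop for the middle one, and a curve that climbs for the largest.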

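Features and Polynomial Regression

Having covered linear regression with multiple variables, we now turn to polynomial regression, which lets us use the machinery of linear regression to fit much more complicated, even nonlinear, functions.

For example, sometimes we want to fit the data with a quadratic model (h_θ(x) = θ₀ + θ₁x₁ + θ₂x₁²), and sometimes with a cubic model (h_θ(x) = θ₀ + θ₁x₁ + θ₂x₁² + θ₃x₁³), and so on.

We usually inspect the data first and then decide which kind of model to try.

We can then define the new features:

x₂ = x₁²
x₃ = x₁³
······

This turns these polynomial regression models back into linear regression models. (Note: because polynomial regression squares, cubes, or otherwise transforms the variable, the new features have very different ranges, so feature scaling is necessary before running gradient descent.)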
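The substitution can be sketched as follows: build the powers of a single input as separate features, then scale each column. The function names and the house sizes are assumptions for illustration, and the scaling uses the range (max − min) variant.

```python
def polynomial_features(x, degree):
    """Return [x, x^2, ..., x^degree] for a single input value x."""
    return [x ** d for d in range(1, degree + 1)]


def mean_normalize_columns(rows):
    """Scale each column to (value - column mean) / (column max - column min)."""
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    ranges = [max(col) - min(col) for col in columns]
    return [
        [(v - mu) / s for v, mu, s in zip(row, means, ranges)]
        for row in rows
    ]


# Hypothetical house sizes; each row becomes [x1, x1^2, x1^3].
sizes = [10.0, 20.0, 30.0, 40.0]
rows = [polynomial_features(x, 3) for x in sizes]
raw_ranges = [max(col) - min(col) for col in zip(*rows)]
print(rows[0])
print(raw_ranges)

scaled = mean_normalize_columns(rows)
scaled_ranges = [max(col) - min(col) for col in zip(*scaled)]
print(scaled_ranges)
```

After scaling, a constant x₀ = 1 would be prepended to each row and the rows fed to ordinary linear regression; the model is still linear in θ even though it is cubic in the original input.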

Supplementary Notes

Features and Polynomial Regression

We can improve our features and the form of our hypothesis function in a couple of different ways.

We can combine multiple features into one. For example, we can combine x₁ and x₂ into a new feature x₃ by taking x₁ * x₂.

Polynomial Regression

Our hypothesis function need not be linear (a straight line) if that does not fit the data well.

We can change the behavior or curve of our hypothesis function by making it a quadratic, cubic or square root function (or any other form).

For example, if our hypothesis function is h_θ(x) = θ₀ + θ₁x₁ then we can create additional features based on x₁, to get the quadratic function h_θ(x) = θ₀ + θ₁x₁ + θ₂x₁² or the cubic function h_θ(x) = θ₀ + θ₁x₁ + θ₂x₁² + θ₃x₁³

In the cubic version, we have created new features x₂ = x₁² and x₃ = x₁³.

To make it a square root function, we could do:
　　h_θ(x) = θ₀ + θ₁x₁ + θ₂√x₁

One important thing to keep in mind is, if you choose your features this way then feature scaling becomes very important.

e.g., if x₁ has range 1 to 1000, then the range of x₁² becomes 1 to 1000000 and that of x₁³ becomes 1 to 1000000000.

Last edited: 2017.12.10 00:36:00
