Lesson 8 - Sampling Distributions and the Central Limit Theorem

Overview

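Recall what we originally set out to find: where a particular sample falls in the distribution of sample means, not just for this simple population but for very large populations as well.

We can now find it, because we know that in the distribution of means, where each mean comes from a sample of size n, the standard deviation equals the population standard deviation divided by the square root of n. Together with the roughly normal shape of that distribution, this is what the Central Limit Theorem gives us.

It applies not only to these simple populations but to populations in general. Because of the Central Limit Theorem, our population can have almost any shape (the theorem does assume the population has a finite variance).

Suppose we draw a sample from the population and compute its mean, then draw another sample and compute its mean, and keep repeating this.

If we plot the distribution of those means, its shape will be approximately normal, with a standard deviation equal to the population standard deviation divided by the square root of the sample size. We have been calling this quantity SE (the standard error).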
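The relationship SE = σ / √n can be checked with a small simulation. The sketch below is only an illustration under stated assumptions: the exponential population, the sample size of 25, and all variable names are made up here and do not come from the course.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical right-skewed population: 20,000 exponential values.
population = [random.expovariate(1.0) for _ in range(20_000)]
sigma = statistics.pstdev(population)  # population standard deviation

n = 25               # size of each sample
num_samples = 4_000  # how many samples we draw

# Draw many samples (with replacement) and record each sample's mean.
sample_means = [
    statistics.fmean(random.choices(population, k=n))
    for _ in range(num_samples)
]

observed_se = statistics.pstdev(sample_means)  # SD of the sample means
predicted_se = sigma / math.sqrt(n)            # what the formula predicts

print(f"observed SD of sample means: {observed_se:.4f}")
print(f"sigma / sqrt(n):             {predicted_se:.4f}")
```

The two printed values should be close to each other, even though the population itself is far from normal.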

02-Video: Descriptive vs. Inferential Statistics

In this section, we learned about how Inferential Statistics differs from Descriptive Statistics.


Descriptive Statistics

Descriptive statistics is about describing our collected data.


Inferential Statistics

Inferential Statistics is about using our collected data to draw conclusions about a larger population.

image.png

We looked at specific examples that allowed us to identify the following:

  1. Population - our entire group of interest.
  2. Parameter - numeric summary about a population
  3. Sample - subset of the population
  4. Statistic - numeric summary about a sample

05-Text: Descriptive vs. Inferential Statistics

Descriptive vs. Inferential Statistics

In this section, we learned about how Inferential Statistics differs from Descriptive Statistics.

Descriptive Statistics

Descriptive statistics is about describing our collected data using the measures discussed throughout this lesson: measures of center, measures of spread, shape of our distribution, and outliers. We can also use plots of our data to gain a better understanding.

Inferential Statistics

Inferential Statistics is about using our collected data to draw conclusions about a larger population. Performing inferential statistics well requires that we take a sample that accurately represents our population of interest.

A common way to collect data is via a survey. However, surveys may be extremely biased depending on the types of questions that are asked and the way the questions are asked. This is a topic you should think about when tackling the first project.

We looked at specific examples that allowed us to identify the following:

  1. Population - our entire group of interest.
  2. Parameter - numeric summary about a population
  3. Sample - subset of the population
  4. Statistic - numeric summary about a sample
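The four terms can be made concrete in a few lines of Python. This is a minimal sketch with a made-up population (10,000 students, 70% of whom drink coffee); the numbers and names are assumptions for illustration only.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

# Hypothetical population: 10,000 students, 1 = drinks coffee, 0 = does not.
population = [1] * 7_000 + [0] * 3_000

# Parameter: a numeric summary of the whole population (a fixed value).
parameter = statistics.fmean(population)

# Sample: a subset of the population, here 50 students chosen at random.
sample = random.sample(population, k=50)

# Statistic: the same numeric summary, computed on the sample only.
statistic = statistics.fmean(sample)

# A second sample usually gives a different statistic; the parameter is unchanged.
another_statistic = statistics.fmean(random.sample(population, k=50))

print("parameter:", parameter)
print("statistics from two samples:", statistic, another_statistic)
```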

10-Text: Sampling Distribution Notes

Sampling Distribution Notes (a sampling distribution is the distribution of a statistic)

We have already learned some really valuable ideas about sampling distributions:


First, we have defined sampling distributions as the distribution of a statistic.

This is fundamental - I cannot stress the importance of this idea. We simulated the creation of sampling distributions in the previous ipython notebook for samples of size 5 and size 20, which is something you will do more than once in the upcoming concepts and lessons.

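Choosing a different combination of individuals (a different sample) gives a different value of the statistic.

image.png

If we take every possible combination, we get the results below.

image.png

Plotting the statistic produced by each combination gives:

image.png

The distribution above is the sampling distribution.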


Second, we found out some interesting ideas about sampling distributions that will be reiterated later in this lesson as well. We found that for proportions (and also means, as proportions are just the mean of 1 and 0 values), the following characteristics hold.

  • The sampling distribution is centered on the original parameter value.

  • The variance of the sampling distribution decreases as the sample size increases. Specifically, the variance of the sampling distribution is equal to the variance of the original data divided by the sample size used. This is always true for the variance of a sample mean!

image.png

Sampling distribution of the sample mean; its variance is σ² (the variance of the original data) divided by the sample size.
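Both characteristics can be checked by simulation for samples of size 5 and size 20. The sketch below is not the course notebook; the 0/1 population with proportion 0.6 and the helper name `sampling_distribution` are assumptions made for this example.

```python
import random
import statistics

random.seed(2)  # fixed seed so the run is repeatable

# Hypothetical population of 0/1 values with proportion p = 0.6.
population = [1] * 6_000 + [0] * 4_000
p = statistics.fmean(population)
pop_variance = statistics.pvariance(population)  # equals p * (1 - p)

def sampling_distribution(sample_size, repetitions=5_000):
    """Return the sample proportions from repeated samples of one size."""
    return [
        statistics.fmean(random.choices(population, k=sample_size))
        for _ in range(repetitions)
    ]

results = {}
for size in (5, 20):
    proportions = sampling_distribution(size)
    center = statistics.fmean(proportions)
    variance = statistics.pvariance(proportions)
    results[size] = (center, variance)
    print(f"n={size:>2}  center={center:.3f}  variance={variance:.4f}  "
          f"pop_variance/n={pop_variance / size:.4f}")
```

For each sample size, the center should sit near p and the variance near the population variance divided by the sample size.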

Exercise

image.png

Looking Ahead

The rest of this lesson will reinforce some of these ideas that you saw at work in this notebook, but you are already being introduced to some big ideas that will continue to show up again and again.

12-Video: Notation for Parameters vs. Statistics

image.png
image.png

As you saw in this video, we commonly use Greek symbols as parameters and lowercase letters as the corresponding statistics. Sometimes in the literature, you might also see the same Greek symbols with a "hat" to represent that this is an estimate of the corresponding parameter.

Below is a table that provides some of the most common parameters and corresponding statistics, as shown in the video.

Remember that all parameters pertain to a population, while all statistics pertain to a sample.

image.png

Note

Population parameters do not change from sample to sample; only statistics vary with the sample.

15-Video: Two Useful Theorems - Law of Large Numbers

Two important mathematical theorems for working with sampling distributions include:

image.png
  1. Law of Large Numbers
  2. Central Limit Theorem

The Law of Large Numbers says that as our sample size increases, the sample mean gets closer to the population mean, but how did we determine that the sample mean would estimate a population mean in the first place? How would we identify another relationship between parameter and statistic like this in the future?


Three of the most common ways are with the following estimation techniques:

Though these are beyond the scope of what is covered in this course, these are techniques that should be well understood by data scientists who may need to estimate some value that isn't as common as a mean or variance. Using one of these methods to determine a "best estimate" would be a necessity.
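The Law of Large Numbers described above can be seen in a short simulation. This is a rough sketch under assumed values: the exponential population and the chosen sample sizes are hypothetical, and in a single run the error need not shrink at every step, only in general.

```python
import random
import statistics

random.seed(3)  # fixed seed so the run is repeatable

# Hypothetical skewed population whose mean we can compute directly.
population = [random.expovariate(0.5) for _ in range(50_000)]
mu = statistics.fmean(population)

# Absolute error of the sample mean for increasingly large samples.
errors = {}
for n in (10, 100, 1_000, 10_000):
    sample_mean = statistics.fmean(random.choices(population, k=n))
    errors[n] = abs(sample_mean - mu)
    print(f"n={n:>6}  sample mean={sample_mean:.3f}  error={errors[n]:.3f}")
```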

17-Video: Two Useful Theorems - Central Limit Theorem

The Central Limit Theorem states that with a large enough sample size the sampling distribution of the mean will be approximately normally distributed.

The Central Limit Theorem actually applies for these well known statistics:

image.png

And it applies to additional statistics, but it doesn't apply to all statistics! You will see more on this towards the end of this lesson.

20-Video: When Does the Central Limit Theorem Not Work?

In the previous example, you saw how the Central Limit Theorem applies to the sample mean of 100 draws from a right-skewed distribution. However, it did not apply to a sample size of 3 draws from this same distribution. (It does not hold for every sampling distribution.)

Applies:
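The contrast between 3 draws and 100 draws can be sketched by measuring how skewed the sampling distribution of the mean is in each case. The exponential distribution used here is an assumed stand-in for "a right-skewed distribution", and the `skewness` helper is written for this example rather than taken from the course.

```python
import random
import statistics

random.seed(4)  # fixed seed so the run is repeatable

def skewness(values):
    """Population skewness: the mean cubed z-score (0 for a symmetric shape)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean(((v - mean) / sd) ** 3 for v in values)

def sample_means(sample_size, repetitions=5_000):
    """Means of repeated samples drawn from a right-skewed distribution."""
    return [
        statistics.fmean(random.expovariate(1.0) for _ in range(sample_size))
        for _ in range(repetitions)
    ]

skew_n3 = skewness(sample_means(3))
skew_n100 = skewness(sample_means(100))

print(f"skewness of sample means, n=3:   {skew_n3:.2f}")
print(f"skewness of sample means, n=100: {skew_n100:.2f}")
```

A skewness near 0 is consistent with a roughly normal shape; with only 3 draws per sample the distribution of means stays clearly right-skewed.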


image.png

Does not apply:


image.png

In the next concepts, you will see that even with large sample sizes the sampling distribution of certain statistics will never become normally distributed. So how do we know which statistics will follow normal distributions, and which will not?
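One statistic that behaves this way is the sample maximum. Using it here is an assumption for illustration (the notes do not say which statistics the course demonstrates), and the uniform population and helper names are likewise hypothetical. The sketch compares the skewness of the sampling distributions of the mean and the maximum for samples of size 100.

```python
import random
import statistics

random.seed(5)  # fixed seed so the run is repeatable

def skewness(values):
    """Population skewness: the mean cubed z-score (0 for a symmetric shape)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean(((v - mean) / sd) ** 3 for v in values)

def sampling_distribution(statistic, sample_size=100, repetitions=3_000):
    """Apply `statistic` to repeated samples of uniform(0, 1) draws."""
    return [
        statistic([random.random() for _ in range(sample_size)])
        for _ in range(repetitions)
    ]

skew_of_means = skewness(sampling_distribution(statistics.fmean))
skew_of_maxima = skewness(sampling_distribution(max))

print(f"skewness, sample mean:    {skew_of_means:.2f}")
print(f"skewness, sample maximum: {skew_of_maxima:.2f}")
```

The means come out nearly symmetric, while the maxima pile up just below 1 with a long left tail, and a larger sample size does not remove that asymmetry.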

You might already be wondering why the Central Limit Theorem is such a big deal. In our new age of computers, it probably isn't as big of a deal, but more on this coming up soon!

22-Video: Bootstrapping

Bootstrapping is sampling with replacement: an individual that has already been drawn can be drawn again on the next draw, and could even be drawn every time, although that is very unlikely. In Python, NumPy's random.choice (np.random.choice) samples this way by default, so the probability of any number in our set stays the same regardless of how many times it has been chosen. Flipping a coin and rolling a die are similar to bootstrap sampling: rolling a 6 on one roll doesn't make a 6 less likely later.
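A minimal sketch of the difference between sampling with and without replacement follows. It uses the standard library's `random.choices` as a stand-in so that it runs without extra packages; with NumPy, `np.random.choice(data, size=len(data), replace=True)` plays the same role. The six-value data set is made up for the example.

```python
import random

random.seed(6)  # fixed seed so the run is repeatable

data = [1, 2, 3, 4, 5, 6]  # hypothetical observations, e.g. the faces of a die

# With replacement: every draw sees the full data set, so repeats can occur.
bootstrap_sample = random.choices(data, k=len(data))

# Without replacement, for contrast: every value is used exactly once.
shuffled = random.sample(data, k=len(data))

print("with replacement:   ", bootstrap_sample)
print("without replacement:", shuffled)
```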

23-Video: Bootstrapping & The Central Limit Theorem

image.png

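In inferential statistics, we use statistics to infer population parameters. Suppose we treat the sample as if it were the population. The 21 cups in the figure above are only one sample from the population, but if we treat them as the population we can draw bootstrap samples from them and see how the proportion of coffee drinkers changes from one sample to the next.

image.png

As the figure above shows, the two means differ: the second sample still contains 21 individuals, but each one is drawn afresh from the original 21.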

You actually have been bootstrapping to create sampling distributions in earlier parts of this lesson, but this can be extended to a bigger idea.

It turns out, we can do a pretty good job of finding out where a parameter is by using a sampling distribution created from bootstrapping from only a sample. This will be covered in depth in the next lessons.
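A rough sketch of that idea, assuming a sample of 21 cups: the split of 15 coffee drinkers to 6 non-drinkers is invented for illustration (the notes do not give the counts), and the percentile range at the end is only a preview of what later lessons formalize.

```python
import random
import statistics

random.seed(7)  # fixed seed so the run is repeatable

# Hypothetical sample of 21 cups: 1 = coffee drinker, 0 = not.
# The 15/6 split is made up; the original counts are not given in the notes.
sample = [1] * 15 + [0] * 6
sample_proportion = statistics.fmean(sample)

# Resample the 21 cups with replacement many times, recording each proportion.
boot_proportions = sorted(
    statistics.fmean(random.choices(sample, k=len(sample)))
    for _ in range(10_000)
)

# Middle 95% of the bootstrap proportions (a rough percentile range).
lower = boot_proportions[int(0.025 * len(boot_proportions))]
upper = boot_proportions[int(0.975 * len(boot_proportions))]

print(f"sample proportion: {sample_proportion:.3f}")
print(f"middle 95% of bootstrap proportions: {lower:.3f} to {upper:.3f}")
```

The bootstrap proportions form a simulated sampling distribution built from a single sample, centered near the sample proportion.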


25-Video: The Background of Bootstrapping

Two helpful links:

  • You can learn more about Bradley Efron here.

  • Additional notes on why bootstrapping works as a technique for inference can be found here.

26-Video: Why are Sampling Distributions Important

27-Quiz + Text: Recap & Next Steps

Recap

In this lesson, you have learned a ton! You learned:

image.png

Sampling Distributions

  • Sampling Distributions are the distribution of a statistic (any statistic).

  • There are two very important mathematical theorems that are related to sampling distributions: The Law of Large Numbers and The Central Limit Theorem.

  • The Law of Large Numbers states that as a sample size increases, the sample mean will get closer to the population mean. In general, if our statistic is a "good" estimate of a parameter, it will approach our parameter with larger sample sizes.

  • The Central Limit Theorem states that with large enough sample sizes our sample mean will follow an approximately normal distribution, but it turns out this is true for more than just the sample mean.


Bootstrapping

  • Bootstrapping is a technique where we sample from a group with replacement.

  • We can use bootstrapping to simulate the creation of a sampling distribution, which you did many times in this lesson.

  • By bootstrapping and then calculating repeated values of our statistics, we can gain an understanding of the sampling distribution of our statistics.


Looking Ahead

In this lesson you gained the fundamental ideas that will help you with the next two lessons by learning about sampling distributions and bootstrapping. These are going to provide the basis for confidence intervals and hypothesis testing in the next two lessons.
