这几天在优达Udacity学了用R做数据分析,以前也学过,不过没有学得这么系统,把今天学的过程和作业贴在这里。有兴趣的同学可以点击链接去听课
连续三天每天3个多小时,终于把核心内容学完了!现在对编R语言也有点信心了。其实工具都只是用来辅助自己的,最终数据分析的功夫还是在于基础原理。就像一个人无论用Markdown、word、有道云、还是记事本写文章,最终决定文章水平的还是写作水平。所以学会这些软件以后不能自我感动,还是多花时间提升底层的理论水平比较重要。工具毕竟是工具,最终决定能不能找到工作的还是对于业务的理解。
Lesson 5
Multivariate Data
Notes:
Moira Perceived Audience Size Colored by Age
Notes:
Third Qualitative Variable
Notes:
pf <- read.delim('pseudo_facebook.tsv')
library(ggplot2)
ggplot(aes(x = gender, y = age),
data = subset(pf, !is.na(gender))) +
geom_boxplot()+
stat_summary(fun.y=mean,geom="point",shape=4,color="blue")
ggplot(aes(x = age, y = friend_count),
data = subset(pf, !is.na(gender))) +
geom_line(aes(color=gender),stat='summary',fun.y=median)
Plotting Conditional Summaries
Notes:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
#chain fuctions together %>%
pf.fc_by_age_gender <- pf%>%
filter(!is.na(gender))%>%
group_by(age,gender)%>%
summarise(fri_mean=mean(friend_count),
fri_median=median(friend_count))%>%
arrange(age)
ggplot(aes(x=age,y=fri_median,color=gender),
data=subset(pf.fc_by_age_gender,!is.na(gender)))+
geom_line()
Thinking in Ratios
Notes:
Wide and Long Format
Notes:
Reshaping Data
Notes:
library(reshape2)
pf.fc_by_age_gender.wide <- dcast(pf.fc_by_age_gender,
age ~ gender,
value.var='fri_median')
head(pf.fc_by_age_gender.wide)
## age female male
## 1 13 148.0 55.0
## 2 14 224.0 92.5
## 3 15 276.0 106.5
## 4 16 258.5 136.0
## 5 17 245.5 125.0
## 6 18 243.0 122.0
Ratio Plot
Notes:
ggplot(aes(x=age,y=female/male),
data=pf.fc_by_age_gender.wide)+
geom_line()+
geom_hline(yintercept=1,alpha=.3,linetype=2)
Third Quantitative Variable
Notes:
pf$year_joined <- floor(2014 -pf$tenure/365)
summary(pf$year_joined)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2005 2012 2012 2012 2013 2014 2
table(pf$year_joined)
##
## 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014
## 9 15 581 1507 4557 5448 9860 33366 43588 70
Cut a Variable
Notes:
pf$year_joined.bucket <- cut(pf$year_joined,
c(2004,2009,2011,2012,2014))
Plotting it All Together
Notes:
ggplot(aes(x = age, y = friend_count),
data = subset(pf, !is.na(gender)&!is.na(year_joined.bucket))) +
geom_line(aes(color=year_joined.bucket),stat='summary',fun.y=median)
Plot the Grand Mean
Notes:
ggplot(aes(x = age, y = friend_count),
data = subset(pf, !is.na(gender)&!is.na(year_joined.bucket))) +
geom_line(aes(color=year_joined.bucket),stat='summary',fun.y=mean)+
geom_line(linetype=2,stat='summary',fun.y=mean)
Friending Rate
Notes:
with(subset(pf,tenure>=1),summary(friend_count/tenure))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0775 0.2205 0.6096 0.5658 417.0000
Friendships Initiated
Notes:
What is the median friend rate?
What is the maximum friend rate?
ggplot(aes(y=friendships_initiated/tenure,
x=tenure),
data=subset(pf,tenure>=1))+
geom_line(aes(color=year_joined.bucket))
Bias-Variance Tradeoff Revisited
Notes:
q1 <- ggplot(aes(x = tenure, y = friendships_initiated / tenure),
data = subset(pf, tenure >= 1)) +
geom_line(aes(color = year_joined.bucket),
stat = 'summary',
fun.y = mean)+
geom_smooth()
q2<- ggplot(aes(x = 7 * round(tenure / 7), y = friendships_initiated / tenure),
data = subset(pf, tenure > 0)) +
geom_line(aes(color = year_joined.bucket),
stat = "summary",
fun.y = mean)
q3<- ggplot(aes(x = 30 * round(tenure / 30), y = friendships_initiated / tenure),
data = subset(pf, tenure > 0)) +
geom_line(aes(color = year_joined.bucket),
stat = "summary",
fun.y = mean)
q4<- ggplot(aes(x = 90 * round(tenure / 90), y = friendships_initiated / tenure),
data = subset(pf, tenure > 0)) +
geom_line(aes(color = year_joined.bucket),
stat = "summary",
fun.y = mean)
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
grid.arrange(q1,q2,q3,q4,ncol=1)
## `geom_smooth()` using method = 'gam'
Sean’s NFL Fan Sentiment Study
Notes:
Introducing the Yogurt Data Set
Notes:
Histograms Revisited
Notes:
yo<-read.csv('yogurt.csv')
str(yo)
## 'data.frame': 2380 obs. of 9 variables:
## $ obs : int 1 2 3 4 5 6 7 8 9 10 ...
## $ id : int 2100081 2100081 2100081 2100081 2100081 2100081 2100081 2100081 2100081 2100081 ...
## $ time : int 9678 9697 9825 9999 10015 10029 10036 10042 10083 10091 ...
## $ strawberry : int 0 0 0 0 1 1 0 0 0 0 ...
## $ blueberry : int 0 0 0 0 0 0 0 0 0 0 ...
## $ pina.colada: int 0 0 0 0 1 2 0 0 0 0 ...
## $ plain : int 0 0 0 0 0 0 0 0 0 0 ...
## $ mixed.berry: int 1 1 1 1 1 1 1 1 1 1 ...
## $ price : num 59 59 65 65 49 ...
yo$id<-factor(yo$id)
qplot(x=price,data=yo,fill=I('#F79420'))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Number of Purchases
Notes:
summary(yo)
## obs id time strawberry
## Min. : 1.0 2132290: 74 Min. : 9662 Min. : 0.0000
## 1st Qu.: 696.5 2130583: 59 1st Qu.: 9843 1st Qu.: 0.0000
## Median :1369.5 2124073: 50 Median :10045 Median : 0.0000
## Mean :1367.8 2149500: 50 Mean :10050 Mean : 0.6492
## 3rd Qu.:2044.2 2101790: 47 3rd Qu.:10255 3rd Qu.: 1.0000
## Max. :2743.0 2129528: 39 Max. :10459 Max. :11.0000
## (Other):2061
## blueberry pina.colada plain mixed.berry
## Min. : 0.0000 Min. : 0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.: 0.0000 1st Qu.: 0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median : 0.0000 Median : 0.0000 Median :0.0000 Median :0.0000
## Mean : 0.3571 Mean : 0.3584 Mean :0.2176 Mean :0.3887
## 3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :12.0000 Max. :10.0000 Max. :6.0000 Max. :8.0000
##
## price
## Min. :20.00
## 1st Qu.:50.00
## Median :65.04
## Mean :59.25
## 3rd Qu.:68.96
## Max. :68.96
##
length(unique(yo$price))
## [1] 20
table(yo$price)
##
## 20 24.96 33.04 33.2 33.28 33.36 33.52 39.04 44 45.04 48.96 49.52
## 2 11 54 1 1 22 1 234 21 11 81 1
## 49.6 50 55.04 58.96 62 63.04 65.04 68.96
## 1 205 6 303 15 2 799 609
yo <- transform(yo,all.purchases = strawberry+blueberry+pina.colada+plain+mixed.berry)
summary(yo)
## obs id time strawberry
## Min. : 1.0 2132290: 74 Min. : 9662 Min. : 0.0000
## 1st Qu.: 696.5 2130583: 59 1st Qu.: 9843 1st Qu.: 0.0000
## Median :1369.5 2124073: 50 Median :10045 Median : 0.0000
## Mean :1367.8 2149500: 50 Mean :10050 Mean : 0.6492
## 3rd Qu.:2044.2 2101790: 47 3rd Qu.:10255 3rd Qu.: 1.0000
## Max. :2743.0 2129528: 39 Max. :10459 Max. :11.0000
## (Other):2061
## blueberry pina.colada plain mixed.berry
## Min. : 0.0000 Min. : 0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.: 0.0000 1st Qu.: 0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median : 0.0000 Median : 0.0000 Median :0.0000 Median :0.0000
## Mean : 0.3571 Mean : 0.3584 Mean :0.2176 Mean :0.3887
## 3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :12.0000 Max. :10.0000 Max. :6.0000 Max. :8.0000
##
## price all.purchases
## Min. :20.00 Min. : 1.000
## 1st Qu.:50.00 1st Qu.: 1.000
## Median :65.04 Median : 2.000
## Mean :59.25 Mean : 1.971
## 3rd Qu.:68.96 3rd Qu.: 2.000
## Max. :68.96 Max. :21.000
##
Prices over Time
Notes:
ggplot(aes(x=time,y=price),data=yo)+
geom_jitter(alpha=1/5)
Sampling Observations
Notes:
Looking at Samples of Households
set.seed(4230)
sample.ids <- sample(levels(yo$id),16)
ggplot(aes(x=time,y=price),data=subset(yo,id %in% sample.ids))+
facet_wrap(~ id) +
geom_line()+
geom_point(aes(size=all.purchases),pch=1)
The Limits of Cross Sectional Data
Notes:
Many Variables
Notes:
Scatterplot Matrix
Notes:
Even More Variables
Notes:
Heat Maps
Notes:
nci <- read.table("nci.tsv")
colnames(nci) <- c(1:64)
nci.long.samp <- melt(as.matrix(nci[1:200,]))
names(nci.long.samp) <- c("gene", "case", "value")
head(nci.long.samp)
## gene case value
## 1 1 1 0.300
## 2 2 1 1.180
## 3 3 1 0.550
## 4 4 1 1.140
## 5 5 1 -0.265
## 6 6 1 -0.070
ggplot(aes(y = gene, x = case, fill = value),
data = nci.long.samp) +
geom_tile() +
scale_fill_gradientn(colours = colorRampPalette(c("blue", "red"))(100))
Analyzing Three of More Variables
Reflection:
Click
KnitHTML
to see all of your hard work and to have an html page of this lesson, your answers, and your notes!