Vintage Analysis or Retention Analysis with the R Package creditmodel

1 What is vintage analysis?

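Vintage analysis (account-age analysis) is widely used in the credit card and lending industries. The concept comes from wine: wines from different years differ in quality, and likewise assets opened or funded in different periods differ in quality. The core idea is to track each batch of assets from each period separately and compare the batches at the same account age, which shows the quality of the loans or credit cards issued in each period.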
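
The idea of comparing batches at the same account age can be sketched in base R before turning to any package. The data below is made up for illustration, and the column names (cohort, bad_mob) and the helper cum_bad_rate are hypothetical; this is not how creditmodel computes its tables.

```r
# Toy loan-level data (made up for illustration only).
# cohort:  origination month of the loan
# bad_mob: month on book at which the loan first became 30+ days overdue
#          (NA = never observed as bad)
loans <- data.frame(
  loan_id = 1:8,
  cohort  = c("2016-08", "2016-08", "2016-08", "2016-08",
              "2016-09", "2016-09", "2016-09", "2016-09"),
  bad_mob = c(2, NA, 3, NA, 1, NA, NA, NA)
)

# Cumulative bad rate of one cohort at each month on book (MOB).
cum_bad_rate <- function(bad_mob, mobs) {
  sapply(mobs, function(m) mean(!is.na(bad_mob) & bad_mob <= m))
}

mobs <- 0:3
# One row per cohort, one column per MOB, so cohorts line up by account age.
vintage <- t(sapply(split(loans$bad_mob, loans$cohort), cum_bad_rate, mobs = mobs))
colnames(vintage) <- paste0("MOB", mobs)
print(vintage)
```

Reading down a column compares cohorts at the same account age, which is the comparison a vintage chart makes visible.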

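In a broader sense, vintage analysis is a form of cohort analysis, similar to longitudinal surveys in social research, cohort techniques in demography, and retention analysis in internet operations, so the concepts are not repeated here. Let's go straight to the topic: how to do vintage analysis with the R package creditmodel.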

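2 Overview of the cohort analysis module in creditmodel

creditmodel is a powerful R data science toolkit developed by Hansen (汉森老师). It has five functional modules: data preprocessing, feature derivation, data analysis, data visualization, and automated modeling. The vintage analysis covered here is a submodule of the data analysis module and consists of four main functions: cohort_analysis, cohort_table, cohort_table_plot, and cohort_plot.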

3 Reference for the cohort analysis functions

Description

cohort_analysis is for cohort (vintage) analysis.

Usage

cohort_analysis(dat, obs_id = NULL, occur_time = NULL, MOB = NULL,
                period = "monthly", status = NULL, amount = NULL, by_out = "cnt",
                start_date = NULL, end_date = NULL, dead_status = 30)

cohort_table(dat, obs_id = NULL, occur_time = NULL, MOB = NULL,
             period = "monthly", status = NULL, amount = NULL, by_out = "cnt",
             start_date = NULL, end_date = NULL, dead_status = 30)

Arguments

dat: A data.frame containing id, occur_time, mob, status, ...

obs_id: The name of the ID of observations or key variable of the data. Default is NULL.

occur_time: The name of the variable that represents the time at which each observation takes place.

MOB: Month on book (the help text renders this as "Mobility of book").

period: Period of events to analyze. Default is "monthly".

status: Status of observations.

amount: The name of the variable representing amount. Default is NULL.

by_out: Output: amount ("amt") or count ("cnt").

start_date: The earliest occurrence time of observations.

end_date: The latest occurrence time of observations.

dead_status: Status of dead observations.
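
To make the difference between the two by_out settings concrete, here is a minimal base-R sketch on made-up data. It assumes that an observation counts as dead when its status exceeds dead_status; the exact rule creditmodel applies internally is not confirmed here, and the variable names are hypothetical.

```r
# Toy data: four loans, one of which is more than 30 days overdue.
toy <- data.frame(
  loan_amount      = c(1000, 2000, 3000, 4000),
  max_overdue_days = c(0, 45, 0, 10)
)

dead_status <- 30
# Assumption: "dead" means status strictly greater than dead_status.
is_dead <- toy$max_overdue_days > dead_status

# Count-based rate (the analogue of by_out = "cnt").
rate_cnt <- sum(is_dead) / nrow(toy)

# Amount-based rate (the analogue of by_out = "amt").
rate_amt <- sum(toy$loan_amount[is_dead]) / sum(toy$loan_amount)

print(c(cnt = rate_cnt, amt = rate_amt))
```

The two rates differ whenever bad loans are larger or smaller than average, which is why the amount column must be supplied for amount-based output.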

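4 Steps of a vintage analysis

4.1 Data preparation

For a vintage analysis, the input data needs at least four columns: loan ID (loan_id), loan time (loan_time), loan amount (loan_amount), and account status (max_overdue_days or age_overdue_days).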

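# Install and load the creditmodel package
# install.packages("creditmodel")
library(creditmodel)

# Read the data with read_data.
vin_dat = read_data("vin_dat.csv")

# Clean the data with the main function of creditmodel's data cleansing module.
# That module will be covered in detail in a later post; the comments below
# briefly describe each argument.
vin_dat = data_cleansing(vin_dat,
                         obs_id = "loan_id",       # primary key
                         occur_time = 'loan_time', # time at which the event occurs
                         outlier_proc = FALSE,     # no outlier processing
                         missing_proc = FALSE,     # no missing-value processing
                         remove_dup = FALSE,       # do not remove duplicated observations
                         merge_cat = FALSE,        # do not merge categories of categorical variables
                         low_var = 0.9999,         # drop variables whose single-value share exceeds 0.9999
                         missing_rate = 0.9999     # binarize variables whose missing rate exceeds 0.9999
)

# data_exploration from creditmodel gives an overview of the data
data_exploration(vin_dat)

# Use plot_table to draw the summary of the numeric variables
plot_table(data_exploration(vin_dat)$num)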
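
The file vin_dat.csv is not distributed with this article. As a stand-in for experimenting, a toy data set with the four required columns can be simulated in base R. Everything here is an assumption made for illustration: the name vin_dat_toy, the value ranges, and the one-row-per-loan layout, which may not match the layout of the author's file.

```r
set.seed(1)
n <- 200

# Simulated loans (illustrative values only, one row per loan).
vin_dat_toy <- data.frame(
  loan_id = sprintf("L%04d", seq_len(n)),
  loan_time = sample(seq(as.Date("2016-08-01"), as.Date("2017-05-31"), by = "day"),
                     n, replace = TRUE),
  loan_amount = round(runif(n, 1000, 50000)),
  max_overdue_days = sample(c(0, 15, 45, 90), n, replace = TRUE,
                            prob = c(0.8, 0.1, 0.06, 0.04)),
  stringsAsFactors = FALSE
)

# Check that the columns named in the text are all present.
required <- c("loan_id", "loan_time", "loan_amount", "max_overdue_days")
missing_cols <- setdiff(required, names(vin_dat_toy))
if (length(missing_cols) > 0) {
  stop("missing columns: ", paste(missing_cols, collapse = ", "))
}
str(vin_dat_toy)
```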

4.2 Vintage analysis

4.2.1 Building the cohort_dat table

Use the cohort_analysis function to build the cohort_dat table.

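cohort_dat = cohort_analysis(vin_dat,
                             obs_id = 'loan_id',          # loan ID
                             occur_time = 'loan_time',    # loan time
                             MOB = NULL,                  # month on book; you can supply your own variable, calendar months are used by default
                             period = 'monthly',          # group loans by month; 'weekly' is also possible
                             status = "age_overdue_days", # overdue days at the end of each age; a user-defined 0/1 variable also works
                             dead_status = 30,            # more than 30 overdue days counts as dead; for a 0/1 variable set this to 0
                             amount = "loan_amount",      # required for amount-based output; loan amount here, balance also works
                             by_out = 'amt',              # 'amt' for amount-based output, 'cnt' for count-based output
                             start_date = "2016-08-01",   # start of the observation window
                             end_date = '2017-05-31'      # end of the observation window
)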

The resulting table has the following structure:

Group     Age  Total  Events  Opening  final_Events  Current_rate  Events_rate  Retention_rate
2016/8/1  0    647    6       647      101           0.1561        0.0093       1
2016/8/1  1    646    21      647      101           0.1561        0.0325       0.9985
2016/8/1  2    645    21      647      101           0.1561        0.0325       0.9969
2016/8/1  3    642    25      647      101           0.1561        0.0386       0.9923
2016/8/1  4    638    33      647      101           0.1561        0.051        0.9861
2016/8/1  5    630    47      647      101           0.1561        0.0726       0.9737
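
The rate columns in the table above appear to be simple ratios against the Opening value: Events_rate = Events / Opening, Retention_rate = Total / Opening, and Current_rate = final_Events / Opening. This reading is inferred from the printed numbers, not taken from the package source. The sketch below recomputes the three columns from the counts shown for the 2016/8/1 cohort.

```r
# Counts copied from the six rows shown above (cohort 2016/8/1).
cohort_rows <- data.frame(
  Age          = 0:5,
  Total        = c(647, 646, 645, 642, 638, 630),
  Events       = c(6, 21, 21, 25, 33, 47),
  Opening      = 647,
  final_Events = 101
)

# Assumed definitions, inferred from the printed values.
cohort_rows$Current_rate   <- round(cohort_rows$final_Events / cohort_rows$Opening, 4)
cohort_rows$Events_rate    <- round(cohort_rows$Events / cohort_rows$Opening, 4)
cohort_rows$Retention_rate <- round(cohort_rows$Total / cohort_rows$Opening, 4)

print(cohort_rows)
```

Rounded to four decimals, the recomputed values agree with the printed table for these rows.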

4.2.2 Plotting the vintage chart

Plotting the vintage chart is simple: call cohort_plot on the cohort_dat computed in the previous step.

cohort_plot(cohort_dat)

4.2.3 Vintage table

Use the cohort_table function to get the vintage table. It takes exactly the same arguments as cohort_analysis.

vin_table = cohort_table(vin_dat, obs_id = 'loan_id', occur_time = 'loan_time', MOB = NULL,
                         period = 'monthly', status = "max_overdue_days",
                         dead_status = 30, amount = "loan_balance", by_out = 'amt',
                         start_date = "2016-09-01", end_date = '2017-07-31')

The resulting table is shown below:

Cohort_Group  1   2      3      4      5      6      7      8      9      10     11      12
2016/9/1      0%  0.48%  1.19%  2.09%  2.78%  4.11%  4.46%  5.45%  6.06%  7.67%  8.43%   9.47%
2016/10/1     0%  0.34%  1.37%  2.61%  4.03%  4.50%  6.27%  7.04%  8.31%  9.19%  10.70%
2016/11/1     0%  0.47%  1.39%  2.44%  3.44%  4.96%  5.97%  6.91%  7.93%  9.23%
2016/12/1     0%  0.39%  1.37%  2.07%  3.25%  4.29%  4.60%  6.18%  7.39%
2017/1/1      0%  0.32%  0.82%  1.83%  2.66%  3.21%  4.71%  5.31%
2017/2/1      0%  0.35%  1.25%  2.49%  2.97%  5.09%  7.03%
2017/3/1      0%  0.60%  1.21%  1.63%  4.07%  5.55%
2017/4/1      0%  0.16%  0.83%  2.81%  4.52%
2017/5/1      0%  0.29%  1.26%  1.87%
2017/6/1      0%  0.47%  1.27%
2017/7/1      0%  0.32%
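
A vintage table is read column by column: each column holds the cohorts at the same period, so their values are directly comparable. As a small illustration, the sketch below takes the period-5 column from the table above and finds the cohorts with the highest and lowest rate. The variable names are hypothetical, and since the article's data is simulated, the result has no business meaning.

```r
# Period-5 rates (in percent) copied from the vintage table above.
# Cohorts from 2017/5/1 onward have not reached period 5 yet.
period5 <- c("2016/9/1"  = 2.78, "2016/10/1" = 4.03, "2016/11/1" = 3.44,
             "2016/12/1" = 3.25, "2017/1/1"  = 2.66, "2017/2/1"  = 2.97,
             "2017/3/1"  = 4.07, "2017/4/1"  = 4.52)

worst <- names(period5)[which.max(period5)]  # cohort with the highest rate
best  <- names(period5)[which.min(period5)]  # cohort with the lowest rate

cat("Highest period-5 rate:", worst, "\n")
cat("Lowest period-5 rate: ", best, "\n")
```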

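4.2.4 Plotting the vintage table

How can the vintage table be drawn elegantly? It should take a single step, cohort_table_plot(cohort_dat), but because of an oversight by Hansen, that function has some bugs in creditmodel 1.1.8, the latest version on CRAN, and cannot draw the table in one step. The bug-fixed source is therefore posted here; load this function before drawing the vintage table.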

#' cohort_table_plot
#'
#' \code{cohort_table_plot} is for plotting cohort (vintage) analysis table.
#' @param cohort_dat A data.frame generated by \code{cohort_analysis}.
#' @import ggplot2
#' @export
cohort_table_plot = function(cohort_dat) {
  # set global options and restore them when the function exits
  opt = options('warn' = -1, scipen = 200, stringsAsFactors = FALSE, digits = 6)
  on.exit(options(opt))
  cohort_dat[is.na(cohort_dat)] = 0
  # declare the column names used inside aes()
  Cohort_Group = Cohort_Period = Events_rate = NULL
  # plot: one tile per cohort and period, coloured by event rate
  p = ggplot(cohort_dat, aes(reorder(paste0(Cohort_Period), Cohort_Period),
                             Cohort_Group, fill = Events_rate)) +
    geom_tile(colour = 'white') +
    geom_text(aes(label = as_percent(Events_rate, 4)), size = 3) +
    scale_fill_gradient2(limits = c(0, max(cohort_dat$Events_rate)),
                         low = love_color('deep_red'), mid = 'white',
                         high = love_color(),
                         midpoint = median(cohort_dat$Events_rate, na.rm = TRUE),
                         na.value = love_color('pale_grey')) +
    scale_y_discrete(limits = rev(unique(cohort_dat$Cohort_Group))) +
    scale_x_discrete(position = "top") +
    labs(x = "Cohort_Period", title = "Cohort Analysis") +
    theme(text = element_text(size = 15), rect = element_blank()) +
    plot_theme(legend.position = 'right', angle = 0)
  return(p)
}
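
A function that changes global options usually restores them with on.exit, because a statement placed after return() is never reached. The following is a minimal base-R sketch of that pattern; the function name with_scipen is hypothetical and is not part of creditmodel.

```r
# Format a number with scientific notation suppressed, then restore options.
with_scipen <- function(x) {
  old <- options(scipen = 200)  # options() returns the previous values
  on.exit(options(old))         # runs when the function exits, even on error
  format(x)
}

before <- getOption("scipen")
out    <- with_scipen(1e10)
after  <- getOption("scipen")

print(out)
print(identical(before, after))  # the global option is unchanged afterwards
```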

The data visualization module of creditmodel relies on ggplot2 for plotting, so don't forget to load ggplot2 before drawing.

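library(ggplot2)

vin_table = cohort_table(vin_dat, obs_id = 'loan_id', occur_time = 'loan_time', MOB = NULL,
                         period = 'monthly', status = "max_overdue_days",
                         dead_status = 30, amount = "loan_balance", by_out = 'amt',
                         start_date = "2016-09-01", end_date = '2017-07-31')

# cohort_table_plot takes the output of cohort_analysis (cohort_dat), not vin_table
cohort_table_plot(cohort_dat)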

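5 Summary

The R package creditmodel is a powerful data science toolkit that combines feature derivation, data preprocessing, data analysis, modeling, and data visualization. For more in-depth use of the package, follow Hansen's WeChat official account hansenmode.

If you found this article useful, please like or share it to encourage Hansen to produce more work.

Note that the data used in the analysis above is simulated and has no practical reference value.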
