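1. Date and Time Data Types and Tools
The Python standard library includes data types for dates and times, as well as calendar-related functionality. The datetime, time, and calendar modules are the main ones used here, and datetime.datetime is the most commonly used type.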
from datetime import datetime
now = datetime.now()
now
#datetime.datetime(2017, 10, 9, 18, 17, 27, 413058)
now.year,now.month,now.day
#(2017, 10, 9)
datetime stores the date and time down to the microsecond. datetime.timedelta represents the difference between two datetime objects.
delta = datetime(2011,1,7) - datetime(2008,6,24,8,15)
delta.days,delta.seconds
#(926, 56700)
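As a small sketch of how the two fields above fit together: days and seconds are stored separately, and total_seconds() folds them into one number. The variable name total is only for illustration.

```python
from datetime import datetime, timedelta

delta = datetime(2011, 1, 7) - datetime(2008, 6, 24, 8, 15)

# days and seconds are separate fields; seconds only covers the part of a day
print(delta.days, delta.seconds)   # 926 56700

# total_seconds() combines both fields into a single float
total = delta.total_seconds()
print(total)                       # 80063100.0

# the smallest step a timedelta can represent is one microsecond
print(timedelta.resolution)        # 0:00:00.000001
```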
You can add a timedelta (or a multiple of one) to a datetime object, or subtract it, which yields a new object:
from datetime import timedelta
start = datetime(2011,1,7)
start - 2 * timedelta(12)
#datetime.datetime(2010, 12, 14, 0, 0)
datetime objects and pandas Timestamp objects can be formatted as strings using str or the strftime method:
stamp = datetime(2011,1,3)
str(stamp)
#'2011-01-03 00:00:00'
stamp.strftime('%Y-%m-%d')
#'2011-01-03'
datetime.strptime uses the same format codes to convert strings to dates:
value = '2011-01-03'
datetime.strptime(value,'%Y-%m-%d')
#datetime.datetime(2011, 1, 3, 0, 0)
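A minimal sketch of applying strptime to several strings at once, and of what happens when a string does not match the format. The helper try_parse is hypothetical and only here for illustration; strptime itself raises ValueError on a mismatch.

```python
from datetime import datetime

datestrs = ['7/6/2011', '8/6/2011']

# parse every string with the same known format (month/day/year)
parsed = [datetime.strptime(x, '%m/%d/%Y') for x in datestrs]
print(parsed)

def try_parse(text, fmt='%Y-%m-%d'):
    """Hypothetical helper: return None instead of raising on a bad string."""
    try:
        return datetime.strptime(text, fmt)
    except ValueError:
        return None

print(try_parse('2011-01-03'))   # 2011-01-03 00:00:00
print(try_parse('03/01/2011'))   # None, the string does not match the format
```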
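datetime.strptime is the best way to parse a date with a known format. However, writing a format specification every time is tedious, especially for common date formats. In that case you can use the parser.parse method from the third-party dateutil package, which can parse almost any human-readable date representation: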
from dateutil.parser import parse
parse('2011-01-03')
#datetime.datetime(2011, 1, 3, 0, 0)
parse('Jan 31, 1997 10:45 PM')
#datetime.datetime(1997, 1, 31, 22, 45)
In many international formats the day comes before the month; pass dayfirst=True to handle this:
parse('6/12/2011',dayfirst=True)
#datetime.datetime(2011, 12, 6, 0, 0)
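To make the effect of dayfirst concrete, here is a short sketch that parses the same string both ways. By default dateutil reads an ambiguous date like this one as month first.

```python
from datetime import datetime
from dateutil.parser import parse

# default: month first, so this is June 12
month_first = parse('6/12/2011')

# dayfirst=True: day first, so this is December 6
day_first = parse('6/12/2011', dayfirst=True)

print(month_first)   # 2011-06-12 00:00:00
print(day_first)     # 2011-12-06 00:00:00
```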
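pandas is generally oriented toward working with arrays of dates, whether they serve as an axis index or as a column in a DataFrame. The to_datetime method parses many different kinds of date representations.
import pandas as pd
datestrs = ['7/6/2011','8/6/2011']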
pd.to_datetime(datestrs)
#DatetimeIndex(['2011-07-06', '2011-08-06'], dtype='datetime64[ns]', freq=None)
to_datetime also handles missing values; NaT (Not a Time) is pandas's NA value for timestamp data:
pd.to_datetime(datestrs+[None])
#DatetimeIndex(['2011-07-06', '2011-08-06', 'NaT'], dtype='datetime64[ns]', freq=None)
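A short sketch of how NaT can be detected afterwards, and of asking to_datetime to turn unparseable strings into NaT with errors='coerce'. The sample string 'not a date' is made up for illustration.

```python
import pandas as pd

datestrs = ['7/6/2011', '8/6/2011']
idx = pd.to_datetime(datestrs + [None])

# NaT is detected by the usual missing-value helpers
mask = pd.isna(idx)
print(mask)                  # [False False  True]
print(idx[2] is pd.NaT)      # True

# errors='coerce' converts strings that cannot be parsed into NaT
coerced = pd.to_datetime(pd.Series(['2011-07-06', 'not a date']), errors='coerce')
print(coerced.isna().tolist())   # [False, True]
```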
2. Time Series Basics
The most basic kind of time series object in pandas is a Series indexed by timestamps:
from datetime import datetime
import numpy as np
dates = [datetime(2011,1,2),datetime(2011,1,5),datetime(2011,1,7),datetime(2011,1,8),datetime(2011,1,10),datetime(2011,1,12)]
ts = pd.Series(np.random.randn(6),index=dates)
ts
# Output
2011-01-02 -0.881964
2011-01-05 -0.554943
2011-01-07 -1.111905
2011-01-08 -0.941412
2011-01-10 -2.492096
2011-01-12 -1.871858
dtype: float64
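The index of this Series is not an ordinary index but a DatetimeIndex. (Older pandas versions exposed such a Series as a TimeSeries subclass; in current versions it is a plain Series with a DatetimeIndex.) As the output shows, pandas stores the timestamps using NumPy's datetime64 data type at nanosecond resolution.
ts.index
# Output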
DatetimeIndex(['2011-01-02', '2011-01-05', '2011-01-07', '2011-01-08',
'2011-01-10', '2011-01-12'],
dtype='datetime64[ns]', freq=None)
As with any other Series, arithmetic between differently indexed time series automatically aligns on the dates:
ts + ts[::2]
# Output
2011-01-02 -1.763929
2011-01-05 NaN
2011-01-07 -2.223810
2011-01-08 NaN
2011-01-10 -4.984192
2011-01-12 NaN
dtype: float64
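Since ts above holds random values, here is the same alignment behaviour as a sketch with fixed numbers, so the result can be checked by hand. Series.add with fill_value is shown as one way to avoid the NaN entries; whether that is appropriate depends on the data.

```python
from datetime import datetime
import pandas as pd

dates = [datetime(2011, 1, 2), datetime(2011, 1, 5), datetime(2011, 1, 7)]
s = pd.Series([1.0, 2.0, 3.0], index=dates)

# s[::2] keeps only the 1st and 3rd dates; the missing date becomes NaN
total = s + s[::2]
print(total)

# treat a missing operand as 0 instead of producing NaN
filled = s.add(s[::2], fill_value=0)
print(filled.tolist())   # [2.0, 2.0, 6.0]
```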
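The scalar values in a DatetimeIndex are pandas Timestamp objects.
A time series is a Series (in older versions, TimeSeries was a subclass of Series), so indexing and data selection behave the same way. In addition, you can index with a string that can be interpreted as a date: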
ts['1/10/2011']
#-2.4920958699660636
ts['20110110']
#-2.4920958699660636
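The following sketch uses fixed values to show that different spellings of the same date select the same element, and that index entries are Timestamp objects. It uses .loc for label-based access; the plain ts[...] form in the text is the same kind of lookup.

```python
import pandas as pd

idx = pd.to_datetime(['2011-01-08', '2011-01-10', '2011-01-12'])
s = pd.Series([10, 20, 30], index=idx)

# each element of the index is a pandas Timestamp
first_label = s.index[0]
print(type(first_label))

# three ways to name January 10, 2011
a = s.loc['1/10/2011']
b = s.loc['20110110']
c = s.loc[pd.Timestamp('2011-01-10')]
print(a, b, c)   # 20 20 20
```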
For longer time series, passing just a year or a year and month is enough to select a slice of the data:
longer_ts = pd.Series(np.random.randn(1000),index=pd.date_range('1/1/2000',periods=1000))
longer_ts['2001']
# Output
2001-01-01 0.430658
2001-01-02 0.238326
2001-01-03 0.742078
2001-01-04 2.026365
2001-01-05 1.119718
2001-01-06 0.051642
2001-01-07 -0.948585
2001-01-08 0.088678
2001-01-09 -0.093978
2001-01-10 -0.452213
2001-01-11 0.194490
2001-01-12 -0.791522
2001-01-13 0.994300
2001-01-14 -0.466681
2001-01-15 -0.104991
2001-01-16 0.398028
2001-01-17 -0.174210
2001-01-18 0.061167
2001-01-19 0.338023
2001-01-20 0.786561
2001-01-21 0.433504
2001-01-22 -0.097737
2001-01-23 1.651351
2001-01-24 -1.620065
2001-01-25 -1.369003
2001-01-26 -0.789305
2001-01-27 -0.770117
2001-01-28 -1.190250
2001-01-29 -0.457968
2001-01-30 1.594643
...
2001-12-02 0.022856
2001-12-03 -1.074076
2001-12-04 -0.342918
2001-12-05 0.736527
2001-12-06 0.192286
2001-12-07 0.020938
2001-12-08 1.494041
2001-12-09 0.848802
2001-12-10 0.023913
2001-12-11 0.164936
2001-12-12 0.427615
2001-12-13 -0.067649
2001-12-14 0.779254
2001-12-15 -0.753810
2001-12-16 0.950142
2001-12-17 1.494037
2001-12-18 0.134798
2001-12-19 -0.019051
2001-12-20 1.171783
2001-12-21 0.253665
2001-12-22 0.634205
2001-12-23 0.372734
2001-12-24 -0.382349
2001-12-25 0.023428
2001-12-26 0.273047
2001-12-27 -1.312320
2001-12-28 -0.431074
2001-12-29 -1.501706
2001-12-30 1.185465
2001-12-31 -0.452883
Freq: D, Length: 365, dtype: float64
longer_ts['2001-05']
# Output
2001-05-01 -0.903594
2001-05-02 -0.549671
2001-05-03 1.196419
2001-05-04 -0.965646
2001-05-05 -1.193606
2001-05-06 -0.762428
2001-05-07 0.216929
2001-05-08 -1.177503
2001-05-09 0.282163
2001-05-10 -0.938378
2001-05-11 0.200773
2001-05-12 0.723701
2001-05-13 -1.172896
2001-05-14 1.504694
2001-05-15 0.355133
2001-05-16 0.049116
2001-05-17 0.218060
2001-05-18 -0.513406
2001-05-19 -0.791606
2001-05-20 -1.703427
2001-05-21 -1.012035
2001-05-22 1.206804
2001-05-23 -0.345615
2001-05-24 1.813632
2001-05-25 -0.731229
2001-05-26 2.079715
2001-05-27 -1.140633
2001-05-28 1.356075
2001-05-29 1.644058
2001-05-30 -1.785124
2001-05-31 1.773346
Freq: D, dtype: float64
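A compact way to check what a partial date string selects is to count the rows. This sketch rebuilds a series like longer_ts with integer values (an assumption made so the result is reproducible) and uses .loc.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(1000), index=pd.date_range('1/1/2000', periods=1000))

year_2001 = s.loc['2001']      # every day in 2001
may_2001 = s.loc['2001-05']    # every day in May 2001

print(len(year_2001), len(may_2001))               # 365 31
print(year_2001.index[0], year_2001.index[-1])

# 2000 is a leap year (366 days), so 2001 starts at position 366
print(year_2001.iloc[0])                           # 366
```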
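Slicing by date works as it does on an ordinary Series. Because the data is ordered chronologically, the slice endpoints do not have to be present in the index: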
ts[datetime(2011,1,7):]
# Output
2011-01-07 -1.111905
2011-01-08 -0.941412
2011-01-10 -2.492096
2011-01-12 -1.871858
dtype: float64
ts['1/6/2011':'1/11/2011']
# Output
2011-01-07 -1.111905
2011-01-08 -0.941412
2011-01-10 -2.492096
dtype: float64
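The range query above can be reproduced with fixed values. In this sketch neither endpoint exists in the index, and both ends of the slice are inclusive.

```python
import pandas as pd

idx = pd.to_datetime(['2011-01-02', '2011-01-05', '2011-01-07',
                      '2011-01-08', '2011-01-10', '2011-01-12'])
s = pd.Series(range(6), index=idx)

# January 6 and January 11 are not in the index; pandas selects
# every row whose timestamp falls between them
window = s['1/6/2011':'1/11/2011']
print(window)
```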
An equivalent instance method, truncate, also slices a time series between two dates (here only the end date is given):
ts.truncate(after='1/9/2011')
# Output
2011-01-02 -0.881964
2011-01-05 -0.554943
2011-01-07 -1.111905
2011-01-08 -0.941412
dtype: float64
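truncate also accepts a before argument, so both ends can be cut in one call. A small sketch with fixed values, compared against the equivalent slice:

```python
import pandas as pd

idx = pd.to_datetime(['2011-01-02', '2011-01-05', '2011-01-07',
                      '2011-01-08', '2011-01-10', '2011-01-12'])
s = pd.Series(range(6), index=idx)

# drop everything before January 6 and everything after January 9
cut = s.truncate(before='1/6/2011', after='1/9/2011')
print(cut)

# the same rows, selected with a slice
same = s['1/6/2011':'1/9/2011']
print(cut.equals(same))   # True
```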
The same rules apply to a DataFrame, indexing on its rows:
dates = pd.date_range('1/1/2000',periods=100,freq='W-WED')
long_df = pd.DataFrame(np.random.randn(100,4),index=dates,columns=['Colorado','Texas','New York','Ohio'])
long_df.loc['2001-5']
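To see what the DataFrame selection returns, this sketch rebuilds a frame like long_df with integer values (an assumption, so the output is reproducible) and inspects the rows for May 2001. With a weekly Wednesday frequency there are five of them.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('1/1/2000', periods=100, freq='W-WED')
df = pd.DataFrame(np.arange(400).reshape(100, 4), index=dates,
                  columns=['Colorado', 'Texas', 'New York', 'Ohio'])

# a year-month string selects all rows in that month
may = df.loc['2001-5']
print(may.shape)                 # (5, 4)
print(may.index.day.tolist())    # [2, 9, 16, 23, 30]
```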
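Time series in pandas are generally assumed to be irregular, that is, they have no fixed frequency. For many applications this is fine, but it is often desirable to work at a relatively fixed frequency such as daily, monthly, or every 15 minutes. pandas has a full suite of standard time series frequencies, along with tools for resampling, inferring frequencies, and generating fixed-frequency date ranges.
For example, the time series above can be converted to a fixed daily frequency by calling resample. resample returns a DatetimeIndexResampler object; call asfreq() on it to obtain the values: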
ts1 = ts.resample('D').asfreq()
ts1
# Output
2011-01-02 -0.881964
2011-01-03 NaN
2011-01-04 NaN
2011-01-05 -0.554943
2011-01-06 NaN
2011-01-07 -1.111905
2011-01-08 -0.941412
2011-01-09 NaN
2011-01-10 -2.492096
2011-01-11 NaN
2011-01-12 -1.871858
Freq: D, dtype: float64
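asfreq() leaves the newly created days as NaN. As a sketch with fixed values, the same resampler can instead forward-fill the gaps with ffill(); whether filling is appropriate depends on what the data means.

```python
import pandas as pd

idx = pd.to_datetime(['2011-01-02', '2011-01-05', '2011-01-07'])
s = pd.Series([1.0, 2.0, 3.0], index=idx)

# daily frequency, gaps left as NaN
daily = s.resample('D').asfreq()
print(daily)

# daily frequency, gaps filled with the last known value
filled = s.resample('D').ffill()
print(filled.tolist())   # [1.0, 1.0, 1.0, 2.0, 2.0, 3.0]
```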
Use the date_range function to generate a date range:
index = pd.date_range('4/1/2012','6/1/2012')
index
# Output
DatetimeIndex(['2012-04-01', '2012-04-02', '2012-04-03', '2012-04-04',
'2012-04-05', '2012-04-06', '2012-04-07', '2012-04-08',
'2012-04-09', '2012-04-10', '2012-04-11', '2012-04-12',
'2012-04-13', '2012-04-14', '2012-04-15', '2012-04-16',
'2012-04-17', '2012-04-18', '2012-04-19', '2012-04-20',
'2012-04-21', '2012-04-22', '2012-04-23', '2012-04-24',
'2012-04-25', '2012-04-26', '2012-04-27', '2012-04-28',
'2012-04-29', '2012-04-30', '2012-05-01', '2012-05-02',
'2012-05-03', '2012-05-04', '2012-05-05', '2012-05-06',
'2012-05-07', '2012-05-08', '2012-05-09', '2012-05-10',
'2012-05-11', '2012-05-12', '2012-05-13', '2012-05-14',
'2012-05-15', '2012-05-16', '2012-05-17', '2012-05-18',
'2012-05-19', '2012-05-20', '2012-05-21', '2012-05-22',
'2012-05-23', '2012-05-24', '2012-05-25', '2012-05-26',
'2012-05-27', '2012-05-28', '2012-05-29', '2012-05-30',
'2012-05-31', '2012-06-01'],
dtype='datetime64[ns]', freq='D')
By default, date_range generates daily timestamps. If you pass only a start or an end date, you must also pass the number of periods to generate:
pd.date_range(start='4/1/2012',periods=20)
# Output
DatetimeIndex(['2012-04-01', '2012-04-02', '2012-04-03', '2012-04-04',
'2012-04-05', '2012-04-06', '2012-04-07', '2012-04-08',
'2012-04-09', '2012-04-10', '2012-04-11', '2012-04-12',
'2012-04-13', '2012-04-14', '2012-04-15', '2012-04-16',
'2012-04-17', '2012-04-18', '2012-04-19', '2012-04-20'],
dtype='datetime64[ns]', freq='D')
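To generate timestamps at a frequency other than daily, pass the freq argument. For example, to generate one every 5 hours (pandas 2.2 and later spell the hourly alias with a lowercase 'h'):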
pd.date_range(end='4/1/2012',periods=20,freq='5H')
# Output
DatetimeIndex(['2012-03-28 01:00:00', '2012-03-28 06:00:00',
'2012-03-28 11:00:00', '2012-03-28 16:00:00',
'2012-03-28 21:00:00', '2012-03-29 02:00:00',
'2012-03-29 07:00:00', '2012-03-29 12:00:00',
'2012-03-29 17:00:00', '2012-03-29 22:00:00',
'2012-03-30 03:00:00', '2012-03-30 08:00:00',
'2012-03-30 13:00:00', '2012-03-30 18:00:00',
'2012-03-30 23:00:00', '2012-03-31 04:00:00',
'2012-03-31 09:00:00', '2012-03-31 14:00:00',
'2012-03-31 19:00:00', '2012-04-01 00:00:00'],
dtype='datetime64[ns]', freq='5H')
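Besides string aliases, freq also accepts offset objects from pandas.tseries.offsets, which avoids depending on how an alias is spelled in a given pandas version. A minimal sketch, including a combined offset of one and a half hours:

```python
import pandas as pd
from pandas.tseries.offsets import Hour, Minute

# every 5 hours, expressed as an offset object instead of a string
rng = pd.date_range('2012-04-01', periods=4, freq=Hour(5))
print(rng)

# offsets can be added together: 1 hour + 30 minutes = every 90 minutes
rng2 = pd.date_range('2012-04-01', periods=3, freq=Hour(1) + Minute(30))
print(rng2)
```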
To build a date index containing the last business day of each month, use the BM frequency (renamed 'BME' in pandas 2.2 and later):
pd.date_range('1/1/2000','12/1/2000',freq='BM')
# Output
DatetimeIndex(['2000-01-31', '2000-02-29', '2000-03-31', '2000-04-28',
'2000-05-31', '2000-06-30', '2000-07-31', '2000-08-31',
'2000-09-29', '2000-10-31', '2000-11-30'],
dtype='datetime64[ns]', freq='BM')
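The sketch below builds the same index with the BMonthEnd offset object rather than the string alias, then checks that every date falls on a weekday. April 2000 is a useful case: April 30 was a Sunday, so the index contains April 28 instead.

```python
import pandas as pd

rng = pd.date_range('2000-01-01', '2000-12-01', freq=pd.offsets.BMonthEnd())

print(len(rng))                               # 11
print(all(d.weekday() < 5 for d in rng))      # True, Monday=0 ... Friday=4
print(rng[3])                                 # 2000-04-28 00:00:00
```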
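By default, date_range preserves the time of day (if any) of the start or end timestamp. To generate a set of timestamps normalized to midnight instead, use the normalize option: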
pd.date_range('5/2/2012 12:56:31',periods=5)
# Output
DatetimeIndex(['2012-05-02 12:56:31', '2012-05-03 12:56:31',
'2012-05-04 12:56:31', '2012-05-05 12:56:31',
'2012-05-06 12:56:31'],
dtype='datetime64[ns]', freq='D')
pd.date_range('5/2/2012 12:56:31',periods=5,normalize=True)
# Output
DatetimeIndex(['2012-05-02', '2012-05-03', '2012-05-04', '2012-05-05',
'2012-05-06'],
dtype='datetime64[ns]', freq='D')
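Normalizing can also be done after the fact: both DatetimeIndex and Timestamp have a normalize() method that resets the time of day to midnight. A short sketch:

```python
import pandas as pd

rng = pd.date_range('2012-05-02 12:56:31', periods=3)

# reset every timestamp in an existing index to midnight
midnight = rng.normalize()
print(midnight)

# the same operation on a single Timestamp
stamp = pd.Timestamp('2012-05-02 12:56:31').normalize()
print(stamp)   # 2012-05-02 00:00:00
```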
WOM (Week of Month) is a very useful frequency class whose aliases start with WOM. It lets you get dates such as the third Friday of each month:
rng = pd.date_range('1/1/2012','9/1/2012',freq='WOM-3FRI')
rng
# Output
DatetimeIndex(['2012-01-20', '2012-02-17', '2012-03-16', '2012-04-20',
'2012-05-18', '2012-06-15', '2012-07-20', '2012-08-17'],
dtype='datetime64[ns]', freq='WOM-3FRI')
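As a quick sanity check on the WOM-3FRI result, this sketch verifies that every generated date is a Friday and falls between the 15th and the 21st, which is where a third Friday must land.

```python
import pandas as pd

rng = pd.date_range('2012-01-01', '2012-09-01', freq='WOM-3FRI')

# weekday() == 4 means Friday
all_fridays = all(d.weekday() == 4 for d in rng)

# the third occurrence of a weekday is always on day 15..21
in_third_week = all(15 <= d.day <= 21 for d in rng)

print(len(rng), all_fridays, in_third_week)   # 8 True True
```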