pandas入门
简介
pandas包含的数据结构和操作工具能快速简单地清洗和分析数据。
pandas经常与NumPy和SciPy这样的数据计算工具,statsmodels和scikit-learn之类的分析库及数据可视化库(如matplotlib)等一起用使用。pandas基于NumPy的数组,经常可以不使用循环就能处理好大量数据。
pandas适合处理表格数据或巨量数据。NumPy则适合处理巨量的数值数组数据。
这里约定导入方式:
技术支持qq群:521070358 630011153
#!python
import pandas as pd
参考资料
主要数据结构:Series和DataFrame。
Series
Series类似于一维数组的对象,它由一组数据(NumPy类似数据类型)以及相关的数据标签(即索引)组成。仅由一组数据即可产生最简单的Series:
#!python
In [2]: import pandas as pd
In [3]: obj = pd.Series([4, 7, -5, 3])
In [4]: obj
Out[4]:
0 4
1 7
2 -5
3 3
dtype: int64
In [5]: obj.values
Out[5]: array([ 4, 7, -5, 3])
In [6]: obj.index
Out[6]: Int64Index([0, 1, 2, 3], dtype='int64')
指定索引:
#!python
In [2]: obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
In [3]: obj2
Out[3]:
d 4
b 7
a -5
c 3
dtype: int64
In [4]: obj2.index
Out[4]: Index(['d', 'b', 'a', 'c'], dtype='object')
In [10]: obj2['a']
Out[10]: -5
In [11]: obj2['d'] = 6
In [12]: obj2[['c', 'a', 'd']]
Out[12]:
c 3
a -5
d 6
dtype: int64
可见与普通NumPy数组相比,你还可以通过索引的方式选取Series中的值。
NumPy函数或类似操作,如根据布尔型数组进行过滤、标量乘法、应用数学函数等)都会保留索引和值之间的链接:
#!python
In [13]: obj2[obj2 > 0]
Out[13]:
d 6
b 7
c 3
dtype: int64
In [14]: obj2 * 2
Out[14]:
d 12
b 14
a -10
c 6
dtype: int64
In [15]: obj2
Out[15]:
d 6
b 7
a -5
c 3
dtype: int64
In [17]: import numpy as np
In [18]: np.exp(obj2)
Out[18]:
d 403.428793
b 1096.633158
a 0.006738
c 20.085537
dtype: float64
In [19]: 'b' in obj2
Out[19]: True
In [20]: 'e' in obj2
Out[20]: False
可见可以吧Series看成是定长的有序字典。也可由字典创建Series:
#!python
In [21]: sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
In [22]: obj3 = pd.Series(sdata)
In [23]: obj3
Out[23]:
Ohio 35000
Oregon 16000
Texas 71000
Utah 5000
dtype: int64
In [24]: states = ['California', 'Ohio', 'Oregon', 'Texas']
In [25]: obj4 = pd.Series(sdata, index=states)
In [26]: obj4
Out[26]:
California NaN
Ohio 35000.0
Oregon 16000.0
Texas 71000.0
dtype: float64
In [27]: pd.isnull(obj4)
Out[27]:
California True
Ohio False
Oregon False
Texas False
dtype: bool
In [28]: pd.notnull(obj4)
Out[28]:
California False
Ohio True
Oregon True
Texas True
dtype: bool
In [29]: obj4.isnull()
Out[29]:
California True
Ohio False
Oregon False
Texas False
dtype: bool
In [32]: obj4.notnull()
Out[32]:
California False
Ohio True
Oregon True
Texas True
dtype: bool
相加
#!python
In [33]: obj3
Out[33]:
Ohio 35000
Oregon 16000
Texas 71000
Utah 5000
dtype: int64
In [34]: obj4
Out[34]:
California NaN
Ohio 35000.0
Oregon 16000.0
Texas 71000.0
dtype: float64
In [35]: obj3 + obj4
Out[35]:
California NaN
Ohio 70000.0
Oregon 32000.0
Texas 142000.0
Utah NaN
dtype: float64
In [36]: obj4.name = 'population'
In [37]: obj4.index.name = 'state'
In [38]: obj4
Out[38]:
state
California NaN
Ohio 35000.0
Oregon 16000.0
Texas 71000.0
Name: population, dtype: float64
In [40]: obj = pd.Series([4, 7, -5, 3])
In [41]: obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
In [42]: obj
Out[42]:
Bob 4
Steve 7
Jeff -5
Ryan 3
dtype: int64
本文代码地址:https://github.com/china-testing/python-api-tesing/
本文最新版本地址:http://t.cn/R8tJ9JH
交流QQ群:python 测试开发 144081101
wechat: pythontesting
淘宝天猫可以把链接发给qq850766020,为你生成优惠券,降低你的购物成本!
DataFrame
DataFrame是矩状表格型的数据结构,包含有序的列,每列可以是不同的类型(数值、字符串、布尔值等)。DataFrame既有行索引也有列索引,它可以被看做由相同索引的Series组成的字典。DataFrame中的数据是以一个或多个二维块存放的。
构建DataFrame的办法有很多,最常用的是直接传入等长列表或NumPy数组组成的字典。DataFrame会自动加上索引(跟Series一样),有序排列。
#!python
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]:
In [3]: data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
...: 'year': [2000, 2001, 2002, 2001, 2002, 2003],
...: 'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
In [4]:
In [4]: frame = pd.DataFrame(data)
In [5]: frame
Out[5]:
pop state year
0 1.5 Ohio 2000
1 1.7 Ohio 2001
2 3.6 Ohio 2002
3 2.4 Nevada 2001
4 2.9 Nevada 2002
5 3.2 Nevada 2003
In [6]: frame.head()
Out[6]:
pop state year
0 1.5 Ohio 2000
1 1.7 Ohio 2001
2 3.6 Ohio 2002
3 2.4 Nevada 2001
4 2.9 Nevada 2002
In [7]:
In [7]: pd.DataFrame(data, columns=['year', 'state', 'pop'])
Out[7]:
year state pop
0 2000 Ohio 1.5
1 2001 Ohio 1.7
2 2002 Ohio 3.6
3 2001 Nevada 2.4
4 2002 Nevada 2.9
5 2003 Nevada 3.2
In [8]: frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
...: index=['one', 'two', 'three', 'four', 'five', 'six'])
In [9]: frame2
Out[9]:
year state pop debt
one 2000 Ohio 1.5 NaN
two 2001 Ohio 1.7 NaN
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 NaN
five 2002 Nevada 2.9 NaN
six 2003 Nevada 3.2 NaN
In [10]: frame2['state']
Out[10]:
one Ohio
two Ohio
three Ohio
four Nevada
five Nevada
six Nevada
Name: state, dtype: object
可见还可以通过columns指定DataFrame的列序, index指定索引名。跟Series一样,如果传入的列在数据中找不到,就会产生NaN值。通过类似字典的方式或属性的方式,可以将DataFrame的列获取为Series,返回的Series拥有DataFrame相同的索引,且其name属性也已经被相应地设置好。
行也可以用loc属性通过位置或名称的方式进行获取。列可以通过赋值的方式进行修改。
将列表或数组赋值给某个列时,其长度必须跟DataFrame的长度相匹配。如果赋值的是Series,就会精确匹配
#!python
In [11]: frame2.loc['three']
Out[11]:
year 2002
state Ohio
pop 3.6
debt NaN
Name: three, dtype: object
In [12]: frame2['debt'] = 16.5
In [13]: frame2
Out[13]:
year state pop debt
one 2000 Ohio 1.5 16.5
two 2001 Ohio 1.7 16.5
three 2002 Ohio 3.6 16.5
four 2001 Nevada 2.4 16.5
five 2002 Nevada 2.9 16.5
six 2003 Nevada 3.2 16.5
In [14]: frame2['debt'] = np.arange(6.)
In [15]: frame2
Out[15]:
year state pop debt
one 2000 Ohio 1.5 0.0
two 2001 Ohio 1.7 1.0
three 2002 Ohio 3.6 2.0
four 2001 Nevada 2.4 3.0
five 2002 Nevada 2.9 4.0
six 2003 Nevada 3.2 5.0
In [16]: val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
In [17]: frame2['debt'] = val
In [18]: frame2
Out[18]:
year state pop debt
one 2000 Ohio 1.5 NaN
two 2001 Ohio 1.7 -1.2
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 -1.5
five 2002 Nevada 2.9 -1.7
six 2003 Nevada 3.2 NaN
为不存在的列赋值会创建出一个新列。关键字del用于删除列:
#!python
In [19]: frame2['eastern'] = frame2['state'] == 'Ohio'
In [20]: frame2
Out[20]:
year state pop debt eastern
one 2000 Ohio 1.5 NaN True
two 2001 Ohio 1.7 -1.2 True
three 2002 Ohio 3.6 NaN True
four 2001 Nevada 2.4 -1.5 False
five 2002 Nevada 2.9 -1.7 False
six 2003 Nevada 3.2 NaN False
In [21]: del frame2['eastern']
In [22]: frame2.columns
Out[22]: Index(['year', 'state', 'pop', 'debt'], dtype='object')
通过索引方式返回的列只是相应数据的视图而不是副本。因此,对返回的Series所做的任何就地修改
全都会反映到源DataFrame上。通过Series的copy方法即可显式地复制列。
另一种常见的数据形式是嵌套字典,外层字典的键作为列,内层键则作为行索引:
#!python
In [23]: pop = {'Nevada': {2001: 2.4, 2002: 2.9},
....: 'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
In [24]: frame3 = pd.DataFrame(pop)
In [25]: frame3
Out[25]:
Nevada Ohio
2000 NaN 1.5
2001 2.4 1.7
2002 2.9 3.6
In [26]: frame3.T
Out[26]:
2000 2001 2002
Nevada NaN 2.4 2.9
Ohio 1.5 1.7 3.6
In [27]: pd.DataFrame(pop, index=[2001, 2002, 2003])
Out[27]:
Nevada Ohio
2001 2.4 1.7
2002 2.9 3.6
2003 NaN NaN
In [28]: pdata = {'Ohio': frame3['Ohio'][:-1], 'Nevada': frame3['Nevada'][:2]}
In [29]: pdata
Out[29]:
{'Ohio': 2000 1.5
2001 1.7
Name: Ohio, dtype: float64, 'Nevada': 2000 NaN
2001 2.4
Name: Nevada, dtype: float64}
In [30]: pd.DataFrame(pdata)
Out[30]:
Nevada Ohio
2000 NaN 1.5
2001 2.4 1.7
In [31]: frame3.index.name = 'year'; frame3.columns.name = 'state'
In [32]: frame3
Out[32]:
state Nevada Ohio
year
2000 NaN 1.5
2001 2.4 1.7
2002 2.9 3.6
In [33]: frame3.values
Out[33]:
array([[ nan, 1.5],
[ 2.4, 1.7],
[ 2.9, 3.6]])
In [34]: frame2.values
Out[34]:
array([[2000, 'Ohio', 1.5, nan],
[2001, 'Ohio', 1.7, -1.2],
[2002, 'Ohio', 3.6, nan],
[2001, 'Nevada', 2.4, -1.5],
[2002, 'Nevada', 2.9, -1.7],
[2003, 'Nevada', 3.2, nan]], dtype=object)
可见可以转置,由Series组成的字典和字典类似。如果设置了DataFrame的index和columns的name属性,则这些信息也会被显示出来。跟Series一样,values属性也会以二维ndarray的形式返回DataFrame中的数据。如果DataFrame各列的数据类型不同,则值数组的数据类型就会选用能兼容所有列的数据类型。
DataFrame的constructor接受的类型为:2D ndarray、dict of arrays, lists, or tuples、NumPy structured/record、array、dict of Series、dict of dicts、List of dicts or Series、List of lists or tuples、Another DataFrame、NumPy MaskedArray。
更多参考: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html
索引对象
pandas的索引对象负责管理轴标签和其他元数据(比如轴名称等)。构建Series或DataFrame时,所用到的任何数组或其他序列的标签都会被转换成Index。
#!python
In [35]: obj = pd.Series(range(3), index=['a', 'b', 'c'])
In [36]: index = obj.index
In [37]: index
Out[37]: Index(['a', 'b', 'c'], dtype='object')
In [38]: index[1:]
Out[38]: Index(['b', 'c'], dtype='object')
In [39]: index[1] = 'd'
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-39-676fdeb26a68> in <module>()
----> 1 index[1] = 'd'
/usr/local/lib/python3.5/dist-packages/pandas/core/indexes/base.py in __setitem__(self, key, value)
1722
1723 def __setitem__(self, key, value):
-> 1724 raise TypeError("Index does not support mutable operations")
1725
1726 def __getitem__(self, key):
TypeError: Index does not support mutable operations
In [40]: labels = pd.Index(np.arange(3))
In [41]: labels
Out[41]: Int64Index([0, 1, 2], dtype='int64')
In [42]: obj2 = pd.Series([1.5, -2.5, 0], index=labels)
In [43]: obj2
Out[43]:
0 1.5
1 -2.5
2 0.0
dtype: float64
In [44]: obj2.index is labels
Out[44]: True
In [45]: frame3
Out[45]:
state Nevada Ohio
year
2000 NaN 1.5
2001 2.4 1.7
2002 2.9 3.6
In [46]: frame3.columns
Out[46]: Index(['Nevada', 'Ohio'], dtype='object', name='state')
In [47]: 'Ohio' in frame3.columns
Out[47]: True
In [48]: 2003 in frame3.index
Out[48]: False
In [49]: dup_labels = pd.Index(['foo', 'foo', 'bar', 'bar'])
In [50]: dup_labels
Out[50]: Index(['foo', 'foo', 'bar', 'bar'], dtype='object')
Index对象是不可变的,因此用户不能对其进行修改,这样Index对象在多个数据结构之间可安全共享。除了像数组,Index类似固定大小的集合。
Index的方法和属性有:append,difference,intersection,union,isin,delete,drop,insert,is_monotonic,unique。
更多参考: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.html
基本功能
本节中,我将介绍操作Series和DataFrame中的数据的基本手段。
重新索引
#!python
In [51]: obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
In [52]: obj
Out[52]:
d 4.5
b 7.2
a -5.3
c 3.6
dtype: float64
# 调用reindex将会根据新索引进行重排。如果某个索引值当前不存在,就为NaN
In [53]: obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
In [54]: obj2
Out[54]:
a -5.3
b 7.2
c 3.6
d 4.5
e NaN
dtype: float64
In [55]: obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
In [56]: obj3
Out[56]:
0 blue
2 purple
4 yellow
dtype: object
# 对于时间序列这样的有序数据,重新索引时可能需要做插值处理。method选项即可达到此目的,例如,使用ffill以实现前向值填充:
In [57]: obj3.reindex(range(6), method='ffill')
Out[57]:
0 blue
1 blue
2 purple
3 purple
4 yellow
5 yellow
dtype: object
# DataFrame中reindex可以调整行列
In [58]: frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
....: index=['a', 'c', 'd'],
....: columns=['Ohio', 'Texas', 'California'])
In [59]: frame
Out[59]:
Ohio Texas California
a 0 1 2
c 3 4 5
d 6 7 8
In [60]: frame2 = frame.reindex(['a', 'b', 'c', 'd'])
In [61]: frame2
Out[61]:
Ohio Texas California
a 0.0 1.0 2.0
b NaN NaN NaN
c 3.0 4.0 5.0
d 6.0 7.0 8.0
In [62]: states = ['Texas', 'Utah', 'California']
In [63]: frame.reindex(columns=states)
Out[63]:
Texas Utah California
a 1 NaN 2
c 4 NaN 5
d 7 NaN 8
In [69]: frame2 = frame.reindex(['a', 'b', 'c', 'd'],columns=states)
In [70]: frame2
Out[70]:
Texas Utah California
a 1.0 NaN 2.0
b NaN NaN NaN
c 4.0 NaN 5.0
d 7.0 NaN 8.0
reindex函数的参数有index,method,fill_value,limit,tolerance,level,copy等。
更多参考: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reindex.html
丢弃指定轴上的项
丢弃某条轴上的一项很简单,只要有索引数组或列表即可。由于需要执行一些数据整理和集合逻辑,所以drop方法返回的是在指定轴上删除了指定值的新对象:
#!python
In [71]: obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
In [72]: obj
Out[72]:
a 0.0
b 1.0
c 2.0
d 3.0
e 4.0
dtype: float64
In [73]: new_obj = obj.drop('c')
In [74]: new_obj
Out[74]:
a 0.0
b 1.0
d 3.0
e 4.0
dtype: float64
In [75]: obj
Out[75]:
a 0.0
b 1.0
c 2.0
d 3.0
e 4.0
dtype: float64
In [76]: obj.drop(['d', 'c'])
Out[76]:
a 0.0
b 1.0
e 4.0
dtype: float64
In [77]: obj
Out[77]:
a 0.0
b 1.0
c 2.0
d 3.0
e 4.0
dtype: float64
In [78]: data = pd.DataFrame(np.arange(16).reshape((4, 4)),
....: index=['Ohio', 'Colorado', 'Utah', 'New York'],
....: columns=['one', 'two', 'three', 'four'])
In [79]: data
Out[79]:
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
In [80]: data.drop(['Colorado', 'Ohio'])
Out[80]:
one two three four
Utah 8 9 10 11
New York 12 13 14 15
In []: data.drop('two',1)
Out[57]:
one three four
Ohio 0 2 3
Colorado 4 6 7
Utah 8 10 11
New York 12 14 15
In []: data.drop('two', axis=1)
Out[58]:
one three four
Ohio 0 2 3
Colorado 4 6 7
Utah 8 10 11
New York 12 14 15
In []: data.drop(['two', 'four'], axis='columns')
Out[59]:
one three
Ohio 0 2
Colorado 4 6
Utah 8 10
New York 12 14
In []: obj.drop('c', inplace=True)
In []: obj
Out[61]:
d 4.5
b 7.2
a -5.3
dtype: float64
索引、选取和过滤
Series索引(obj[...])的工作方式类似于NumPy数组的索引,只不过Series的索引值不只是整数。下面是几个例子:
#!python
obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
obj
Out[63]:
a 0.0
b 1.0
c 2.0
d 3.0
dtype: float64
obj['b']
Out[64]: 1.0
obj[1]
Out[65]: 1.0
obj[2:4]
Out[66]:
c 2.0
d 3.0
dtype: float64
obj[['b', 'a', 'd']]
Out[67]:
b 1.0
a 0.0
d 3.0
dtype: float64
obj[[1, 3]]
Out[68]:
b 1.0
d 3.0
dtype: float64
obj[obj < 2]
Out[69]:
a 0.0
b 1.0
dtype: float64
obj['b':'c']
Out[70]:
b 1.0
c 2.0
dtype: float64
obj['b':'c'] = 5
obj
Out[72]:
a 0.0
b 5.0
c 5.0
d 3.0
dtype: float64
注意标签的方式和python的列表不同,后面的index也是包含在里面的。
#!python
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
index=['Ohio', 'Colorado', 'Utah', 'New York'],
columns=['one', 'two', 'three', 'four'])
data
Out[74]:
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
data['two']
Out[75]:
Ohio 1
Colorado 5
Utah 9
New York 13
Name: two, dtype: int32
data[['three', 'one']]
Out[76]:
three one
Ohio 2 0
Colorado 6 4
Utah 10 8
New York 14 12
data[:2]
Out[77]:
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
data[data['three'] > 5]
Out[78]:
one two three four
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
data < 5
Out[79]:
one two three four
Ohio True True True True
Colorado True False False False
Utah False False False False
New York False False False False
data[data < 5] = 0
data
Out[81]:
one two three four
Ohio 0 0 0 0
Colorado 0 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
- loc和iloc
对于行上的DataFrame标签索引有特殊的索引操作符loc(标签)和iloc(整数索引)能够从DataFrame中选择子集。
#!python
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
index=['Ohio', 'Colorado', 'Utah', 'New York'],
columns=['one', 'two', 'three', 'four'])
data
Out[74]:
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
data['two']
Out[75]:
Ohio 1
Colorado 5
Utah 9
New York 13
Name: two, dtype: int32
data[['three', 'one']]
Out[76]:
three one
Ohio 2 0
Colorado 6 4
Utah 10 8
New York 14 12
data[:2]
Out[77]:
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
data[data['three'] > 5]
Out[78]:
one two three four
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
data < 5
Out[79]:
one two three four
Ohio True True True True
Colorado True False False False
Utah False False False False
New York False False False False
data[data < 5] = 0
data
Out[81]:
one two three four
Ohio 0 0 0 0
Colorado 0 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
data.loc['Colorado', ['two', 'three']]
Out[82]:
two 5
three 6
Name: Colorado, dtype: int32
data.iloc[2, [3, 0, 1]]
Out[83]:
four 11
one 8
two 9
Name: Utah, dtype: int32
data.iloc[2]
Out[84]:
one 8
two 9
three 10
four 11
Name: Utah, dtype: int32
data.iloc[[1, 2], [3, 0, 1]]
Out[85]:
four one two
Colorado 7 0 5
Utah 11 8 9
data.loc[:'Utah', 'two']
Out[86]:
Ohio 0
Colorado 5
Utah 9
Name: two, dtype: int32
data.iloc[:, :3][data.three > 5]
Out[87]:
one two three
Colorado 0 5 6
Utah 8 9 10
New York 12 13 14
注意ix现在已经不推荐使用。
整数索引(Integer Indexes)
pandas对象的整数索引与内置Python数据的索引语义存在一些差异,以下代码会生成错误:
#!python
ser = pd.Series(np.arange(3.))
ser[-1]
Traceback (most recent call last):
File "<ipython-input-20-3cbe0b873a9e>", line 1, in <module>
ser[-1]
File "C:\Users\andrew\AppData\Local\conda\conda\envs\my_root\lib\site-packages\pandas\core\series.py", line 601, in __getitem__
result = self.index.get_value(self, key)
File "C:\Users\andrew\AppData\Local\conda\conda\envs\my_root\lib\site-packages\pandas\core\indexes\base.py", line 2477, in get_value
tz=getattr(series.dtype, 'tz', None))
File "pandas\_libs\index.pyx", line 98, in pandas._libs.index.IndexEngine.get_value
File "pandas\_libs\index.pyx", line 106, in pandas._libs.index.IndexEngine.get_value
File "pandas\_libs\index.pyx", line 154, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\hashtable_class_helper.pxi", line 759, in pandas._libs.hashtable.Int64HashTable.get_item
File "pandas\_libs\hashtable_class_helper.pxi", line 765, in pandas._libs.hashtable.Int64HashTable.get_item
KeyError: -1
ser2 = pd.Series(np.arange(3.), index=['a', 'b', 'c'])
ser2[-1]
Out[22]: 2.0
ser[:1]
Out[23]:
0 0.0
dtype: float64
ser.loc[:1]
Out[24]:
0 0.0
1 1.0
dtype: float64
ser.iloc[:1]
Out[25]:
0 0.0
dtype: float64
算术和数据对齐
pandas可在不同索引的对象建进行算术运算,类似数据库的连接:
#!python
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])
s1
Out[28]:
a 7.3
c -2.5
d 3.4
e 1.5
dtype: float64
s2
Out[29]:
a -2.1
c 3.6
e -1.5
f 4.0
g 3.1
dtype: float64
s1 + s2
Out[30]:
a 5.2
c 1.1
d NaN
e 0.0
f NaN
g NaN
dtype: float64
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
index=['Utah', 'Ohio', 'Texas', 'Oregon'])
df1
Out[33]:
b c d
Ohio 0.0 1.0 2.0
Texas 3.0 4.0 5.0
Colorado 6.0 7.0 8.0
df2
Out[34]:
b d e
Utah 0.0 1.0 2.0
Ohio 3.0 4.0 5.0
Texas 6.0 7.0 8.0
Oregon 9.0 10.0 11.0
df1 + df2
Out[35]:
b c d e
Colorado NaN NaN NaN NaN
Ohio 3.0 NaN 6.0 NaN
Oregon NaN NaN NaN NaN
Texas 9.0 NaN 12.0 NaN
Utah NaN NaN NaN NaN
df1 = pd.DataFrame({'A': [1, 2]})
df2 = pd.DataFrame({'B': [3, 4]})
df1
Out[38]:
A
0 1
1 2
df2
Out[39]:
B
0 3
1 4
df1 - df2
Out[40]:
A B
0 NaN NaN
1 NaN NaN
还可以进行值的填充
#!python
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)),
columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)),
columns=list('abcde'))
df1
Out[43]:
a b c d
0 0.0 1.0 2.0 3.0
1 4.0 5.0 6.0 7.0
2 8.0 9.0 10.0 11.0
df2
Out[44]:
a b c d e
0 0.0 1.0 2.0 3.0 4.0
1 5.0 6.0 7.0 8.0 9.0
2 10.0 11.0 12.0 13.0 14.0
3 15.0 16.0 17.0 18.0 19.0
df2.loc[1, 'b'] = np.nan
df2
Out[46]:
a b c d e
0 0.0 1.0 2.0 3.0 4.0
1 5.0 NaN 7.0 8.0 9.0
2 10.0 11.0 12.0 13.0 14.0
3 15.0 16.0 17.0 18.0 19.0
df1 + df2
Out[47]:
a b c d e
0 0.0 2.0 4.0 6.0 NaN
1 9.0 NaN 13.0 15.0 NaN
2 18.0 20.0 22.0 24.0 NaN
3 NaN NaN NaN NaN NaN
df1.add(df2, fill_value=0)
Out[48]:
a b c d e
0 0.0 2.0 4.0 6.0 4.0
1 9.0 5.0 13.0 15.0 9.0
2 18.0 20.0 22.0 24.0 14.0
3 15.0 16.0 17.0 18.0 19.0
1 / df1
Out[49]:
a b c d
0 inf 1.000000 0.500000 0.333333
1 0.250000 0.200000 0.166667 0.142857
2 0.125000 0.111111 0.100000 0.090909
df1.rdiv(1)
Out[50]:
a b c d
0 inf 1.000000 0.500000 0.333333
1 0.250000 0.200000 0.166667 0.142857
2 0.125000 0.111111 0.100000 0.090909
df1.reindex(columns=df2.columns, fill_value=0)
Out[53]:
a b c d e
0 0.0 1.0 2.0 3.0 0
1 4.0 5.0 6.0 7.0 0
2 8.0 9.0 10.0 11.0 0
Method | Description |
---|---|
add, radd | for addition (+) |
sub, rsub | for subtraction (-) |
div, rdiv | for division (/) |
floordiv, rfloordiv | for floor division (//) |
mul, rmul | for multiplication (*) |
pow, rpow | for exponentiation (**) |
- DataFrame和Series间的操作
默认基于行进行广播,用( axis='index' or axis=0 )可以基于列进行广播。
#!python
arr = np.arange(12.).reshape((3, 4))
arr
Out[55]:
array([[ 0., 1., 2., 3.],
[ 4., 5., 6., 7.],
[ 8., 9., 10., 11.]])
arr[0]
Out[56]: array([ 0., 1., 2., 3.])
arr - arr[0]
Out[57]:
array([[ 0., 0., 0., 0.],
[ 4., 4., 4., 4.],
[ 8., 8., 8., 8.]])
arr
Out[58]:
array([[ 0., 1., 2., 3.],
[ 4., 5., 6., 7.],
[ 8., 9., 10., 11.]])
frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
columns=list('bde'),index=['Utah', 'Ohio', 'Texas', 'Oregon'])
series = frame.iloc[0]
frame
Out[61]:
b d e
Utah 0.0 1.0 2.0
Ohio 3.0 4.0 5.0
Texas 6.0 7.0 8.0
Oregon 9.0 10.0 11.0
series
Out[62]:
b 0.0
d 1.0
e 2.0
Name: Utah, dtype: float64
frame - series
Out[63]:
b d e
Utah 0.0 0.0 0.0
Ohio 3.0 3.0 3.0
Texas 6.0 6.0 6.0
Oregon 9.0 9.0 9.0
series2 = pd.Series(range(3), index=['b', 'e', 'f'])
series2
Out[65]:
b 0
e 1
f 2
dtype: int32
frame + series2
Out[66]:
b d e f
Utah 0.0 NaN 3.0 NaN
Ohio 3.0 NaN 6.0 NaN
Texas 6.0 NaN 9.0 NaN
Oregon 9.0 NaN 12.0 NaN
series3 = frame['d']
frame
Out[69]:
b d e
Utah 0.0 1.0 2.0
Ohio 3.0 4.0 5.0
Texas 6.0 7.0 8.0
Oregon 9.0 10.0 11.0
series3
Out[70]:
Utah 1.0
Ohio 4.0
Texas 7.0
Oregon 10.0
Name: d, dtype: float64
frame.sub(series3, axis='index')
Out[71]:
b d e
Utah -1.0 0.0 1.0
Ohio -1.0 0.0 1.0
Texas -1.0 0.0 1.0
Oregon -1.0 0.0 1.0
函数应用和映射
NumPy的ufuncs(元素级数组方法)也可用于操作pandas对象:
另一个常见的操作是将函数应用到由各列或行所形成的一维数组上。DataFrame的apply方法即可实现此功能:
许多最为常见的数组统计功能都被实现成DataFrame的方法(如sum和mean),因此无需使用apply方法。除标量值外,传递给apply的函数还可以返回由多个值组成的Series:
此外,元素级的Python函数也是可以用的。假如你想得到frame中各个浮点值的格式化字符串,使用applymap即可:
之所以叫做applymap,是因为Series有一个用于应用元素级函数的map方法:
#!python
arr = np.arange(12.).reshape((3, 4))
arr
Out[73]:
array([[ 0., 1., 2., 3.],
[ 4., 5., 6., 7.],
[ 8., 9., 10., 11.]])
arr[0]
Out[74]: array([ 0., 1., 2., 3.])
arr - arr[0]
Out[75]:
array([[ 0., 0., 0., 0.],
[ 4., 4., 4., 4.],
[ 8., 8., 8., 8.]])
frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame
Out[77]:
b d e
Utah 0.255395 1.983985 0.936326
Ohio 0.319394 2.231544 -0.051256
Texas -0.041388 -0.026032 -0.446722
Oregon 1.099475 -1.432638 -0.919189
np.abs(frame)
Out[78]:
b d e
Utah 0.255395 1.983985 0.936326
Ohio 0.319394 2.231544 0.051256
Texas 0.041388 0.026032 0.446722
Oregon 1.099475 1.432638 0.919189
f = lambda x: x.max() - x.min()
frame.apply(f)
Out[80]:
b 1.140863
d 3.664181
e 1.855515
dtype: float64
frame.apply(f, axis='columns')
Out[81]:
Utah 1.728590
Ohio 2.282800
Texas 0.420690
Oregon 2.532113
dtype: float64
def f(x):
return pd.Series([x.min(), x.max()], index=['min', 'max'])
frame.apply(f)
Out[83]:
b d e
min -0.041388 -1.432638 -0.919189
max 1.099475 2.231544 0.936326
format = lambda x: '%.2f' % x
frame.applymap(format)
Out[85]:
b d e
Utah 0.26 1.98 0.94
Ohio 0.32 2.23 -0.05
Texas -0.04 -0.03 -0.45
Oregon 1.10 -1.43 -0.92
frame['e'].map(format)
Out[86]:
Utah 0.94
Ohio -0.05
Texas -0.45
Oregon -0.92
Name: e, dtype: object
排序和排名
根据条件对数据集排序(sorting)也是重要的内置运算。要对行或列索引进行排序(按字典顺序),可使用sort_index方法,它将返回一个已排序的新对象。
而对于DataFrame,则可以根据任意轴上的索引进行排序:
#!python
obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])
obj.sort_index()
Out[88]:
a 1
b 2
c 3
d 0
dtype: int32
frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
index=['three', 'one'],columns=['d', 'a', 'b', 'c'])
frame
Out[90]:
d a b c
three 0 1 2 3
one 4 5 6 7
frame.sort_index()
Out[91]:
d a b c
one 4 5 6 7
three 0 1 2 3
frame.sort_index(axis='columns')
Out[94]:
a b c d
three 1 2 3 0
one 5 6 7 4
数据默认是按升序排序的,但也可以降序排序,若要按值对Series进行排序,可使用其order方法。在排序时,任何缺失值默认都会被放到Series的末尾。
#!python
frame.sort_index(axis='columns', ascending=False)
Out[95]:
d c b a
three 0 3 2 1
one 4 7 6 5
obj = pd.Series([4, 7, -3, 2])
obj.sort_values()
Out[97]:
2 -3
3 2
0 4
1 7
dtype: int64
obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])
obj.sort_values()
Out[99]:
4 -3.0
5 2.0
0 4.0
2 7.0
1 NaN
3 NaN
dtype: float64
obj.sort_values(ascending=False)
Out[100]:
2 7.0
0 4.0
5 2.0
4 -3.0
1 NaN
3 NaN
dtype: float64
在DataFrame上,你可能希望根据一个或多个列中的值进行排序。将一个或多个列的名字传递给by选项即可。要根据多个列进行排序,传入名称的列表即可:
#!python
frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
frame
Out[102]:
a b
0 0 4
1 1 7
2 0 -3
3 1 2
frame.sort_values(by='b')
Out[103]:
a b
2 0 -3
3 1 2
0 0 4
1 1 7
frame.sort_values(by=['a', 'b'])
Out[104]:
a b
2 0 -3
0 0 4
3 1 2
1 1 7
排名(ranking)跟排序关系密切,且它会增设排名值(从1开始,一直到数组中有效数据的数量)。它跟numpy.argsort产生的间接排序索引差不多,只不过它可以根据某种规则破坏平级关系。接下来介绍Series和DataFrame的rank方法。默认情况下,rank是通过“为各组分配一个平均排名”的方式破坏平级关系的:
也可以根据值在原数据中出现的顺序给出排名
当然,你也可以按降序进行排名:
#!python
obj = pd.Series([7, -5, 7, 4, 2, 0, 4])
obj.rank()
Out[106]:
0 6.5
1 1.0
2 6.5
3 4.5
4 3.0
5 2.0
6 4.5
dtype: float64
obj.rank(method='first')
Out[107]:
0 6.0
1 1.0
2 7.0
3 4.0
4 3.0
5 2.0
6 5.0
dtype: float64
obj.rank(ascending=False, method='max')
Out[108]:
0 2.0
1 7.0
2 2.0
3 4.0
4 5.0
5 6.0
6 4.0
dtype: float64
frame = pd.DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1],
'c': [-2, 5, 8, -2.5]})
frame
Out[110]:
a b c
0 0 4.3 -2.0
1 1 7.0 5.0
2 0 -3.0 8.0
3 1 2.0 -2.5
frame.rank(axis='columns')
Out[111]:
a b c
0 2.0 3.0 1.0
1 1.0 3.0 2.0
2 2.0 1.0 3.0
3 2.0 3.0 1.0
Method | Description |
---|---|
'average' | Default: assign the average rank to each entry in the equal group |
'min' | Use the minimum rank for the whole group |
'max' | Use the maximum rank for the whole group |
'first' | Assign ranks in the order the values appear in the data |
'dense' | Like method='min' , but ranks always increase by 1 in between groups rather than the number of equal |
elements in a group |
带有重复值的轴索引
直到目前为止,我所介绍的所有范例都有着唯一的轴标签(索引值)。虽然许多pandas函数(如reindex)都要求标签唯一,但这并不是强制性的。
#!python
import pandas as pd
obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
obj
Out[3]:
a 0
a 1
b 2
b 3
c 4
dtype: int32
obj.index.is_unique
Out[4]: False
obj['a']
Out[5]:
a 0
a 1
dtype: int32
obj['c']
Out[6]: 4
import numpy as np
df = pd.DataFrame(np.random.randn(4, 3), index=['a', 'a', 'b', 'b'])
df
Out[10]:
0 1 2
a 0.835470 0.465657 -0.068212
a -1.067020 1.148283 1.722324
b 0.057184 -0.441111 -0.388286
b -0.363911 -0.599963 0.126594
df.loc['b']
Out[11]:
0 1 2
b 0.057184 -0.441111 -0.388286
b -0.363911 -0.599963 0.126594
汇总和计算描述统计
pandas对象拥有一组常用的数学和统计方法。它们大部分都属于reduction和summary统计,用于从Series中提取单个值(如sum或mean)或从DataFrame的行或列中提取Series。跟对应的NumPy数组方法相比,它们都是基于没有缺失数据的假设而构建的。接下来看一个简单DataFrame:
#!python
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],[np.nan, np.nan], [0.75, -1.3]],
index=['a', 'b', 'c', 'd'],columns=['one', 'two'])
df
Out[14]:
one two
a 1.40 NaN
b 7.10 -4.5
c NaN NaN
d 0.75 -1.3
df.sum()
Out[15]:
one 9.25
two -5.80
dtype: float64
df.sum(axis='columns')
Out[16]:
a 1.40
b 2.60
c 0.00
d -0.55
dtype: float64
df.mean(axis='columns', skipna=False)
Out[17]:
a NaN
b 1.300
c NaN
d -0.275
dtype: float64
df.mean(axis='columns')
Out[18]:
a 1.400
b 1.300
c NaN
d -0.275
dtype: float64
Method | Description |
---|---|
axis | Axis to reduce over; 0 for DataFrame’s rows and 1 for columns |
skipna | Exclude missing values; True by default |
level | Reduce grouped by level if the axis is hierarchically indexed (MultiIndex) |
有些方法(如idxmin和idxmax)返回的是间接统计(比如达到最小值或最大值的索引),cumsum则为累计求和,describe则为汇总统计。
#!python
df
Out[19]:
one two
a 1.40 NaN
b 7.10 -4.5
c NaN NaN
d 0.75 -1.3
df.idxmax()
Out[20]:
one b
two d
dtype: object
df.cumsum()
Out[21]:
one two
a 1.40 NaN
b 8.50 -4.5
c NaN NaN
d 9.25 -5.8
df.describe()
Out[22]:
one two
count 3.000000 2.000000
mean 3.083333 -2.900000
std 3.493685 2.262742
min 0.750000 -4.500000
25% 1.075000 -3.700000
50% 1.400000 -2.900000
75% 4.250000 -2.100000
max 7.100000 -1.300000
obj = pd.Series(['a', 'a', 'b', 'c'] * 4)
obj.describe()
Out[24]:
count 16
unique 3
top a
freq 8
dtype: object
Method | Description |
---|---|
count | Number of non-NA values |
describe | Compute set of summary statistics for Series or each DataFrame column |
min, max | Compute minimum and maximum values |
argmin, argmax | Compute index locations (integers) at which minimum or maximum value obtained, respectively |
idxmin, idxmax | Compute index labels at which minimum or maximum value obtained, respectively |
quantile | Compute sample quantile ranging from 0 to 1 |
sum | Sum of values |
mean | Mean of values |
median | Arithmetic median (50% quantile) of values |
mad | Mean absolute deviation from mean value |
prod | Product of all values |
var | Sample variance of values |
std | Sample standard deviation of values |
skew | Sample skewness (third moment) of values |
kurt | Sample kurtosis (fourth moment) of values |
cumsum | Cumulative sum of values |
cummin, cummax | Cumulative minimum or maximum of values, respectively |
cumprod | Cumulative product of values |
diff | Compute first arithmetic difference (useful for time series) |
pct_change | Compute percent changes |
相关性和方差
一些汇总统计,如相关和方差,是从成对的参数程程。 让我们考虑一些来自Yahoo的股票价格和数量DataFrame! 使用附加的pandas-datareader包,
暂略
唯一值、值计数以及成员资格
还有一类方法可以从一维Series的值中抽取信息。以下面这个Series为例:
#!python
import pandas as pd
obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
uniques = obj.unique()
uniques
Out[9]: array(['c', 'a', 'd', 'b'], dtype=object)
obj.value_counts()
Out[10]:
a 3
c 3
b 2
d 1
dtype: int64
pd.value_counts(obj.values, sort=False)
Out[11]:
c 3
d 1
b 2
a 3
dtype: int64
obj
Out[12]:
0 c
1 a
2 d
3 a
4 a
5 b
6 b
7 c
8 c
dtype: object
mask = obj.isin(['b', 'c'])
mask
Out[14]:
0 True
1 False
2 False
3 False
4 False
5 True
6 True
7 True
8 True
dtype: bool
obj[mask]
Out[15]:
0 c
5 b
6 b
7 c
8 c
dtype: object
to_match = pd.Series(['c', 'a', 'b', 'b', 'c', 'a'])
unique_vals = pd.Series(['c', 'b', 'a'])
pd.Index(unique_vals).get_indexer(to_match)
Out[18]: array([0, 2, 1, 1, 0, 2], dtype=int64)
data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4], 'Qu2': [2, 3, 1, 2, 3], 'Qu3': [1, 5, 2, 4, 4]})
data
Out[20]:
Qu1 Qu2 Qu3
0 1 2 1
1 3 3 5
2 4 1 2
3 3 2 4
4 4 3 4
result = data.apply(pd.value_counts).fillna(0)
result
Out[22]:
Qu1 Qu2 Qu3
1 1.0 1.0 1.0
2 0.0 2.0 1.0
3 2.0 2.0 0.0
4 2.0 0.0 2.0
5 0.0 0.0 1.0
Method | Description |
---|---|
isin | Compute boolean array indicating whether each Series value is contained in the passed sequence ofvalues |
match | Compute integer indices for each value in an array into another array of distinct values; helpful for data |
alignment and join-type operations | |
unique | Compute array of unique values in a Series, returned in the order observed |
value_counts | Return a Series containing unique values as its index and frequencies as its values, ordered count in |
descending order |
数据清洗和准备
在进行数据分析和建模的过程中,需要花费大量的时间(80%或更多)在数据准备上:加载,清理,转换和重新排列。有时候数据存储在文件或数据库中的方式不适合特定任务的格式。
在本章中,我将讨论缺失数据,重复数据,字符串操作,和其他一些分析数据转换。在下一章中,我将重点放在组合上,并以各种方式重新排列数据集。
处理缺失数据
数值用浮点数NaN (Not a Number)表示缺失。
#!python
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
In [4]: string_data
Out[4]:
0 aardvark
1 artichoke
2 NaN
3 avocado
dtype: object
In [5]: string_data.isnull()
Out[5]:
0 False
1 False
2 True
3 False
dtype: bool
In [6]: string_data[0] = None
In [7]: string_data.isnull()
Out[7]:
0 True
1 False
2 True
3 False
dtype: bool
NA相关的处理方法
数据缺失用NA(not available)表示, python内置的None也为NA。
Argument | Description |
---|---|
dropna | Filter axis labels based on whether values for each label have missing data, with varying thresholds for how much missing data to tolerate. |
fillna | Fill in missing data with some value or using an interpolation method such as 'ffill' or 'bfill' . |
isnull | Return boolean values indicating which values are missing/NA. |
notnull | Negation of isnull . |
#!python
In [8]: from numpy import nan as NA
In [9]: data = pd.Series([1, NA, 3.5, NA, 7])
In [10]: data.dropna()
Out[10]:
0 1.0
2 3.5
4 7.0
dtype: float64
In [11]: data[data.notnull()]
Out[11]:
0 1.0
2 3.5
4 7.0
dtype: float64
In [12]: data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
....: [NA, NA, NA], [NA, 6.5, 3.]])
In [13]: cleaned = data.dropna()
In [14]: data
Out[14]:
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3.0
In [15]: cleaned
Out[15]:
0 1 2
0 1.0 6.5 3.0
In [16]: data.dropna(how='all')
Out[16]:
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
3 NaN 6.5 3.0
In [17]: data[4] = NA
In [18]: data
Out[18]:
0 1 2 4
0 1.0 6.5 3.0 NaN
1 1.0 NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN 6.5 3.0 NaN
In [19]: data.dropna((axis='columns', how='all')
Out[19]:
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3.0
how='all'要所有行都为NaN时才会删除。thresh参数可以指定NA的个数。
#!python
In [21]: df = pd.DataFrame(np.random.randn(7, 3))
In [22]: df.iloc[:4, 1] = NA
In [23]: df.iloc[:2, 2] = NA
In [24]: df
Out[24]:
0 1 2
0 -0.843340 NaN NaN
1 -1.305941 NaN NaN
2 1.026378 NaN 2.176567
3 0.048885 NaN 0.012649
4 0.591212 -0.739625 1.017533
5 0.633873 -0.124162 -0.823495
6 -1.537827 0.802565 0.359058
In [25]: df.dropna()
Out[25]:
0 1 2
4 0.591212 -0.739625 1.017533
5 0.633873 -0.124162 -0.823495
6 -1.537827 0.802565 0.359058
In [26]: df.dropna(thresh=2)
Out[26]:
0 1 2
2 1.026378 NaN 2.176567
3 0.048885 NaN 0.012649
4 0.591212 -0.739625 1.017533
5 0.633873 -0.124162 -0.823495
6 -1.537827 0.802565 0.359058
fillna用来对缺失值进行填充。可以针对列进行填充,用上一行的值填充,用平均值填充等。
#!python
In [27]: df.fillna(0)
Out[27]:
0 1 2
0 -0.843340 0.000000 0.000000
1 -1.305941 0.000000 0.000000
2 1.026378 0.000000 2.176567
3 0.048885 0.000000 0.012649
4 0.591212 -0.739625 1.017533
5 0.633873 -0.124162 -0.823495
6 -1.537827 0.802565 0.359058
In [28]: df.fillna({1: 0.5, 2: 0})
Out[28]:
0 1 2
0 -0.843340 0.500000 0.000000
1 -1.305941 0.500000 0.000000
2 1.026378 0.500000 2.176567
3 0.048885 0.500000 0.012649
4 0.591212 -0.739625 1.017533
5 0.633873 -0.124162 -0.823495
6 -1.537827 0.802565 0.359058
In [29]: _ = df.fillna(0, inplace=True)
In [30]: df
Out[30]:
0 1 2
0 -0.843340 0.000000 0.000000
1 -1.305941 0.000000 0.000000
2 1.026378 0.000000 2.176567
3 0.048885 0.000000 0.012649
4 0.591212 -0.739625 1.017533
5 0.633873 -0.124162 -0.823495
6 -1.537827 0.802565 0.359058
In [31]: df = pd.DataFrame(np.random.randn(6, 3))
In [32]: df.iloc[2:, 1] = NA
In [33]: df.iloc[4:, 2] = NA
In [34]: df
Out[34]:
0 1 2
0 -0.081265 -0.820770 -0.746845
1 1.150648 0.977842 0.861825
2 1.823679 NaN 1.272047
3 0.293133 NaN 0.273399
4 0.235116 NaN NaN
5 1.365186 NaN NaN
In [35]: df.fillna(method='ffill')
Out[35]:
0 1 2
0 -0.081265 -0.820770 -0.746845
1 1.150648 0.977842 0.861825
2 1.823679 0.977842 1.272047
3 0.293133 0.977842 0.273399
4 0.235116 0.977842 0.273399
5 1.365186 0.977842 0.273399
In [36]: df.fillna(method='ffill', limit=2)
Out[36]:
0 1 2
0 -0.081265 -0.820770 -0.746845
1 1.150648 0.977842 0.861825
2 1.823679 0.977842 1.272047
3 0.293133 0.977842 0.273399
4 0.235116 NaN 0.273399
5 1.365186 NaN 0.273399
In [37]: data = pd.Series([1., NA, 3.5, NA, 7])
In [38]: data.fillna(data.mean())
Out[38]:
0 1.000000
1 3.833333
2 3.500000
3 3.833333
4 7.000000
dtype: float64
Argument | Description |
---|---|
value | Scalar value or dict-like object to use to fill missing values |
method | Interpolation; by default 'ffill' if function called with no other arguments |
axis | Axis to fill on; default axis=0 |
inplace | Modify the calling object without producing a copy |
limit | For forward and backward filling, maximum number of consecutive periods to fill |
数据转换
去重
#!python
In [39]: data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
....: 'k2': [1, 1, 2, 3, 3, 4, 4]})
In [40]: data
Out[40]:
k1 k2
0 one 1
1 two 1
2 one 2
3 two 3
4 one 3
5 two 4
6 two 4
In [41]: data.duplicated()
Out[41]:
0 False
1 False
2 False
3 False
4 False
5 False
6 True
dtype: bool
In [42]: data.drop_duplicates()
Out[42]:
k1 k2
0 one 1
1 two 1
2 one 2
3 two 3
4 one 3
5 two 4
In [43]: data['v1'] = range(7)
In [44]: data.drop_duplicates(['k1'])
Out[44]:
k1 k2 v1
0 one 1 0
1 two 1 1
In [45]: data.drop_duplicates(['k1', 'k2'], keep='last')
Out[45]:
k1 k2 v1
0 one 1 0
1 two 1 1
2 one 2 2
3 two 3 3
4 one 3 4
6 two 4 6
使用函数或者映射(map)转换数据
#!python
import pandas as np
import pandas as pd
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
'Pastrami', 'corned beef', 'Bacon',
'pastrami', 'honey ham', 'nova lox'],
'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data
Out[5]:
food ounces
0 bacon 4.0
1 pulled pork 3.0
2 bacon 12.0
3 Pastrami 6.0
4 corned beef 7.5
5 Bacon 8.0
6 pastrami 3.0
7 honey ham 5.0
8 nova lox 6.0
meat_to_animal = {
'bacon': 'pig',
'pulled pork': 'pig',
'pastrami': 'cow',
'corned beef': 'cow',
'honey ham': 'pig',
'nova lox': 'salmon'
}
lowercased = data['food'].str.lower()
lowercased
Out[8]:
0 bacon
1 pulled pork
2 bacon
3 pastrami
4 corned beef
5 bacon
6 pastrami
7 honey ham
8 nova lox
Name: food, dtype: object
data['animal'] = lowercased.map(meat_to_animal)
data
Out[10]:
food ounces animal
0 bacon 4.0 pig
1 pulled pork 3.0 pig
2 bacon 12.0 pig
3 Pastrami 6.0 cow
4 corned beef 7.5 cow
5 Bacon 8.0 pig
6 pastrami 3.0 cow
7 honey ham 5.0 pig
8 nova lox 6.0 salmon
data['food'].map(lambda x: meat_to_animal[x.lower()])
Out[11]:
0 pig
1 pig
2 pig
3 cow
4 cow
5 pig
6 cow
7 pig
8 salmon
Name: food, dtype: object
替换
#!python
In [2]: import pandas as pd
In [3]: import numpy as np
In [4]: data = pd.Series([1., -999., 2., -999., -1000., 3.])
In [5]: data
Out[5]:
0 1.0
1 -999.0
2 2.0
3 -999.0
4 -1000.0
5 3.0
dtype: float64
In [6]: data.replace(-999, np.nan)
Out[6]:
0 1.0
1 NaN
2 2.0
3 NaN
4 -1000.0
5 3.0
dtype: float64
In [7]: data.replace([-999, -1000], np.nan)
Out[7]:
0 1.0
1 NaN
2 2.0
3 NaN
4 NaN
5 3.0
dtype: float64
In [8]: data.replace([-999, -1000], [np.nan, 0])
Out[8]:
0 1.0
1 NaN
2 2.0
3 NaN
4 0.0
5 3.0
dtype: float64
In [9]: data.replace({-999: np.nan, -1000: 0})
Out[9]:
0 1.0
1 NaN
2 2.0
3 NaN
4 0.0
5 3.0
dtype: float64
索引和列名修改
#!python
In [2]: import pandas as pd
In [3]: import numpy as np
In [10]: data = pd.DataFrame(np.arange(12).reshape((3, 4)),
....: index=['Ohio', 'Colorado', 'New York'],
....: columns=['one', 'two', 'three', 'four'])
In [11]: data
Out[11]:
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
New York 8 9 10 11
In [5]: data.replace(4, 40)
Out[5]:
one two three four
Ohio 0 1 2 3
Colorado 40 5 6 7
New York 8 9 10 11
In [12]: transform = lambda x: x[:4].upper()
In [13]: data.index.map(transform)
Out[13]: Index(['OHIO', 'COLO', 'NEW '], dtype='object')
In [14]: data
Out[14]:
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
New York 8 9 10 11
In [15]: data.index = data.index.map(transform)
In [16]: data
Out[16]:
one two three four
OHIO 0 1 2 3
COLO 4 5 6 7
NEW 8 9 10 11
In [17]: data.rename(index=str.title, columns=str.upper)
Out[17]:
ONE TWO THREE FOUR
Ohio 0 1 2 3
Colo 4 5 6 7
New 8 9 10 11
In [18]: data.rename(index={'OHIO': 'INDIANA'}, columns={'three': 'peekaboo'})
Out[18]:
one two peekaboo four
INDIANA 0 1 2 3
COLO 4 5 6 7
NEW 8 9 10 11
In [19]: data.rename(index={'OHIO': 'INDIANA'}, inplace=True)
In [20]: data
Out[20]:
one two three four
INDIANA 0 1 2 3
COLO 4 5 6 7
NEW 8 9 10 11
离散化和面元划分
以下暂略
字符串处理
#!python
In [7]: data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
...: 'Rob': 'rob@gmail.com', 'Wes': np.nan}
In [8]: data = pd.Series(data)
In [9]: data
Out[9]:
Dave dave@google.com
Rob rob@gmail.com
Steve steve@gmail.com
Wes NaN
dtype: object
In [10]: data.isnull()
Out[10]:
Dave False
Rob False
Steve False
Wes True
dtype: bool
In [11]: data.str.contains('gmail')
Out[11]:
Dave False
Rob True
Steve True
Wes NaN
dtype: object
In [12]: pattern = '([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\\.([A-Z]{2,4})'
In [13]: data.str.findall(pattern, flags=re.IGNORECASE)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-13-085c16e4dbfe> in <module>()
----> 1 data.str.findall(pattern, flags=re.IGNORECASE)
NameError: name 're' is not defined
In [14]: import re
In [15]: data.str.findall(pattern, flags=re.IGNORECASE)
Out[15]:
Dave [(dave, google, com)]
Rob [(rob, gmail, com)]
Steve [(steve, gmail, com)]
Wes NaN
dtype: object
In [16]: matches = data.str.match(pattern, flags=re.IGNORECASE)
In [17]: matches
Out[17]:
Dave True
Rob True
Steve True
Wes NaN
dtype: object
In [18]: matches.str.get(1)
Out[18]:
Dave NaN
Rob NaN
Steve NaN
Wes NaN
dtype: float64
In [19]: matches.str[0]
Out[19]:
Dave NaN
Rob NaN
Steve NaN
Wes NaN
dtype: float64
In [20]: data.str[:5]
Out[20]:
Dave dave@
Rob rob@g
Steve steve
Wes NaN
dtype: object
Method | Description |
---|---|
cat | Concatenate strings element-wise with optional delimiter |
contains | Return boolean array if each string contains pattern/regex |
count | Count occurrences of pattern |
extract | Use a regular expression with groups to extract one or more strings from a Series of strings; the result will be a DataFrame with one column per group |
endswith | Equivalent to x.endswith(pattern) for each element |
startswith | Equivalent to x.startswith(pattern) for each element |
findall | Compute list of all occurrences of pattern/regex for each string |
get | Index into each element (retrieve i-th element) |
isalnum | Equivalent to built-in str.alnum |
isalpha | Equivalent to built-in str.isalpha |
isdecimal | Equivalent to built-in str.isdecimal |
isdigit | Equivalent to built-in str.isdigit |
islower | Equivalent to built-in str.islower |
isnumeric | Equivalent to built-in str.isnumeric |
isupper | Equivalent to built-in str.isupper |
join | Join strings in each element of the Series with passed separator |
len | Compute length of each string |
lower, upper | Convert cases; equivalent to x.lower() or x.upper() for each element |
match | Use re.match with the passed regular expression on each element, returning matched groups as list |
pad | Add whitespace to left, right, or both sides of strings |
center | Equivalent to pad(side='both') |
repeat | Duplicate values (e.g., s.str.repeat(3) is equivalent to x * 3 for each string) |
replace | Replace occurrences of pattern/regex with some other string |
slice | Slice each string in the Series |
split | Split strings on delimiter or regular expression |
strip | Trim whitespace from both sides, including newlines |
rstrip | Trim whitespace on right side |
lstrip | Trim whitespace on left side |
数据争夺:连接,合并,和重塑
在许多应用程序中,数据可能分布在多个文件或数据库中,或者是以不易分析的形式。 本章重点介绍连接,合并,和重塑。
分层索引
#!python
import pandas as pd
import numpy as np
data = pd.Series(np.random.randn(9), index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],[1, 2, 3, 1, 3, 1, 2, 2, 3]])
data
Out[5]:
a 1 -1.111004
2 -0.451764
3 -0.501180
b 1 1.007739
3 0.407470
c 1 -0.307985
2 0.608742
d 2 1.432663
3 -1.660043
dtype: float64
data['b']
Out[6]:
1 1.007739
3 0.407470
dtype: float64
data['b':'c']
Out[7]:
b 1 1.007739
3 0.407470
c 1 -0.307985
2 0.608742
dtype: float64
data.loc[['b', 'd']]
Out[8]:
b 1 1.007739
3 0.407470
d 2 1.432663
3 -1.660043
dtype: float64
data.loc[:, 2]
Out[9]:
a -0.451764
c 0.608742
d 1.432663
dtype: float64
data.unstack()
Out[10]:
1 2 3
a -1.111004 -0.451764 -0.501180
b 1.007739 NaN 0.407470
c -0.307985 0.608742 NaN
d NaN 1.432663 -1.660043
data.unstack().stack()
Out[11]:
a 1 -1.111004
2 -0.451764
3 -0.501180
b 1 1.007739
3 0.407470
c 1 -0.307985
2 0.608742
d 2 1.432663
3 -1.660043
dtype: float64
frame = pd.DataFrame(np.arange(12).reshape((4, 3)),index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
columns=[['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']])
frame
Out[13]:
Ohio Colorado
Green Red Green
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']
frame
Out[16]:
state Ohio Colorado
color Green Red Green
key1 key2
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
frame['Ohio']
Out[17]:
color Green Red
key1 key2
a 1 0 1
2 3 4
b 1 6 7
2 9 10
- 重新排序和排序级别
以下暂略
联结和合并数据集
pandas.merge可根据键将不同DataFrame中的行连接起来。SQL或其他关系型数据库的用户对此应该会比较
熟悉,因为它实现的就是数据库的连接操作。pandas.concat可以沿轴将多个对象堆叠到一起。
实例方法combine_first可以将重复数据编接在一起,用一个对象中的值填充另一个对象中的缺失值。
数据库风格的DataFrame合并
#!python
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
'data1': range(7)})
df1
Out[20]:
data1 key
0 0 b
1 1 b
2 2 a
3 3 c
4 4 a
5 5 a
6 6 b
df2 = pd.DataFrame({'key': ['a', 'b', 'd'], 'data2': range(3)})
df2
Out[22]:
data2 key
0 0 a
1 1 b
2 2 d
pd.merge(df1, df2)
Out[23]:
data1 key data2
0 0 b 1
1 1 b 1
2 6 b 1
3 2 a 0
4 4 a 0
5 5 a 0
pd.merge(df1, df2, on='key')
Out[24]:
data1 key data2
0 0 b 1
1 1 b 1
2 6 b 1
3 2 a 0
4 4 a 0
5 5 a 0
df3 = pd.DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
'data1': range(7)})
df4 = pd.DataFrame({'rkey': ['a', 'b', 'd'], 'data2': range(3)})
pd.merge(df3, df4, left_on='lkey', right_on='rkey')
Out[27]:
data1 lkey data2 rkey
0 0 b 1 b
1 1 b 1 b
2 6 b 1 b
3 2 a 0 a
4 4 a 0 a
5 5 a 0 a
pd.merge(df1, df2, how='outer')
Out[28]:
data1 key data2
0 0.0 b 1.0
1 1.0 b 1.0
2 6.0 b 1.0
3 2.0 a 0.0
4 4.0 a 0.0
5 5.0 a 0.0
6 3.0 c NaN
7 NaN d 2.0
Option | Behavior |
---|---|
'inner' | Use only the key combinations observed in both tables |
'left' | Use all key combinations found in the left table |
'right' | Use all key combinations found in the right table |
'output' | Use all key combinations observed in both tables together |
上面是多对一合并,下面看下多对多
数据聚合与分组运算
对数据集进行分组并对各组应用函数(无论是聚合还是转换)是数据分析工作中的重要环节。在将数据集准备好之后,通常的任务就是计算分组统计或生成透视表。
pandas提供了灵活高效的gruopby功能,能以自然的方式对数据集进行切片、切块、摘要等操作。关系型数据库和SQL(Structured Query Language,结构化查询语言)能够如此流行的原因之一就是其能够方便地对数据进行连接、过滤、转换和聚合。但是,像SQL这样的查询语
言所能执行的分组运算的种类很有限。
在本章中你将会看到,由于Python和pandas强大的表达能力,我们可以执行复杂得多的分组运算(利用任何可以接受pandas对象或NumPy数组的函数)。在本章中,你将会学到:
- 根据一个或多个键(可以是函数、数组或DataFrame列名)拆分pandas对象。
- 计算分组摘要统计,如计数、平均值、标准差,或用户自定义函数。
- 对DataFrame的列应用各种各样的函数。
- 应用组内转换或其他运算,如规格化、线性回归、排名或选取子集等。
- 计算透视表或交叉表。
- 执行分位数分析以及其他分组分析。
分组技术
Hadley Wickham(许多热门R语言包的作者)创造了用于表示分组运算的术语"split-apply-combine"(拆分-应用-合并),我觉得这个词很好地描述了整个过程。分组运算的第一个阶段,pandas对象(无论是Series、DataFrame还是其他的)中的数据会根据你所提供的一个或多个键被拆分(split)为多组。拆分操作是在对象的特定轴上执行的。例如,DataFrame可以在其行(axis=0)或列(axis=1)上进行分组。然后,将一个函数应用(apply)到各个分组并产生一个新值。最后,所有这些函数的执行结果会被合并(combine)到最终的结果对象中。结果对象的形式一般取决于数据上所执行的操作。图9-1大致说明了一个简单的分组聚合过程。
分组键可以有多种形式,且类型不必相同:
- 列表或数组,其长度与待分组的轴一样。
- 表示DataFrame某个列名的值。
- 字典或Series,给出待分组轴上的值与分组名之间的对应关系。
- 函数,用于处理轴索引或索引中的各个标签。
注意,后三种都只是快捷方式而已,其最终目的仍然是产生一组用于拆分对象的值。如果觉得这些东西看起来很抽象,不用担心,我将在本章中给出大量有关于此的示例。首先来看看下面这个非常简单的表格型数据集(以DataFrame的形式)。
#!python
df = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],'key2' : ['one', 'two', 'one', 'two', 'one'],
'data1' : np.random.randn(5),'data2' : np.random.randn(5)})
df
Out[32]:
data1 data2 key1 key2
0 -0.592555 0.537886 a one
1 0.286764 1.498792 a two
2 -0.149658 0.847675 b one
3 0.961803 -1.218945 b two
4 0.896790 1.461441 a one
假设你想要按key1进行分组,并计算data1列的平均值。实现该功能的方式有很多,而我们这里要用的是:访问data1,并根据key1调用groupby。
#!python
grouped = df['data1'].groupby(df['key1'])
grouped
Out[34]: <pandas.core.groupby.SeriesGroupBy object at 0x000001937BF46E48>
变量grouped是一个GroupBy对象。它实际上还没有进行任何计算,只是含有一些有关分组键df['key1']的中间数据而已。换句话说,该对象已经有了接下来对各分组执行运算所需的一切信息。例如,我们可以调用GroupBy的mean方法来计算分组平均值:
#!python
grouped.mean()
Out[35]:
key1
a 0.197000
b 0.406073
Name: data1, dtype: float64
数据(Series)根据分组键进行了聚合,产生了新的Series,其索引为key1列中的唯一值。
如果我们一次传入多个数组,就会得到不同的结果:
#!python
means = df['data1'].groupby([df['key1'], df['key2']]).mean()
means
Out[38]:
key1 key2
a one 0.152117
two 0.286764
b one -0.149658
two 0.961803
Name: data1, dtype: float64
means.unstack()
Out[39]:
key2 one two
key1
a 0.152117 0.286764
b -0.149658 0.961803
states = np.array(['Ohio', 'California', 'California', 'Ohio', 'Ohio'])
years = np.array([2005, 2005, 2006, 2005, 2006])
df['data1'].groupby([states, years]).mean()
Out[42]:
California 2005 0.286764
2006 -0.149658
Ohio 2005 0.184624
2006 0.896790
Name: data1, dtype: float64
更常用的是列名(可以是字符串、数字或其他Python对象)用作分组键:
#!python
df.groupby('key1').mean()
Out[44]:
data1 data2
key1
a 0.197000 1.166040
b 0.406073 -0.185635
df.groupby(['key1', 'key2']).mean()
Out[45]:
data1 data2
key1 key2
a one 0.152117 0.999663
two 0.286764 1.498792
b one -0.149658 0.847675
two 0.961803 -1.218945
df.groupby(['key1', 'key2']).size()
Out[46]:
key1 key2
a one 2
two 1
b one 1
two 1
dtype: int64
你可能已经注意到在执行df.groupby('key1').mean()时,结果中没有key2列。这是因为df['key2']不是数值数据(俗称“麻烦列”),所以被从结果中排除了。默认情况下,所有数值列都会被聚合,虽然有时可能会被过滤为一个子集(稍后就会讲到)。分组键中的任何缺失值都会被排除在结果之外。
- 分组迭代
GroupBy对象支持迭代,可以产生一组二元元组(由分组名和数据块组成)。看看下面这个简单的数据集:
#!python
for name, group in df.groupby('key1'):
print(name)
print(group)
a
data1 data2 key1 key2
0 -0.592555 0.537886 a one
1 0.286764 1.498792 a two
4 0.896790 1.461441 a one
b
data1 data2 key1 key2
2 -0.149658 0.847675 b one
3 0.961803 -1.218945 b two
对于多重键的情况,元组的第一个元素将会是由键值组成的元组: