Pandas数据规整 - 转换
In [1]:
import numpy as np
import pandas as pd
Pandas数据排序
.sort_index() 在指定轴上根据索引进行排序,索引排序后内容会跟随排序
In [2]:
b = pd.DataFrame(np.arange(20).reshape(4,5),index=['c','a','d','b'])
b
Out[2]:
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
c | 0 | 1 | 2 | 3 | 4 |
a | 5 | 6 | 7 | 8 | 9 |
d | 10 | 11 | 12 | 13 | 14 |
b | 15 | 16 | 17 | 18 | 19 |
sort_index() 按索引排序
In [3]:
b.sort_index() # 默认按行索引排序,默认升序
Out[3]:
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
a | 5 | 6 | 7 | 8 | 9 |
b | 15 | 16 | 17 | 18 | 19 |
c | 0 | 1 | 2 | 3 | 4 |
d | 10 | 11 | 12 | 13 | 14 |
In [4]:
b.sort_index(axis=1, ascending=False) # 按列索引排序,降序
Out[4]:
4 | 3 | 2 | 1 | 0 | |
---|---|---|---|---|---|
c | 4 | 3 | 2 | 1 | 0 |
a | 9 | 8 | 7 | 6 | 5 |
d | 14 | 13 | 12 | 11 | 10 |
b | 19 | 18 | 17 | 16 | 15 |
sort_values() 按值排序
In [5]:
dates = pd.date_range('20130101', periods = 10)
dates
df = pd.DataFrame(np.random.randn(10,4), index = dates, columns = ['A','B','C','D'])
df
Out[5]:
A | B | C | D | |
---|---|---|---|---|
2013-01-01 | 1.445132 | -1.833477 | -0.069886 | 0.657315 |
2013-01-02 | -0.141620 | 1.018739 | -0.321325 | 0.317470 |
2013-01-03 | -2.019614 | 0.401244 | 2.292575 | -0.255049 |
2013-01-04 | -0.183211 | 0.291449 | 1.817465 | 1.086226 |
2013-01-05 | 1.796471 | -0.178786 | -0.110507 | -0.585961 |
2013-01-06 | 0.307960 | -0.205638 | 0.814343 | 0.416364 |
2013-01-07 | -0.981729 | 1.837534 | -0.961424 | 0.059630 |
2013-01-08 | -0.151302 | 0.367162 | -0.216039 | -0.664508 |
2013-01-09 | 1.476126 | 0.699722 | 0.738357 | 0.177397 |
2013-01-10 | -0.672169 | -0.641054 | 1.061853 | 0.262266 |
In [7]:
# 默认按行排序(这一列的所有行)
df.sort_values(by='A') # 指定排序基准列
df.sort_values(by='A', ascending=False) # 倒序
Out[7]:
A | B | C | D | |
---|---|---|---|---|
2013-01-05 | 1.796471 | -0.178786 | -0.110507 | -0.585961 |
2013-01-09 | 1.476126 | 0.699722 | 0.738357 | 0.177397 |
2013-01-01 | 1.445132 | -1.833477 | -0.069886 | 0.657315 |
2013-01-06 | 0.307960 | -0.205638 | 0.814343 | 0.416364 |
2013-01-02 | -0.141620 | 1.018739 | -0.321325 | 0.317470 |
2013-01-08 | -0.151302 | 0.367162 | -0.216039 | -0.664508 |
2013-01-04 | -0.183211 | 0.291449 | 1.817465 | 1.086226 |
2013-01-10 | -0.672169 | -0.641054 | 1.061853 | 0.262266 |
2013-01-07 | -0.981729 | 1.837534 | -0.961424 | 0.059630 |
2013-01-03 | -2.019614 | 0.401244 | 2.292575 | -0.255049 |
In [8]:
# 按列排序(一行的所有列)
df.sort_values(axis=1, by='2013-01-01', ascending=False)
Out[8]:
A | D | C | B | |
---|---|---|---|---|
2013-01-01 | 1.445132 | 0.657315 | -0.069886 | -1.833477 |
2013-01-02 | -0.141620 | 0.317470 | -0.321325 | 1.018739 |
2013-01-03 | -2.019614 | -0.255049 | 2.292575 | 0.401244 |
2013-01-04 | -0.183211 | 1.086226 | 1.817465 | 0.291449 |
2013-01-05 | 1.796471 | -0.585961 | -0.110507 | -0.178786 |
2013-01-06 | 0.307960 | 0.416364 | 0.814343 | -0.205638 |
2013-01-07 | -0.981729 | 0.059630 | -0.961424 | 1.837534 |
2013-01-08 | -0.151302 | -0.664508 | -0.216039 | 0.367162 |
2013-01-09 | 1.476126 | 0.177397 | 0.738357 | 0.699722 |
2013-01-10 | -0.672169 | 0.262266 | 1.061853 | -0.641054 |
关于排序中的缺失值问题
排序不论升序降序,缺失值永远排在最后
In [9]:
a = pd.DataFrame(np.arange(12).reshape(3,4), index=['a','b','c'])
a
b = pd.DataFrame(np.arange(20).reshape(4,5), index=['c','a','d','b'])
b
c = a + b
c
Out[9]:
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
a | 5.0 | 7.0 | 9.0 | 11.0 | NaN |
b | 19.0 | 21.0 | 23.0 | 25.0 | NaN |
c | 8.0 | 10.0 | 12.0 | 14.0 | NaN |
d | NaN | NaN | NaN | NaN | NaN |
In [11]:
c.sort_values(by=0) # 升序,缺失值在最后
c.sort_values(by=0, ascending=False) # 降序,缺失值还在最后
Out[11]:
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
b | 19.0 | 21.0 | 23.0 | 25.0 | NaN |
c | 8.0 | 10.0 | 12.0 | 14.0 | NaN |
a | 5.0 | 7.0 | 9.0 | 11.0 | NaN |
d | NaN | NaN | NaN | NaN | NaN |
随机排列和随机采样
随机排列
利用numpy.random.permutation函数可以实现对Series或DataFrame的列的随机排序工作(permuting,随机重排序)
通过需要排列的轴的长度调用permutation,可产生一个表示新顺序的整数数组:
In [23]:
# 随机排列序列
a = [1,2,3,4,5,6,7]
a
np.random.permutation(a)
Out[23]:
array([6, 3, 4, 7, 5, 1, 2])
In [24]:
df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))
df
Out[24]:
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | 0 | 1 | 2 | 3 |
1 | 4 | 5 | 6 | 7 |
2 | 8 | 9 | 10 | 11 |
3 | 12 | 13 | 14 | 15 |
4 | 16 | 17 | 18 | 19 |
In [25]:
df.index
df.columns
Out[25]:
RangeIndex(start=0, stop=4, step=1)
In [26]:
# 打乱行索引
np.random.permutation(df.index)
Out[26]:
array([4, 3, 2, 1, 0], dtype=int64)
In [38]:
df.loc[[2,3,4,1,0]]
df.loc[np.random.permutation(df.index)]
Out[38]:
0 | 1 | 2 | 3 | |
---|---|---|---|---|
2 | 8 | 9 | 10 | 11 |
4 | 16 | 17 | 18 | 19 |
0 | 0 | 1 | 2 | 3 |
3 | 12 | 13 | 14 | 15 |
1 | 4 | 5 | 6 | 7 |
随机重排行索引和列索引
In [47]:
index = np.random.permutation(df.index)
index
columns = np.random.permutation(df.columns)
columns
df.loc[index, columns]
Out[47]:
0 | 3 | 2 | 1 | |
---|---|---|---|---|
0 | 0 | 3 | 2 | 1 |
1 | 4 | 7 | 6 | 5 |
3 | 12 | 15 | 14 | 13 |
4 | 16 | 19 | 18 | 17 |
2 | 8 | 11 | 10 | 9 |
随机采样
choice(),从一个序列中随机抽取某些值
In [95]:
a = [1,2,3,4,5,6,7]
np.random.choice(a)
np.random.choice(a, size=3) # 放回抽样
np.random.choice(a, size=3, replace=False) # 不放回抽样,不重复抽取
Out[95]:
array([1, 7, 6])
随机采样
In [96]:
df
Out[96]:
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | 0 | 1 | 2 | 3 |
1 | 4 | 5 | 6 | 7 |
2 | 8 | 9 | 10 | 11 |
3 | 12 | 13 | 14 | 15 |
4 | 16 | 17 | 18 | 19 |
In [106]:
# 按行随机采样
df.index
index = np.random.choice(df.index, size=3, replace=False)
df.loc[index]
Out[106]:
0 | 1 | 2 | 3 | |
---|---|---|---|---|
3 | 12 | 13 | 14 | 15 |
0 | 0 | 1 | 2 | 3 |
1 | 4 | 5 | 6 | 7 |
In [118]:
# 按行、按列随机采样
df.index
df.columns
index = np.random.choice(df.index, size=3, replace=False)
columns = np.random.choice(df.columns, size=2, replace=False)
df.loc[index, columns]
Out[118]:
0 | 3 | |
---|---|---|
2 | 8 | 11 |
1 | 4 | 7 |
3 | 12 | 15 |
重新索引(修改索引)
reindex() ,重新索引,创建一个适应新索引的新对象
一种变相的查询方式,类似在查询中加入新行新列
- rename是将原索引某些值替换为新值
- reindex则是将整个索引重建(并不替换索引值,而是增减索引或改变顺序,原索引对应的值关系不变)
In [119]:
obj = pd.Series([4.5,7,2,-5.3], index = ['d','b','a','c'])
obj
Out[119]:
d 4.5
b 7.0
a 2.0
c -5.3
dtype: float64
In [120]:
obj.index
obj.index.values
Out[120]:
array(['d', 'b', 'a', 'c'], dtype=object)
直接修改索引
有问题,不合适
- 索引修改后,值没有跟着变化
- 修改值必须和原索引长度保持一致,不能增加或删除索引
In [121]:
obj.index = ['a','b','c','d']
obj
Out[121]:
a 4.5
b 7.0
c 2.0
d -5.3
dtype: float64
In [230]:
# obj.index = ['a','b','c'] # 长度不一致,报错
使用rename修改索引
In [122]:
obj
Out[122]:
a 4.5
b 7.0
c 2.0
d -5.3
dtype: float64
In [123]:
obj.rename({'a': 'aa', 'b': 'ccc', 'x': 'xx'})
Out[123]:
aa 4.5
ccc 7.0
c 2.0
d -5.3
dtype: float64
正规做法:使用reindex()修改索引
In [125]:
obj
Out[125]:
a 4.5
b 7.0
c 2.0
d -5.3
dtype: float64
In [126]:
obj.reindex(['b','d','a','c','e'])
Out[126]:
b 7.0
d -5.3
a 4.5
c 2.0
e NaN
dtype: float64
In [127]:
# 新增的缺失值,填充默认值
obj.reindex(['b','d','a','c','e'], fill_value=0)
Out[127]:
b 7.0
d -5.3
a 4.5
c 2.0
e 0.0
dtype: float64
实际应用
In [128]:
obj2 = pd.Series(['blue','purple','yellow'], index = [0,2,4])
obj2
Out[128]:
0 blue
2 purple
4 yellow
dtype: object
In [129]:
range(6)
Out[129]:
range(0, 6)
In [130]:
obj2.reindex(range(6))
Out[130]:
0 blue
1 NaN
2 purple
3 NaN
4 yellow
5 NaN
dtype: object
In [131]:
# 填充缺失值,指定填充值
obj2.reindex(range(6), fill_value=0)
Out[131]:
0 blue
1 0
2 purple
3 0
4 yellow
5 0
dtype: object
In [133]:
# 前向,后向填充
obj2.reindex(range(6), method='ffill')
obj2.reindex(range(6), method='bfill')
Out[133]:
0 blue
1 purple
2 purple
3 yellow
4 yellow
5 NaN
dtype: object
DataFrame的索引重建
In [134]:
frame = pd.DataFrame(np.random.randint(0,100,(3,3)), index = ['语文','数学','英语'], columns = ['张三','李四','王五'])
frame
Out[134]:
张三 | 李四 | 王五 | |
---|---|---|---|
语文 | 27 | 77 | 36 |
数学 | 92 | 19 | 5 |
英语 | 34 | 95 | 87 |
In [136]:
# 重建行索引
frame.reindex(['语文', '数学', '英语'])
frame.reindex(['语文', '编程', '英语'])
Out[136]:
张三 | 李四 | 王五 | |
---|---|---|---|
语文 | 27.0 | 77.0 | 36.0 |
编程 | NaN | NaN | NaN |
英语 | 34.0 | 95.0 | 87.0 |
In [138]:
# 重建列索引
frame.reindex(['张三', '赵六', '王五'], axis=1)
frame.reindex(columns=['张三', '赵六', '王五'])
Out[138]:
张三 | 赵六 | 王五 | |
---|---|---|---|
语文 | 27 | NaN | 36 |
数学 | 92 | NaN | 5 |
英语 | 34 | NaN | 87 |
In [139]:
# 重建行列索引
frame.reindex(index=['语文', '编程', '英语'], columns=['张三', '赵六', '王五'])
Out[139]:
张三 | 赵六 | 王五 | |
---|---|---|---|
语文 | 27.0 | NaN | 36.0 |
编程 | NaN | NaN | NaN |
英语 | 34.0 | NaN | 87.0 |
In [140]:
frame.loc[['语文', '编程', '英语'], ['张三', '赵六', '王五']] # 警告,建议使用reindex方法
e:\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: FutureWarning:
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.
See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
"""Entry point for launching an IPython kernel.
Out[140]:
张三 | 赵六 | 王五 | |
---|---|---|---|
语文 | 27.0 | NaN | 36.0 |
编程 | NaN | NaN | NaN |
英语 | 34.0 | NaN | 87.0 |
例2
In [141]:
frame.index = [1,2,4]
frame.columns = [1,2,3]
frame
Out[141]:
1 | 2 | 3 | |
---|---|---|---|
1 | 27 | 77 | 36 |
2 | 92 | 19 | 5 |
4 | 34 | 95 | 87 |
In [142]:
# 重建行列索引
frame.reindex(index=[1,2,10,4], columns=[3,1,5,2])
Out[142]:
3 | 1 | 5 | 2 | |
---|---|---|---|---|
1 | 36.0 | 27.0 | NaN | 77.0 |
2 | 5.0 | 92.0 | NaN | 19.0 |
10 | NaN | NaN | NaN | NaN |
4 | 87.0 | 34.0 | NaN | 95.0 |
In [252]:
# frame.loc[[1,2,10,4], [3,1,5,2]] # 弃用写法,使用上面写法
In [143]:
frame.reindex([1,2,10,4])
frame.reindex([1,2,10,4], method='ffill') # 前向填充,按实际行索引排序填充
Out[143]:
1 | 2 | 3 | |
---|---|---|---|
1 | 27 | 77 | 36 |
2 | 92 | 19 | 5 |
10 | 34 | 95 | 87 |
4 | 34 | 95 | 87 |
带有重复值的轴索引
许多Pandas函数要求标签唯一,但这不是强制的
In [144]:
obj = pd.Series(range(5), index=['a','a','b','b','c'])
obj
Out[144]:
a 0
a 1
b 2
b 3
c 4
dtype: int64
In [145]:
obj['a']
Out[145]:
a 0
a 1
dtype: int64
In [258]:
obj['a'].values
obj['a'].values[0]
Out[258]:
0
In [146]:
# 索引的is_unique属性判断索引值是否唯一
obj.index.is_unique
Out[146]:
False