Pandas
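Introduction to Pandas
The main data structures Pandas works with
·Series
·DataFrame
·Panel
·These structures are built on NumPy arrays, so operations on them are fast.
Dimensions and description
·A helpful way to think about these data structures is that a higher-dimensional structure is a container for the lower-dimensional one.
·For example, a DataFrame is a container of Series, and a Panel is a container of DataFrames.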
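Data structure | Dimensions | Description |
---|---|---|
Series | 1 | 1D labeled homogeneous array; size-immutable |
DataFrame | 2 | General 2D labeled, size-mutable tabular structure with potentially heterogeneously typed columns |
Panel | 3 | General 3D labeled, size-mutable array |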
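The container idea can be seen directly in a small sketch. The names `heights`, `ages`, and `people` and their values are made up for illustration; they are not part of the tutorial's data.

```python
import pandas as pd

# Two labeled 1D Series sharing the same index (hypothetical example data).
heights = pd.Series([1.62, 1.80, 1.75], index=["ann", "bob", "cid"])
ages = pd.Series([31, 45, 27], index=["ann", "bob", "cid"])

# A DataFrame holds the Series as its columns.
people = pd.DataFrame({"height": heights, "age": ages})

# Pulling a column back out returns a Series again.
age_column = people["age"]
print(type(age_column).__name__)   # Series
print(heights.ndim, people.ndim)   # 1 2
```

Each column keeps its own dtype, which is why a DataFrame can mix types while a single Series cannot.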
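Mutability
·All Pandas data structures are value-mutable (their contents can be modified), and all except Series are size-mutable.
·DataFrame is widely used and is one of the most important data structures.
·The Panel data structure is used far less often (it was removed from pandas in version 0.25).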
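Series
A Series is a one-dimensional array structure holding homogeneous data
·For example, the following Series is a collection of the integers 10, 23, 56, ...
·Key points
·Homogeneous data
·Size-immutable
·Values are mutable
10 | 23 | 56 | 17 | 52 | 61 | 73 | 90 | 26 | 72 |
---|---|---|---|---|---|---|---|---|---|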
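As a minimal sketch of the key points above, the same ten integers can be put into a Series. The variable name `s` is an arbitrary choice for this example.

```python
import pandas as pd

# The ten integers from the table above, stored as one Series.
s = pd.Series([10, 23, 56, 17, 52, 61, 73, 90, 26, 72])

# Homogeneous data: a single dtype covers every element
# (int64 or int32, depending on the platform).
print(s.dtype)

# Values are mutable: overwrite the first element in place.
s.iloc[0] = 11
print(s.iloc[0])   # 11
print(len(s))      # 10
```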
DataFrame
·A DataFrame is a two-dimensional array with heterogeneous data: each column can have its own data type.
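A short sketch of what "heterogeneous" means in practice. The frame `df_mixed` and its column names are hypothetical and chosen only for this illustration.

```python
import pandas as pd

# Three columns with three different kinds of data (hypothetical values).
df_mixed = pd.DataFrame({
    "name": ["ann", "bob"],      # text
    "score": [91.5, 78.0],       # floating point
    "passed": [True, True],      # boolean
})

# dtypes lists one data type per column.
print(df_mixed.dtypes)
print(df_mixed.shape)   # (2, 3)
```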
Getting started with Pandas
Creating objects
·Create a Series by passing a list of values; Pandas assigns a default integer index:
import pandas as pd
import numpy as np
s1 = pd.Series(np.arange(5))
s2 = pd.Series([1,3,5,np.nan,6,8])
print(s1,s2)
0 0
1 1
2 2
3 3
4 4
dtype: int32 0 1.0
1 3.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64
·Create a DataFrame from a NumPy array, with a datetime index and labeled columns:
dates = pd.date_range('20190301',periods=7)
print(dates)
print('--'*25)
df = pd.DataFrame(np.random.randn(7,4),index = dates,columns = list('ABCD'))
print(df)
DatetimeIndex(['2019-03-01', '2019-03-02', '2019-03-03', '2019-03-04',
'2019-03-05', '2019-03-06', '2019-03-07'],
dtype='datetime64[ns]', freq='D')
--------------------------------------------------
A B C D
2019-03-01 -2.020410 -0.924757 -1.888649 -0.051893
2019-03-02 -0.976324 -0.684672 -0.837020 -0.772484
2019-03-03 0.353127 -0.652408 0.557873 -0.678637
2019-03-04 0.135563 0.092274 -0.148957 -2.058148
2019-03-05 -0.117025 -0.202763 0.566309 -1.775363
2019-03-06 0.253763 -0.209275 -0.503625 -0.399764
2019-03-07 -0.307063 0.897490 1.056798 -0.901982
Create a DataFrame by passing a dict of objects that can each be converted to a Series:
df2 = pd.DataFrame({'A':1.,
'B':pd.Timestamp('20190302'),
'C':pd.Series(1,index=list(range(4)),dtype='float32'),
'D':np.array([3]*4,dtype='int32'),
'E':pd.Categorical(['test','train','test','train']),
'F':'foo'})
print(df2)
A B C D E F
0 1.0 2019-03-02 1.0 3 test foo
1 1.0 2019-03-02 1.0 3 train foo
2 1.0 2019-03-02 1.0 3 test foo
3 1.0 2019-03-02 1.0 3 train foo
Viewing data
head, tail: view the rows at the top and bottom of the frame
print('head: \n',df.head())
print('-'*50)
print('Tail: \n',df.tail(3))
head:
A B C D
2019-03-01 -2.020410 -0.924757 -1.888649 -0.051893
2019-03-02 -0.976324 -0.684672 -0.837020 -0.772484
2019-03-03 0.353127 -0.652408 0.557873 -0.678637
2019-03-04 0.135563 0.092274 -0.148957 -2.058148
2019-03-05 -0.117025 -0.202763 0.566309 -1.775363
--------------------------------------------------
Tail:
A B C D
2019-03-05 -0.117025 -0.202763 0.566309 -1.775363
2019-03-06 0.253763 -0.209275 -0.503625 -0.399764
2019-03-07 -0.307063 0.897490 1.056798 -0.901982
index, columns, values: show the index, the columns, and the underlying NumPy data:
print('index is: ')
print(df.index)
print('columns is: ')
print(df.columns)
print('value is: ')
print(df.values)
index is:
DatetimeIndex(['2019-03-01', '2019-03-02', '2019-03-03', '2019-03-04',
'2019-03-05', '2019-03-06', '2019-03-07'],
dtype='datetime64[ns]', freq='D')
columns is:
Index(['A', 'B', 'C', 'D'], dtype='object')
value is:
[[-2.02041037 -0.92475674 -1.88864928 -0.05189268]
[-0.97632394 -0.68467222 -0.83701968 -0.77248437]
[ 0.3531265 -0.6524075 0.55787276 -0.67863676]
[ 0.13556278 0.09227419 -0.14895721 -2.05814846]
[-0.11702479 -0.20276259 0.56630908 -1.77536338]
[ 0.2537632 -0.20927471 -0.50362523 -0.3997636 ]
[-0.30706349 0.89749039 1.05679755 -0.90198211]]
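describe, info: describe() shows a quick statistical summary of the data; info() shows the index, column dtypes, non-null counts, and memory usage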
df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 7 entries, 2019-03-01 to 2019-03-07
Freq: D
Data columns (total 4 columns):
A 7 non-null float64
B 7 non-null float64
C 7 non-null float64
D 7 non-null float64
dtypes: float64(4)
memory usage: 280.0 bytes
print(df.describe())
A B C D
count 7.000000 7.000000 7.000000 7.000000
mean -0.382624 -0.240587 -0.171039 -0.948324
std 0.849108 0.611463 1.007256 0.721805
min -2.020410 -0.924757 -1.888649 -2.058148
25% -0.641694 -0.668540 -0.670322 -1.338673
50% -0.117025 -0.209275 -0.148957 -0.772484
75% 0.194663 -0.055244 0.562091 -0.539200
max 0.353127 0.897490 1.056798 -0.051893
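To make the rows of the summary less opaque, the sketch below rebuilds three of them by hand. It uses its own seeded random frame (`demo`, an assumed name) rather than the tutorial's `df`, so the numbers differ from the output above.

```python
import numpy as np
import pandas as pd

# A reproducible 7x4 frame of standard-normal values (not the tutorial's df).
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.standard_normal((7, 4)), columns=list("ABCD"))

summary = demo.describe()

# Each summary row matches the corresponding per-column method.
mean_ok = np.allclose(summary.loc["mean"], demo.mean())
std_ok = np.allclose(summary.loc["std"], demo.std())       # sample std (ddof=1)
median_ok = np.allclose(summary.loc["50%"], demo.median())
print(mean_ok, std_ok, median_ok)
```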
df.T: transpose the data (swap rows and columns)
df.T
2019-03-01 2019-03-02 2019-03-03 2019-03-04 2019-03-05 2019-03-06 2019-03-07
A -2.020410 -0.976324 0.353127 0.135563 -0.117025 0.253763 -0.307063
B -0.924757 -0.684672 -0.652408 0.092274 -0.202763 -0.209275 0.897490
C -1.888649 -0.837020 0.557873 -0.148957 0.566309 -0.503625 1.056798
D -0.051893 -0.772484 -0.678637 -2.058148 -1.775363 -0.399764 -0.901982
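A tiny sketch of what transposition does to shape and labels. The frame `small` is a made-up example.

```python
import pandas as pd

# 3 rows x 2 columns (hypothetical values).
small = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=["x", "y", "z"])

flipped = small.T

# Rows and columns trade places, so the shape is reversed.
print(small.shape, flipped.shape)   # (3, 2) (2, 3)

# The value at (row "y", column "A") moves to (row "A", column "y").
print(flipped.loc["A", "y"] == small.loc["y", "A"])   # True
```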
df.sort_index(): sort along an axis
print(df.sort_index(axis=1,ascending=False))  # axis=1 sorts by column labels
D C B A
2019-03-01 -0.051893 -1.888649 -0.924757 -2.020410
2019-03-02 -0.772484 -0.837020 -0.684672 -0.976324
2019-03-03 -0.678637 0.557873 -0.652408 0.353127
2019-03-04 -2.058148 -0.148957 0.092274 0.135563
2019-03-05 -1.775363 0.566309 -0.202763 -0.117025
2019-03-06 -0.399764 -0.503625 -0.209275 0.253763
2019-03-07 -0.901982 1.056798 0.897490 -0.307063
print(df.sort_index(axis=0,ascending=False))
A B C D
2019-03-07 -0.307063 0.897490 1.056798 -0.901982
2019-03-06 0.253763 -0.209275 -0.503625 -0.399764
2019-03-05 -0.117025 -0.202763 0.566309 -1.775363
2019-03-04 0.135563 0.092274 -0.148957 -2.058148
2019-03-03 0.353127 -0.652408 0.557873 -0.678637
2019-03-02 -0.976324 -0.684672 -0.837020 -0.772484
2019-03-01 -2.020410 -0.924757 -1.888649 -0.051893
print(df)  # sort_index returns a new object; df itself is unchanged
A B C D
2019-03-01 -2.020410 -0.924757 -1.888649 -0.051893
2019-03-02 -0.976324 -0.684672 -0.837020 -0.772484
2019-03-03 0.353127 -0.652408 0.557873 -0.678637
2019-03-04 0.135563 0.092274 -0.148957 -2.058148
2019-03-05 -0.117025 -0.202763 0.566309 -1.775363
2019-03-06 0.253763 -0.209275 -0.503625 -0.399764
2019-03-07 -0.307063 0.897490 1.056798 -0.901982
df.sort_values(): sort by values
print(df.sort_values('C'))  # ascending by default; pass ascending=False for descending; same as df.sort_values(by='C')
A B C D
2019-03-01 -2.020410 -0.924757 -1.888649 -0.051893
2019-03-02 -0.976324 -0.684672 -0.837020 -0.772484
2019-03-06 0.253763 -0.209275 -0.503625 -0.399764
2019-03-04 0.135563 0.092274 -0.148957 -2.058148
2019-03-03 0.353127 -0.652408 0.557873 -0.678637
2019-03-05 -0.117025 -0.202763 0.566309 -1.775363
2019-03-07 -0.307063 0.897490 1.056798 -0.901982
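The comment above mentions `ascending=False`; the sketch below shows it, along with sorting by more than one column. The `scores` frame is hypothetical.

```python
import pandas as pd

# Hypothetical data: two teams with a few point totals each.
scores = pd.DataFrame({"team": ["b", "a", "b", "a"], "points": [3, 7, 9, 1]})

# Descending sort on one column.
desc = scores.sort_values("points", ascending=False)
print(desc["points"].tolist())   # [9, 7, 3, 1]

# Sort by team ascending, then by points descending within each team.
multi = scores.sort_values(by=["team", "points"], ascending=[True, False])
print(multi.index.tolist())      # [1, 3, 2, 0]
```

As with sort_index, both calls return new frames and leave `scores` in its original order.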
Basic selection
Selecting a single column returns a Series
a = df['A']
print(a)
print(type(a))
2019-03-01 -2.020410
2019-03-02 -0.976324
2019-03-03 0.353127
2019-03-04 0.135563
2019-03-05 -0.117025
2019-03-06 0.253763
2019-03-07 -0.307063
Freq: D, Name: A, dtype: float64
<class 'pandas.core.series.Series'>
Slice rows with the [] operator
print(df[0:3])  # rows by position: the end position is excluded
print(df['2019-03-02':'2019-03-04'])  # rows by label range: both endpoints are included
A B C D
2019-03-01 -2.020410 -0.924757 -1.888649 -0.051893
2019-03-02 -0.976324 -0.684672 -0.837020 -0.772484
2019-03-03 0.353127 -0.652408 0.557873 -0.678637
A B C D
2019-03-02 -0.976324 -0.684672 -0.837020 -0.772484
2019-03-03 0.353127 -0.652408 0.557873 -0.678637
2019-03-04 0.135563 0.092274 -0.148957 -2.058148
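The difference between the two slices is easy to miss, so here is a small sketch with countable values. The frame `frame` and its column `v` are assumptions made for this example.

```python
import pandas as pd

# Five consecutive days, with v = 0..4 so rows are easy to identify.
idx = pd.date_range("2019-03-01", periods=5)
frame = pd.DataFrame({"v": range(5)}, index=idx)

# Integer slice: positional, end excluded (rows 0, 1, 2).
by_position = frame[0:3]
print(by_position["v"].tolist())   # [0, 1, 2]

# Label slice: both endpoints included (March 2, 3 and 4).
by_label = frame["2019-03-02":"2019-03-04"]
print(by_label["v"].tolist())      # [1, 2, 3]
```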
df.loc[]: get a cross-section by label
print(df.loc[dates[0]])
A -2.020410
B -0.924757
C -1.888649
D -0.051893
Name: 2019-03-01 00:00:00, dtype: float64
df.loc[]: select along multiple axes by label
print(df.loc[:,['A','B']])
print(df.A)
A B
2019-03-01 -2.020410 -0.924757
2019-03-02 -0.976324 -0.684672
2019-03-03 0.353127 -0.652408
2019-03-04 0.135563 0.092274
2019-03-05 -0.117025 -0.202763
2019-03-06 0.253763 -0.209275
2019-03-07 -0.307063 0.897490
2019-03-01 -2.020410
2019-03-02 -0.976324
2019-03-03 0.353127
2019-03-04 0.135563
2019-03-05 -0.117025
2019-03-06 0.253763
2019-03-07 -0.307063
Freq: D, Name: A, dtype: float64
df.loc[]: label slicing, both endpoints included
print(df.loc['20190301':'20190305','A':'C'])
print(df.loc['20190301':'20190305',['A','B','C']])  # both forms give the same result
A B C
2019-03-01 -2.020410 -0.924757 -1.888649
2019-03-02 -0.976324 -0.684672 -0.837020
2019-03-03 0.353127 -0.652408 0.557873
2019-03-04 0.135563 0.092274 -0.148957
2019-03-05 -0.117025 -0.202763 0.566309
A B C
2019-03-01 -2.020410 -0.924757 -1.888649
2019-03-02 -0.976324 -0.684672 -0.837020
2019-03-03 0.353127 -0.652408 0.557873
2019-03-04 0.135563 0.092274 -0.148957
2019-03-05 -0.117025 -0.202763 0.566309
df.loc[] / df.at[]: get a single scalar value
print(df.loc[dates[0],'A'])
print(df.at[dates[0],'A'])
-2.0204103737371066
-2.0204103737371066
df.iloc[] / df.iat[]: select by position
print(df)
print('-'*50)
print('df.iloc[0,0]: \n',df.iloc[0,0])
print('-'*50)
print('df.iat[0,0]: \n',df.iat[0,0])  # same value as df.iloc[0,0]
print('-'*50)
print('df.iloc[0]: ',df.iloc[0])
print('-'*50)
print('df.iloc[:,0]: \n',df.iloc[:,0])
print('-'*50)
print('df.iloc[0:2,3:5]: \n',df.iloc[0:2,3:5])
print('-'*50)
print('df.iloc[[1,3,6],[0,2,3]]: \n',df.iloc[[1,3,6],[0,2,3]])
A B C D
2019-03-01 -2.020410 -0.924757 -1.888649 -0.051893
2019-03-02 -0.976324 -0.684672 -0.837020 -0.772484
2019-03-03 0.353127 -0.652408 0.557873 -0.678637
2019-03-04 0.135563 0.092274 -0.148957 -2.058148
2019-03-05 -0.117025 -0.202763 0.566309 -1.775363
2019-03-06 0.253763 -0.209275 -0.503625 -0.399764
2019-03-07 -0.307063 0.897490 1.056798 -0.901982
--------------------------------------------------
df.iloc[0,0]:
-2.0204103737371066
--------------------------------------------------
df.iat[0,0]:
-2.0204103737371066
--------------------------------------------------
df.iloc[0]: A -2.020410
B -0.924757
C -1.888649
D -0.051893
Name: 2019-03-01 00:00:00, dtype: float64
--------------------------------------------------
df.iloc[:,0]:
2019-03-01 -2.020410
2019-03-02 -0.976324
2019-03-03 0.353127
2019-03-04 0.135563
2019-03-05 -0.117025
2019-03-06 0.253763
2019-03-07 -0.307063
Freq: D, Name: A, dtype: float64
--------------------------------------------------
df.iloc[0:2,3:5]:
D
2019-03-01 -0.051893
2019-03-02 -0.772484
--------------------------------------------------
df.iloc[[1,3,6],[0,2,3]]:
A C D
2019-03-02 -0.976324 -0.837020 -0.772484
2019-03-04 0.135563 -0.148957 -2.058148
2019-03-07 -0.307063 1.056798 -0.901982
Boolean indexing: use a condition on a single column to select rows
print(df[df.B>0])
A B C D
2019-03-04 0.135563 0.092274 -0.148957 -2.058148
2019-03-07 -0.307063 0.897490 1.056798 -0.901982
Boolean indexing: keep only the values of the DataFrame that satisfy a condition (all others become NaN):
print(df[df>0])
A B C D
2019-03-01 NaN NaN NaN NaN
2019-03-02 NaN NaN NaN NaN
2019-03-03 0.353127 NaN 0.557873 NaN
2019-03-04 0.135563 0.092274 NaN NaN
2019-03-05 NaN NaN 0.566309 NaN
2019-03-06 0.253763 NaN NaN NaN
2019-03-07 NaN 0.897490 1.056798 NaN
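Conditions can also be combined. The sketch below, on a made-up frame `data`, uses `&` and `|` (each condition needs its own parentheses) and shows that `where` gives the same masking as `data[data > 0]`.

```python
import pandas as pd

# Hypothetical values chosen so each row behaves differently.
data = pd.DataFrame({"A": [-1.0, 0.5, 2.0, -3.0], "B": [0.2, -0.4, 1.5, 0.9]})

# Rows where both columns are positive.
both = data[(data["A"] > 0) & (data["B"] > 0)]
print(both.index.tolist())     # [2]

# Rows where at least one column is positive.
either = data[(data["A"] > 0) | (data["B"] > 0)]
print(either.index.tolist())   # [0, 1, 2, 3]

# Element-wise masking: non-matching cells become NaN.
masked = data.where(data > 0)
print(masked.equals(data[data > 0]))   # True
```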
isin(): filter data (takes one argument, a list-like such as a tuple or a list)
df2 = df.copy()
df2['E'] = ['one','two','three','four','five','six','seven']
print(df2)
print(df2[df2['E'].isin(('one','two'))])
A B C D E
2019-03-01 -2.020410 -0.924757 -1.888649 -0.051893 one
2019-03-02 -0.976324 -0.684672 -0.837020 -0.772484 two
2019-03-03 0.353127 -0.652408 0.557873 -0.678637 three
2019-03-04 0.135563 0.092274 -0.148957 -2.058148 four
2019-03-05 -0.117025 -0.202763 0.566309 -1.775363 five
2019-03-06 0.253763 -0.209275 -0.503625 -0.399764 six
2019-03-07 -0.307063 0.897490 1.056798 -0.901982 seven
A B C D E
2019-03-01 -2.020410 -0.924757 -1.888649 -0.051893 one
2019-03-02 -0.976324 -0.684672 -0.837020 -0.772484 two
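The opposite filter, keeping rows whose value is not in the list, comes from negating the mask with `~`. A minimal sketch on a hypothetical frame `labels`:

```python
import pandas as pd

# Hypothetical frame with a text column E and a numeric column n.
labels = pd.DataFrame({"E": ["one", "two", "three", "four"], "n": [1, 2, 3, 4]})

wanted = ["one", "two"]

# Rows whose E value is in the list.
keep = labels[labels["E"].isin(wanted)]
print(keep["n"].tolist())   # [1, 2]

# Rows whose E value is NOT in the list: negate the boolean mask with ~.
rest = labels[~labels["E"].isin(wanted)]
print(rest["n"].tolist())   # [3, 4]
```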
Modifying a DataFrame
Adding columns: build from a dict, assign a new column directly, or assign the sum of two existing columns
d = {'one':pd.Series([1,2,3],index = ['a','b','c']),
'two':pd.Series([1,2,3,4],index = ['a','b','c','d'])}
df3 = pd.DataFrame(d)
print(df3)
df3['three'] = pd.Series([20,30,40],index = ['a','b','d'])
print(df3)
df3['four']=df3['one']+df3['three']
print(df3)
one two
a 1.0 1
b 2.0 2
c 3.0 3
d NaN 4
one two three
a 1.0 1 20.0
b 2.0 2 30.0
c 3.0 3 NaN
d NaN 4 40.0
one two three four
a 1.0 1 20.0 21.0
b 2.0 2 30.0 32.0
c 3.0 3 NaN NaN
d NaN 4 40.0 NaN
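The NaN values in column four come from index alignment: a label missing on either side yields NaN. The sketch below repeats the addition with the same two Series and, as one possible alternative, uses `Series.add` with `fill_value=0` to treat a missing side as zero. Whether zero is the right substitute depends on the data.

```python
import pandas as pd

# Same values and labels as the 'one' and 'three' Series above.
one = pd.Series([1, 2, 3], index=["a", "b", "c"])
three = pd.Series([20, 30, 40], index=["a", "b", "d"])

# Plain addition aligns on labels; c and d exist on only one side.
plain = one + three
print(plain.isna().tolist())   # [False, False, True, True]

# add() with fill_value substitutes 0 for the missing side.
filled = one.add(three, fill_value=0)
print(filled.tolist())         # [21.0, 32.0, 3.0, 40.0]
```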
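Deleting columns: del, pop
pop returns the removed column; for example, a = df.pop('three') deletes the column and assigns it to a
# Assumption: df gained an all-NaN 'three' column in a cell that is not shown; recreate it so this example runs
df['three'] = np.nan
print(df)
del df['three']
print(df)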
A B C D three
2019-03-01 -2.020410 -0.924757 -1.888649 -0.051893 NaN
2019-03-02 -0.976324 -0.684672 -0.837020 -0.772484 NaN
2019-03-03 0.353127 -0.652408 0.557873 -0.678637 NaN
2019-03-04 0.135563 0.092274 -0.148957 -2.058148 NaN
2019-03-05 -0.117025 -0.202763 0.566309 -1.775363 NaN
2019-03-06 0.253763 -0.209275 -0.503625 -0.399764 NaN
2019-03-07 -0.307063 0.897490 1.056798 -0.901982 NaN
A B C D
2019-03-01 -2.020410 -0.924757 -1.888649 -0.051893
2019-03-02 -0.976324 -0.684672 -0.837020 -0.772484
2019-03-03 0.353127 -0.652408 0.557873 -0.678637
2019-03-04 0.135563 0.092274 -0.148957 -2.058148
2019-03-05 -0.117025 -0.202763 0.566309 -1.775363
2019-03-06 0.253763 -0.209275 -0.503625 -0.399764
2019-03-07 -0.307063 0.897490 1.056798 -0.901982
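The example above only demonstrates del. A minimal sketch of pop next to it, using a hypothetical frame `table`:

```python
import pandas as pd

# Hypothetical three-column frame.
table = pd.DataFrame({"one": [1, 2], "two": [3, 4], "three": [5, 6]})

# pop removes the column in place and returns it as a Series.
removed = table.pop("three")
print(removed.tolist())         # [5, 6]

# del removes the column in place and returns nothing.
del table["two"]
print(table.columns.tolist())   # ['one']
```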
Adding rows: append
df3
one two three four
a 1.0 1 20.0 21.0
b 2.0 2 30.0 32.0
c 3.0 3 NaN NaN
d NaN 4 40.0 NaN
df4 = pd.DataFrame([[5,6,7,8,9,10]],columns = ['one','two','three','four','five','six'],index=['d'])
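# Note: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; on newer versions use pd.concat([df3, df4])
print(df3.append(df4))  # df3 is not modified; a new DataFrame is returned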
print('-'*50)
print(df3)
five four one six three two
a NaN 21.0 1.0 NaN 20.0 1
b NaN 32.0 2.0 NaN 30.0 2
c NaN NaN 3.0 NaN NaN 3
d NaN NaN NaN NaN 40.0 4
d 9.0 8.0 5.0 10.0 7.0 6
--------------------------------------------------
one two three four
a 1.0 1 20.0 21.0
b 2.0 2 30.0 32.0
c 3.0 3 NaN NaN
d NaN 4 40.0 NaN
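On pandas 2.0 and later, where DataFrame.append no longer exists, the same row-wise combination can be sketched with pd.concat. The frames `base` and `extra` are simplified stand-ins for df3 and df4, and the column order of the result may differ from the older append output shown above.

```python
import pandas as pd

# Simplified stand-ins for df3 and df4 (hypothetical values).
base = pd.DataFrame({"one": [1.0, 2.0], "two": [1, 2]}, index=["a", "b"])
extra = pd.DataFrame([[5, 6, 7]], columns=["one", "two", "five"], index=["d"])

# concat stacks the rows and fills columns missing on either side with NaN.
combined = pd.concat([base, extra])
print(combined.shape)   # (3, 3)

# Like append, concat returns a new frame and leaves its inputs untouched.
print(base.shape)       # (2, 2)
```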
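Deleting rows: df.drop() removes rows from a DataFrame by index label
If a label is duplicated, every row with that label is dropped.
drop does not modify the original DataFrame; assign the result back to keep the change.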
print(df3)
print('-'*50)
print(df3.drop('d'))
print('-'*50)
print(df3)
one two three four
a 1.0 1 20.0 21.0
b 2.0 2 30.0 32.0
c 3.0 3 NaN NaN
d NaN 4 40.0 NaN
--------------------------------------------------
one two three four
a 1.0 1 20.0 21.0
b 2.0 2 30.0 32.0
c 3.0 3 NaN NaN
--------------------------------------------------
one two three four
a 1.0 1 20.0 21.0
b 2.0 2 30.0 32.0
c 3.0 3 NaN NaN
d NaN 4 40.0 NaN
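The two remarks about duplicated labels and reassignment are not exercised by the example above, so here is a small sketch. The frame `dup` is hypothetical.

```python
import pandas as pd

# The label "a" appears twice in the index (hypothetical data).
dup = pd.DataFrame({"x": [1, 2, 3, 4]}, index=["a", "b", "c", "a"])

# Dropping "a" removes both rows carrying that label.
trimmed = dup.drop("a")
print(trimmed.index.tolist())   # ['b', 'c']

# The original frame still has all four rows...
print(len(dup))                 # 4

# ...until the result is assigned back.
dup = dup.drop("a")
print(len(dup))                 # 2
```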
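Summary of basic Python Pandas operations
No. | Attribute or method | Description |
---|---|---|
1 | T | Transposes rows and columns |
2 | axes | Returns a list containing the row axis labels and the column axis labels |
3 | dtypes | Returns the data types (dtypes) in this object |
4 | empty | True if the NDFrame is entirely empty (no items) |
5 | ndim | Number of axes / array dimensions |
6 | shape | Returns a tuple representing the dimensions of the DataFrame |
7 | size | Number of elements in the NDFrame |
8 | values | NumPy representation of the NDFrame |
9 | head() | Returns the first n rows (5 by default) |
10 | tail() | Returns the last n rows (5 by default) |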
print(df)
print('-'*50)
print(df.axes)
print('-'*50)
print(df.dtypes)
print('-'*50)
print(df.empty)
print('-'*50)
print(df.ndim)
print('-'*50)
print(df.shape)
print('-'*50)
print(df.values)
print('-'*50)
A B C D
2019-03-01 -2.020410 -0.924757 -1.888649 -0.051893
2019-03-02 -0.976324 -0.684672 -0.837020 -0.772484
2019-03-03 0.353127 -0.652408 0.557873 -0.678637
2019-03-04 0.135563 0.092274 -0.148957 -2.058148
2019-03-05 -0.117025 -0.202763 0.566309 -1.775363
2019-03-06 0.253763 -0.209275 -0.503625 -0.399764
2019-03-07 -0.307063 0.897490 1.056798 -0.901982
--------------------------------------------------
[DatetimeIndex(['2019-03-01', '2019-03-02', '2019-03-03', '2019-03-04',
'2019-03-05', '2019-03-06', '2019-03-07'],
dtype='datetime64[ns]', freq='D'), Index(['A', 'B', 'C', 'D'], dtype='object')]
--------------------------------------------------
A float64
B float64
C float64
D float64
dtype: object
--------------------------------------------------
False
--------------------------------------------------
2
--------------------------------------------------
(7, 4)
--------------------------------------------------
[[-2.02041037 -0.92475674 -1.88864928 -0.05189268]
[-0.97632394 -0.68467222 -0.83701968 -0.77248437]
[ 0.3531265 -0.6524075 0.55787276 -0.67863676]
[ 0.13556278 0.09227419 -0.14895721 -2.05814846]
[-0.11702479 -0.20276259 0.56630908 -1.77536338]
[ 0.2537632 -0.20927471 -0.50362523 -0.3997636 ]
[-0.30706349 0.89749039 1.05679755 -0.90198211]]
--------------------------------------------------
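The cell above leaves out size, T, and an empty frame. A minimal sketch of those entries from the table, using a hypothetical frame `t`:

```python
import pandas as pd

# Hypothetical 3x2 frame.
t = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [4.0, 5.0, 6.0]})

print(t.size)                      # 6 (rows * columns)
print(t.shape)                     # (3, 2)
print(t.T.shape)                   # (2, 3)
print(t.empty)                     # False
print(pd.DataFrame().empty)        # True
print(t.tail(2).index.tolist())    # [1, 2]
```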