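The Pandas Library

Data analysis

1. Getting Started with Pandas

Pandas provides high-performance, easy-to-use data types and analysis tools.

1.1 Introduction to Pandas

Installation

pip install -i https://mirrors.aliyun.com/pypi/simple/ pandas

Usage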

import pandas as pd
a = pd.Series(range(3))
a
# Out[]:
0    0
1    1
2    2
dtype: int64

Index: the left column, 0-2
Values: the right column, 0-2
dtype: the element type, int64

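Understanding Pandas

An extension library built on NumPy, commonly used together with NumPy and Matplotlib
Two data types: Series (one-dimensional) and DataFrame (two-dimensional and higher)
Provides operations on these data types: basic operations, arithmetic operations, feature (statistical) operations, and relational operations

NumPy	Pandas
Basic data type `ndarray`	Extended data types `Series` and `DataFrame`
Focuses on the structural representation of data (dimensions among data)	Focuses on the applied representation of data
Dimension: the relationship among data	The relationship between data and index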

1.2 `Series`

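A Series consists of a set of data and the index associated with it

1.2.1 Creating a Series

Creating from a Python list

The index must have the same number of elements as the list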

import pandas as pd
# Build a Series object
a = pd.Series([3, 4])
a
# Out[]:
0    3
1    4
dtype: int64

The index 0-1 is generated automatically
The dtype follows NumPy's data types

b = pd.Series([3, 4], index = ['a', 'b'])
b
# Out[]:
a    3
b    4
dtype: int64

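Use index= to set a custom index. index is the second parameter, so the keyword "index=" may be omitted

Creating from a scalar

The index determines the size of the Series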

pd.Series(25, index = [0, 1, 2])
# Out[]:
0    25
1    25
2    25
dtype: int64

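Here the index cannot be left out, because it decides how many times the scalar is repeated

Creating from a dictionary

Keys and values become the index and the values; passing index selects from (and reorders) the dictionary entries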

pd.Series({'a' : 2, 'b' : 3})
# Out[]:
a    2
b    3
dtype: int64

# Reorder the index
# This can be seen as using index to select entries from the dictionary
pd.Series({'a' : 2, 'b' : 3}, index= ['b', 'c', 'a'])
# Out[]:
b    3.0
c    NaN
a    2.0
dtype: float64

Creating from an ndarray

import numpy as np
pd.Series(np.arange(3))
# Out[]:
0    0
1    1
2    2
dtype: int32
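
The four creation routes above can be put side by side. This is a minimal sketch; the variable names and the sample labels ("x", "p", and so on) are made up for illustration and do not come from the notes.

```python
import numpy as np
import pandas as pd

# From a list: a default index 0..n-1 is generated automatically.
from_list = pd.Series([3, 4])

# From a scalar: the index decides how many times the value is repeated.
from_scalar = pd.Series(25, index=["x", "y", "z"])

# From a dict with an explicit index: labels missing from the dict become NaN.
from_dict = pd.Series({"a": 2, "b": 3}, index=["b", "c", "a"])

# From an ndarray, here with custom labels.
from_array = pd.Series(np.arange(3), index=["p", "q", "r"])

print(len(from_scalar))           # 3
print(from_dict.isna().tolist())  # [False, True, False]
```

Note that the integer width shown for the ndarray case (int32 or int64) depends on the platform and the NumPy version.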

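1.2.2 Basic Operations on Series

A Series has two parts: index and values

Series operations resemble ndarray operations

Indexing works the same way, using []
NumPy operations and functions can be applied to a Series
A list of custom index labels can be used for selection
Slicing works through the automatic (positional) index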

a = pd.Series([2, 5, 7,4], index=['a', 'b', 'c', 'd'])
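# Get the index
a.index # Index(['a', 'b', 'c', 'd'], dtype='object')
# Get the values
a.values # array([2, 5, 7, 4], dtype=int64)

The index of a Series is an Index object; values is a NumPy array
In other words, a Series keeps its values in a NumPy array and associates an index with them

a['a'] # 2
a[0] # 2 (a single index returns just the value; newer pandas versions deprecate positional [] on a labeled Series, so a.iloc[0] is the safer form)

Even with a custom index, elements can still be reached through the default positional index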

a[['a', 'b', 2]] # Traceback (most recent call last): raises an error
a[['a', 'b', 'c']]
# a    2
# b    5
# c    7
# dtype: int64

Multiple indices return a Series
Custom labels and default positions cannot be mixed in one selection

a[:3]
# a    2
# b    5
# c    7
# dtype: int64

Slicing returns the Series at positions 0-2 (it stops at 3 and excludes 3)
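
Because [] accepts both labels and positions, it can help to make the intent explicit. The sketch below uses .loc and .iloc, which the notes have not introduced at this point; they are standard pandas accessors, and the variable names are illustrative only.

```python
import pandas as pd

a = pd.Series([2, 5, 7, 4], index=["a", "b", "c", "d"])

first = a.iloc[0]              # select by position
by_label = a.loc["a"]          # select by label
pos_slice = a.iloc[:3]         # positions 0, 1, 2 (end excluded)
label_slice = a.loc["a":"c"]   # labels a, b, c (end label included)

print(pos_slice.tolist())      # [2, 5, 7]
print(label_slice.tolist())    # [2, 5, 7]
```

The difference worth remembering is that a positional slice excludes its end, while a label slice includes it.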

# Index with a comparison (boolean mask)
a[a > a.median()]
# b    5
# c    7
# dtype: int64

Returns every item greater than the median, as a Series

a**2
# a     4
# b    25
# c    49
# d    16
# dtype: int64

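NumPy operations and functions can be applied to a Series

Series operations also resemble Python dictionary operations

Access through custom index labels
Use the keyword in
Use the .get() method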

a = pd.Series([2, 5, 7,4], index=['a', 'b', 'c', 'd'])

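'c' in a # True
0 in a # False

a.get('f', 100) # 100
a.get('a', 100) # 2

For a dictionary, in tests whether a key exists; here it tests whether a label is in the Series index. The automatic positional index is not checked
get() fetches the value at index 'f' from a; if there is no such index, it returns the default that follows, 100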
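
A short sketch of the dictionary-like behaviour, assuming the same Series a as above. The check against a.values is an addition for contrast and is not part of the original notes.

```python
import pandas as pd

a = pd.Series([2, 5, 7, 4], index=["a", "b", "c", "d"])

has_c = "c" in a               # True: "c" is an index label
has_0 = 0 in a                 # False: 0 is a position, not a label
has_value_7 = 7 in a.values    # True: membership among the values must be asked explicitly
fallback = a.get("f", 100)     # 100: label "f" does not exist, so the default is returned

print(has_c, has_0, bool(has_value_7), fallback)
```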

Series alignment

In arithmetic, Series automatically align data on their index labels

a = pd.Series([2, 5, 7, 4], index=['a', 'b', 'c', 'd'])
b = pd.Series([6, 8, 9], index=['c', 'a', 'f'])
a + b
# a    10.0
# b     NaN
# c    13.0
# d     NaN
# f     NaN
# dtype: float64

When the element counts and indices differ, addition pairs up matching labels; labels present on only one side are not computed and yield NaN
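
If NaN for unmatched labels is not wanted, the method form .add() with fill_value is one way around it. A minimal sketch with the same two Series; the choice of 0 as the fill value is only an example.

```python
import pandas as pd

a = pd.Series([2, 5, 7, 4], index=["a", "b", "c", "d"])
b = pd.Series([6, 8, 9], index=["c", "a", "f"])

plain = a + b                     # unmatched labels give NaN
filled = a.add(b, fill_value=0)   # a missing side is treated as 0 before adding

print(plain.isna().sum())   # 3
print(filled["b"], filled["f"])
```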

1.2.3 The name Attribute of Series

Both the Series object and its index can have a name, stored in .name

a = pd.Series([2, 5, 7, 4], index=['a', 'b', 'c', 'd'])
a.name = 'Series object'
a.index.name = 'index label'

a
# index label
# a    2
# b    5
# c    7
# d    4
# Name: Series object, dtype: int64

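The .name information is added to the printed output

1.2.4 Modifying a Series

A Series can be modified at any time, and the change takes effect immediately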

a = pd.Series([2, 5, 7, 4], index=['a', 'b', 'c', 'd'])
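a[['a', 'b']] = [20, 22]   # assign one value per selected label
a['a'] # 20
a['b'] # 22

a[['a', 'b']] = 20   # a scalar is broadcast to every selected label
a['a'] # 20
a['b'] # 20

The key to Series is that it is a one-dimensional "labeled" array. Basic operations resemble those of ndarray and dict, and arithmetic aligns on the index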

1.3 DataFrame

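A DataFrame consists of a group of columns that share the same index

A tabular data type; the columns may have the same or different dtypes
It has a row index (index) and a column index (columns)
It can represent two-dimensional data, and also multi-dimensional data

1.3.1 Creating a DataFrame

From a two-dimensional ndarray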

import pandas as pd
import numpy as np
a = pd.DataFrame(np.arange(10).reshape(2,5))
a
#   0   1   2   3   4
# 0 0   1   2   3   4
# 1 5   6   7   8   9

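Indices are generated automatically for both rows and columns

From a dictionary of one-dimensional ndarrays, lists, dictionaries, tuples, or Series

azd = {'one': pd.Series([1, 2, 3], index = ['a', 'b', 'c']),
      'two': pd.Series([6, 7, 8, 9], index = ['a', 'b', 'c', 'd'])}
a = pd.DataFrame(azd)
a
#   one two
# a 1.0 6
# b 2.0 7
# c 3.0 8
# d NaN 9

The dictionary keys become the columns by default, and missing entries are filled with NaN

pd.DataFrame(azd, index=['b', 'c', 'd'], columns=['two', 'three'])
#   two three
# b 7   NaN
# c 8   NaN
# d 9   NaN

Existing elements are extracted, and positions without data are filled automatically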

alb = {'one': [1, 2, 3, 4], 'two': [6, 7, 8, 9]}
a = pd.DataFrame(alb, index=['a', 'b', 'c', 'd'])
a
#   one two
# a 1   6
# b 2   7
# c 3   8
# d 4   9

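Lists used as dictionary values must all have the same number of elements; otherwise an error is raised and nothing is filled in automatically
Note that in older Python and pandas versions the column order of the resulting DataFrame was not guaranteed to match the order given in the dictionary; with Python 3.7+ and current pandas, insertion order is preserved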
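
The contrast between Series values (aligned and filled) and plain lists (must match in length) can be checked directly. A minimal sketch; the try/except wrapper and the variable names are illustrative.

```python
import pandas as pd

# Series values are aligned on their labels; gaps become NaN.
from_series = pd.DataFrame({
    "one": pd.Series([1, 2, 3], index=["a", "b", "c"]),
    "two": pd.Series([6, 7, 8, 9], index=["a", "b", "c", "d"]),
})

# Plain lists carry no labels, so unequal lengths cannot be aligned.
try:
    pd.DataFrame({"one": [1, 2, 3], "two": [6, 7, 8, 9]})
    raised = False
except ValueError:
    raised = True

print(from_series.shape)  # (4, 2)
print(raised)             # True
```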

From a Series
From another DataFrame

1.3.2 Basic Operations on DataFrame

a.index # Index(['a', 'b', 'c', 'd'], dtype='object')
a.columns # Index(['one', 'two'], dtype='object')

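# Get a column directly by its column label
a['two']
# a    6
# b    7
# c    8
# d    9
# Name: two, dtype: int64

# Get a row
a.loc['a']
# one    1
# two    6
# Name: a, dtype: int64

# Get a single value
a['one']['a'] # 1

# Modify in place (a single .loc call avoids chained assignment)
a.loc['a', 'one'] = 2
a
#   one two
# a 2   6
# b 2   7
# c 3   8
# d 4   9

Both indices of a DataFrame are Index objects

The key to DataFrame is that it is a two-dimensional "labeled" array; basic operations resemble those of Series and rely on the row and column indices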
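
The access patterns above can be collected in one place. This sketch rebuilds a small frame under the hypothetical name df and reads or writes cells through a single .loc call, which is the form pandas recommends over chained indexing.

```python
import pandas as pd

df = pd.DataFrame({"one": [1, 2, 3, 4], "two": [6, 7, 8, 9]},
                  index=["a", "b", "c", "d"])

col = df["two"]                      # one column, as a Series
row = df.loc["a"]                    # one row, as a Series
cell = df.loc["a", "one"]            # one value: row label first, column label second
sub = df.loc[["a", "c"], ["two"]]    # a sub-table selected by label lists

df.loc["a", "one"] = 2               # assignment through one .loc call

print(cell)                # 1 (read before the assignment)
print(df.loc["a", "one"])  # 2
```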

1.4 Operations on Data Types

1.4.1 Adding to or Reordering Pandas Data: .reindex()

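import pandas as pd
# Column labels are kept in Chinese: 城市 (city), 环比 (month-on-month), 地基 (base index), 同比 (year-on-year)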
azd = {'城市' : ['北京', '上海', '广州', '长沙'],
    '环比': [1, 2, 3, 4],
    '地基': [5, 6, 7, 8],
    '同比': [9, 10, 11, 12]}
a = pd.DataFrame(azd, index = ['c1', 'c2', 'c3', 'c4'])
a
#     城市    环比  地基  同比
# c1    北京  1   5   9
# c2    上海  2   6   10
# c3    广州  3   7   11
# c4    长沙  4   8   12

# Reorder the rows
a.reindex(index=['c3', 'c2', 'c1', 'c4'])
#      城市   环比  地基  同比
# c3    广州  3   7   11
# c2    上海  2   6   10
# c1    北京  1   5   9
# c4    长沙  4   8   12

# Reorder the columns
a.reindex(columns=['城市', '地基', '同比','环比'])
#      城市   地基  同比  环比
# c1    北京  5   9   1
# c2    上海  6   10  2
# c3    广州  7   11  3
# c4    长沙  8   12  4

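`.reindex()` parameter	Description
`index` `columns`	New custom row and column indices
`fill_value`	Value used to fill missing positions during reindexing
`method`	Fill method: `ffill` propagates the previous value forward, `bfill` fills backward from the next value
`limit`	Maximum number of consecutive fills
`copy`	Default `True` creates a new object; with `False`, no copy is made when the new and old indices are equal

# a.columns is an Index object; its insert method returns a new Index with one more element (see the Index methods table below)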
columns_new = a.columns.insert(4, '新增')
columns_new # Index(['城市', '环比', '地基', '同比', '新增'], dtype='object')

b = a.reindex(columns=columns_new, fill_value=10)
b # a itself is unchanged
#     城市    环比  地基  同比  新增
# c1    北京  1   5   9   10
# c2    上海  2   6   10  10
# c3    广州  3   7   11  10
# c4    长沙  4   8   12  10

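The indices of Series and DataFrame are all Index objects

Common Index methods	Description
`.append(idx)`	Concatenate another Index object, producing a new Index object
`.difference(idx)`	Compute the set difference; the argument is another Index object
`.intersection(idx)`	Compute the intersection
`.union(idx)`	Compute the union
`.delete(loc)`	Delete the element at position loc and return a new Index object (list-like operation)
`.insert(loc, 'e')`	Insert element e at position loc (list-like operation)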

ind_new = a.index.insert(4, 'c5')
ind_new # Index(['c1', 'c2', 'c3', 'c4', 'c5'], dtype='object')

b = a.reindex(index=ind_new, method='ffill')
b
#       城市  环比  地基  同比
# c1    北京  1   5   9
# c2    上海  2   6   10
# c3    广州  3   7   11
# c4    长沙  4   8   12
# c5    长沙  4   8   12
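
The fill options of .reindex() are easier to compare on a small Series. This is a sketch with made-up data; the names s, const, fwd and back are illustrative.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=[0, 2, 4])

const = s.reindex(range(6), fill_value=0)    # new labels get a constant
fwd = s.reindex(range(6), method="ffill")    # new labels copy the previous existing value
back = s.reindex(range(6), method="bfill")   # new labels copy the next existing value

print(const.tolist())  # [10, 0, 20, 0, 30, 0]
print(fwd.tolist())    # [10, 10, 20, 20, 30, 30]
print(back.tolist())   # last entry is NaN: nothing follows label 5
```

The method= options require a monotonically ordered index, which holds for both this example and the c1-c5 labels above. The original s is left untouched in every case.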

1.4.2 Deletion: drop

# Drop rows
a.drop(['c1', 'c2'])
#     城市    环比  地基  同比
# c3    广州  3   7   11
# c4    长沙  4   8   12

# Drop columns: pass the axis to operate on the second dimension
a.drop(['环比', '同比'], axis = 1)
#     城市    地基
# c1    北京  5
# c2    上海  6
# c3    广州  7
# c4    长沙  8

By default, drop operates on the first dimension (axis = 0)
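
As an alternative to axis=, drop also accepts the index= and columns= keywords, which some readers find clearer. A minimal sketch with a hypothetical three-column frame:

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6], "z": [7, 8, 9]},
                 index=["c1", "c2", "c3"])

rows_dropped = a.drop(index=["c1"])          # same as a.drop(["c1"], axis=0)
cols_dropped = a.drop(columns=["y", "z"])    # same as a.drop(["y", "z"], axis=1)

print(rows_dropped.shape)   # (2, 3)
print(cols_dropped.shape)   # (3, 1)
print(a.shape)              # (3, 3): drop returns a new object
```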

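1.5 Data Operations

Same dimension: align and fill; different dimensions: broadcast

1.5.1 Arithmetic Operations on Pandas Data

Operands are aligned on row and column indices before computing; the result is floating point by default
Missing positions are filled with NaN by default
Operations between 2-D and 1-D, or 1-D and 0-D (scalar), are broadcast

Same-dimension operations: align, then compute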

import pandas as pd
import numpy as np
a = pd.DataFrame(np.arange(12).reshape(3,4))
a
#   0   1   2   3
# 0 0   1   2   3
# 1 4   5   6   7
# 2 8   9   10  11

b = pd.DataFrame(np.arange(20).reshape(4,5))

a + b # aligned and filled first, then computed
#   0   1   2   3   4
# 0 0.0 2.0 4.0 6.0 NaN
# 1 9.0 11.0    13.0    15.0    NaN
# 2 18.0    20.0    22.0    24.0    NaN
# 3 NaN NaN NaN NaN NaN

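Method form	Description
`.add(d, **kwargs)`	Addition between objects, with optional parameters
`.sub(d, **kwargs)`	Subtraction between objects, with optional parameters
`.mul(d, **kwargs)`	Multiplication between objects, with optional parameters
`.div(d, **kwargs)`	Division between objects, with optional parameters

a.add(b, fill_value= 100)
#   0   1   2   3   4
# 0 0.0 2.0 4.0 6.0 104.0
# 1 9.0 11.0    13.0    15.0    109.0
# 2 18.0    20.0    22.0    24.0    114.0
# 3 115.0   116.0   117.0   118.0   119.0

Positions missing from a are first filled with 100 and then added to the corresponding values of b
fill_value replaces NaN during the operation. This effect is available only through the method forms above

Broadcasting: operations across different dimensions, where the lower-dimensional operand is applied to every element of the higher-dimensional one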

c = pd.Series(np.arange(3))
c
# 0    0
# 1    1
# 2    2
# dtype: int32

c - 10
# 0   -10
# 1    -9
# 2    -8
# dtype: int32

2-D minus 1-D: c is subtracted from every row (along the second dimension); unmatched positions are filled with NaN

a
#   0   1   2   3
# 0 0   1   2   3
# 1 4   5   6   7
# 2 8   9   10  11

a - c
#   0   1   2   3
# 0 0.0 0.0 0.0 NaN
# 1 4.0 4.0 4.0 NaN
# 2 8.0 8.0 8.0 NaN

c - a
#   0   1   2   3
# 0 0.0 0.0 0.0 NaN
# 1 -4.0    -4.0    -4.0    NaN
# 2 -8.0    -8.0    -8.0    NaN

By default the operation is applied along the second dimension (axis=1)

a.sub(c, axis = 0)
#   0   1   2   3
# 0 0   1   2   3
# 1 3   4   5   6
# 2 6   7   8   9
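
The effect of the axis argument can be verified side by side. A minimal sketch reusing the shapes from the notes; by_cols and by_rows are illustrative names.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame(np.arange(12).reshape(3, 4))
c = pd.Series([0, 1, 2])

# axis=1 (the default, same as a - c): c's labels are matched against the COLUMN labels.
by_cols = a.sub(c, axis=1)

# axis=0: c's labels are matched against the ROW labels.
by_rows = a.sub(c, axis=0)

print(by_cols[3].isna().all())   # True: column 3 has no partner in c
print(by_rows.values.tolist())   # [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```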

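1.5.2 Comparison Operations on Pandas Data

Only elements with the same index are compared; no filling takes place
Operations between 2-D and 1-D, or 1-D and 0-D (scalar), are broadcast
Binary operations with > < >= <= == != produce Boolean objects

Same-dimension comparison requires identical shapes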

import pandas as pd
import numpy as np
a = pd.DataFrame(np.arange(12).reshape(3,4))
b = pd.DataFrame(np.arange(12, 0, -1).reshape(3,4))
b
#   0   1   2   3
# 0 12  11  10  9
# 1 8   7   6   5
# 2 4   3   2   1

a > b
#   0   1   2   3
# 0 False   False   False   False
# 1 False   False   False   True
# 2 True    True    True    True

Different dimensions: broadcast, with axis=1 by default

c = pd.Series(np.arange(3))

a > c
#   0   1   2   3
# 0 False   False   False   False
# 1 True    True    True    False
# 2 True    True    True    False

c > 0
# 0    False
# 1     True
# 2     True
# dtype: bool

Positions that had to be filled during alignment all return False
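
The Boolean result of a comparison is typically used as a mask. The following sketch compares against a scalar and then uses the mask for counting and row selection; the threshold 5 and the variable names are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame(np.arange(12).reshape(3, 4))

mask = a > 5                       # Boolean DataFrame with the same shape as a
count_per_col = mask.sum()         # True counts as 1, so this counts matches per column
rows_any = a[mask.any(axis=1)]     # keep rows with at least one value above 5

print(count_per_col.tolist())      # [1, 1, 2, 2]
print(list(rows_any.index))        # [1, 2]
```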

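2. Data Feature Analysis

2.1 Sorting Data

.sort_index() sorts along the given axis by index, ascending by default (values follow their index after sorting)

It sorts the index, not the data

.sort_index(axis = 0, ascending = True)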

import pandas as pd
import numpy as np
a = pd.DataFrame(np.arange(12).reshape(3,4), index = ['b', 'c', 'a'], columns = ['b', 'd', 'a', 'c'])

a.sort_index()
#   b   d   a   c
# a 8   9   10  11
# b 0   1   2   3
# c 4   5   6   7

a.sort_index(axis=1)
#   a   b   c   d
# b 2   0   3   1
# c 6   4   7   5
# a 10  8   11  9

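.sort_values() sorts along the given axis by value, ascending by default (index labels follow their values after sorting)
Series.sort_values(axis = 0, ascending = True)
DataFrame.sort_values(by, axis = 0, ascending = True)

by: a label or list of labels used as the sort key (column labels when axis=0, row labels when axis=1)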

a.sort_values('a', axis = 1, ascending=False)
#   c   a   d   b
# b 3   2   1   0
# c 7   6   5   4
# a 11  10  9   8

NaN values are always placed at the end of the sort
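
The NaN placement holds for both sort directions and can be changed with the na_position parameter. A small sketch with made-up values:

```python
import numpy as np
import pandas as pd

s = pd.Series([3.0, np.nan, 1.0, 2.0], index=["a", "b", "c", "d"])

asc = s.sort_values()                        # NaN last
desc = s.sort_values(ascending=False)        # NaN still last
first = s.sort_values(na_position="first")   # NaN moved to the front

print(list(asc.index))    # ['c', 'd', 'a', 'b']
print(list(desc.index))   # ['a', 'd', 'c', 'b']
print(list(first.index))  # ['b', 'c', 'd', 'a']
```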

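2.2 Statistical Analysis

Statistical method	Description
`.sum()`	Sum of the data, axis=0 by default (same below)
`.max()` `.min()`	Maximum and minimum of the data
`.mean()` `.median()`	Arithmetic mean and median of the data
`.var()` `.std()`	Variance and standard deviation of the data
`.count()`	Number of non-NaN values

The methods resemble those of NumPy
They apply to both Series and DataFrame

Statistical method	Description
`.argmin()` `.argmax()`	Integer position (automatic index) of the minimum and maximum
`.idxmin()` `.idxmax()`	Index label (custom index) of the minimum and maximum

.argmin() and .argmax() apply to Series only; .idxmin() and .idxmax() are also available on DataFrame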
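
The difference between a position and a label is easiest to see on a Series whose index is not 0..n-1. A minimal sketch; the data and names are invented for illustration.

```python
import pandas as pd

s = pd.Series([3, 9, 4], index=["x", "y", "z"])

pos = s.argmax()     # integer position of the maximum
label = s.idxmax()   # index label of the maximum

df = pd.DataFrame({"u": [1, 5, 2], "v": [7, 0, 3]}, index=["x", "y", "z"])
per_col = df.idxmax()   # for a DataFrame: one label per column (axis=0)

print(pos, label)          # 1 y
print(per_col.tolist())    # ['y', 'x']
```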

.describe(): a statistical summary along axis = 0 (per column)

import pandas as pd
a = pd.Series([3, 5, 6, 6], index = ['a', 'b', 'c', 'd'])

a.describe()
# count    4.000000
# mean     5.000000
# std      1.414214
# min      3.000000
# 25%      4.500000
# 50%      5.500000
# 75%      6.000000
# max      6.000000
# dtype: float64

.describe() outputs all the statistics at once

type(a.describe()) # pandas.core.series.Series
a.describe()['max'] # 6.0

For a Series, .describe() returns a Series, so Series methods can be used to extract individual results

import pandas as pd
import numpy as np
a = pd.DataFrame(np.arange(12).reshape(3,4), index = ['b', 'c', 'a'], columns = ['b', 'd', 'a', 'c'])
a.describe()
#       b   d   a   c
# count 3.0 3.0 3.0 3.0
# mean  4.0 5.0 6.0 7.0
# std   4.0 4.0 4.0 4.0
# min   0.0 1.0 2.0 3.0
# 25%   2.0 3.0 4.0 5.0
# 50%   4.0 5.0 6.0 7.0
# 75%   6.0 7.0 8.0 9.0
# max   8.0 9.0 10.0    11.0

type(a.describe()) # pandas.core.frame.DataFrame
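# Get column b
a.describe()['b']
# Get the 'count' row
a.describe().loc['count']

2.3 Cumulative Statistics

Cumulative statistics: cumulative computation over the first n numbers

Cumulative function	Description
`.cumsum()`	Successively gives the sum of the first 1, 2, ..., n numbers (n inclusive, same below)
`.cumprod()`	Successively gives the product of the first 1, 2, ..., n numbers
`.cummax()`	Successively gives the maximum of the first 1, 2, ..., n numbers
`.cummin()`	Successively gives the minimum of the first 1, 2, ..., n numbers

Applies to both Series and DataFrame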

import pandas as pd
import numpy as np
a = pd.DataFrame(np.arange(12).reshape(3,4), index = ['b', 'c', 'a'], columns = ['b', 'd', 'a', 'c'])
a.cumsum()
#   b   d   a   c
# b 0   1   2   3
# c 4   6   8   10
# a 12  15  18  21

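axis defaults to 0; accumulation starts from the first row

Rolling (window) computation	Description
`.rolling(w).sum()`	Sum of each run of w adjacent elements
`.rolling(w).mean()`	Arithmetic mean of each run of w adjacent elements
`.rolling(w).var()`	Variance of each run of w adjacent elements
`.rolling(w).std()`	Standard deviation of each run of w adjacent elements
`.rolling(w).max()` `.rolling(w).min()`	Maximum and minimum of each run of w adjacent elements

Applies to both Series and DataFrame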

a.rolling(2).sum()
#   b   d   a   c
# b NaN NaN NaN NaN
# c 4.0 6.0 8.0 10.0
# a 12.0    14.0    16.0    18.0

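axis defaults to 0; windows with fewer than w elements are filled with NaN

Both rolling and cumulative computations use the original values, not values generated earlier in the result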
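
A one-dimensional example makes the difference between cumulative and rolling results concrete. The min_periods argument shown last is a standard option of .rolling() that the notes do not cover; it is included here only as a possible way to avoid the leading NaN values.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

running = s.cumsum()                          # sum of everything seen so far
window = s.rolling(3).sum()                   # sum of the last 3 values only
partial = s.rolling(3, min_periods=1).sum()   # accept shorter windows at the start

print(running.tolist())   # [1, 3, 6, 10, 15]
print(window.tolist())    # first two entries are NaN, then 6.0, 9.0, 12.0
print(partial.tolist())   # [1.0, 3.0, 6.0, 9.0, 12.0]
```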

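2.4 Correlation Analysis

X increases and Y increases: the two variables are positively correlated
X increases and Y decreases: the two variables are negatively correlated
X increases and Y shows no consistent change: the two variables are uncorrelated

Covariance: for every pair of observations, multiply the deviation of one variable from its mean by the deviation of the other from its mean, and accumulate the products

[Figure: covariance formula]

Covariance > 0: positive correlation; covariance < 0: negative correlation; covariance = 0: uncorrelated

Pearson correlation coefficient: measures whether two data sets lie along one line, that is, the linear relationship between interval-scale variables

[Figure: Pearson correlation coefficient formula]

r ranges over [-1, 1]; its absolute value (|r|) is generally used for judgment
0.8-1.0 very strong correlation; 0.6-0.8 strong correlation; 0.4-0.6 moderate correlation; 0.2-0.4 weak correlation; 0.0-0.2 very weak or no correlation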
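
For reference, the two quantities described above are conventionally written as follows for n paired observations. This is the standard textbook (sample) form and only an assumption about what the original figures showed; the symbols X_i, Y_i and the bars for means are introduced here.

```latex
\operatorname{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}
\qquad
r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}
         {\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}}
```

The normalising factor n - 1 cancels in r, so the coefficient does not depend on whether the sample or population form of the covariance is used.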

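Analysis function	Description
`.cov()`	Compute the covariance matrix
`.corr()`	Compute the correlation matrix, using the Pearson, Spearman, or Kendall coefficient

Applies to both Series and DataFrame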

import pandas as pd
hprice = pd.Series([3.04, 22.92, 12.75, 22.6, 22.33],
                   index = ['2008', '2009', '2010', '2011', '2012'])
m2 = pd.Series([8.18, 18.19, 9.13, 7.87, 6.69],
               index = ['2008', '2009', '2010', '2011', '2012'])
hprice.corr(m2) # 0.29435037215132426 (weak correlation)

import matplotlib.pyplot as plt
plt.plot(hprice.index, hprice, m2.index, m2)
plt.show()
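
To connect .cov() and .corr() with the verbal definitions in this section, the two values can be recomputed by hand and compared. This is a sketch on made-up data (x and y are not from the notes), using the sample form with n - 1 in the denominator, which is what pandas uses by default.

```python
import numpy as np
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 4.0, 5.0, 4.0, 5.0])

# Covariance: products of paired deviations from the means, summed, divided by n - 1.
cov_manual = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)

# Pearson r: covariance scaled by both (sample) standard deviations.
r_manual = cov_manual / (x.std() * y.std())

print(np.isclose(cov_manual, x.cov(y)))   # True
print(np.isclose(r_manual, x.corr(y)))    # True
```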
