Dashixiong's Python Machine Learning Notes: The Pandas Library

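I. About Pandas

1. Pandas and NumPy

Pandas is built on NumPy arrays and makes data preprocessing, cleaning, and analysis faster and simpler.
Pandas is designed for tabular and heterogeneous data; you can think of it as Excel for Python.
NumPy is better suited to homogeneous numerical array data.
Pandas provides two main data structures: DataFrame and Series.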

import pandas as pd

2. The DataFrame structure

A DataFrame is a tabular data structure; you can think of it as an Excel sheet.
Conceptually, a DataFrame is a dict of Series that share one row index.

>>>import pandas as pd
>>>data ={"name":["pp","qq","doudou","douding","xiaobudian"],
      "age":[10,1.5,0,5,7],
       "gender":["m","m","m","f","f"]
      }
>>>df = pd.DataFrame(data)
>>>print(df)
         name   age gender
0          pp  10.0      m
1          qq   1.5      m
2      doudou   0.0      m
3     douding   5.0      f
4  xiaobudian   7.0      f
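
To make the "dict of Series" view concrete, here is a minimal sketch (the variable names are illustrative, not from the original notes) showing that selecting one column returns a Series that shares the DataFrame's row index.

```python
import pandas as pd

data = {"name": ["pp", "qq", "doudou"],
        "age": [10, 1.5, 0]}
df = pd.DataFrame(data)

# Selecting a column by its key returns a Series, much like a dict lookup.
ages = df["age"]
column_type = type(ages).__name__

# The column shares the DataFrame's row index.
same_index = ages.index.equals(df.index)

print(column_type, same_index)
```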

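3. The Series structure

A Series is a one-dimensional, array-like object made of two parts: the data (values) and the labels associated with the data (index).
You can think of a Series as a single column of an Excel sheet.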

>>>import pandas as pd
>>>import numpy as np
>>>random_num = np.random.rand(10)
>>>s = pd.Series(random_num)
>>>print(s)
0    0.241130
1    0.911937
2    0.276555
3    0.570505
4    0.915634
5    0.214568
6    0.179911
7    0.113886
8    0.449848
9    0.025474
dtype: float64
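
The two parts of a Series can be inspected directly. The following sketch uses a small hand-written Series (the labels and numbers are made up for illustration) to show the index, the values, and a lookup by label.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])

labels = list(s.index)    # the index part
values = s.values         # the values part, as a NumPy array

print(labels)
print(values)
print(s["b"])             # look up a value by its label
```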

II. Creating tables

1. Creating a Series

1.1 From a list

>>>import pandas as pd
>>>s = pd.Series(["a","b","c","d","e"])
>>>print(s)
0    a
1    b
2    c
3    d
4    e
dtype: object

1.2 From an ndarray

>>>import pandas as pd
>>>import numpy as np
>>>s = pd.Series(np.arange(5))
>>>print(s)
0    0
1    1
2    2
3    3
4    4
dtype: int32

1.3 From a dict

>>>import pandas as pd
>>>import numpy as np
>>>s = pd.Series({'a':1,'b':2,'c':3,'d':4,'e':5})
>>>print(s)
a    1
b    2
c    3
d    4
e    5
dtype: int64

1.4 Using a list as the index

>>>import pandas as pd
>>>import numpy as np
>>>s = pd.Series(np.arange(5),index=['e','d','c','b','a'])
>>>print(s)
e    0
d    1
c    2
b    3
a    4
dtype: int32

2. Creating a DataFrame

2.1 From an ndarray

>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.arange(20).reshape((4,5))
>>>index = list('ABCD')
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=index,columns=columns)
>>>print(df)
   One  Two  Three  Four  Five
A    0    1      2     3     4
B    5    6      7     8     9
C   10   11     12    13    14
D   15   16     17    18    19

2.2 From Series

>>>import pandas as pd
>>>index=list("abcde")
>>>s = {'one':pd.Series(range(5),index=index),
...    'two':pd.Series(range(4,9),index=index),
...    'three':pd.Series(range(8,13),index=index),
...    'four':pd.Series(range(12,17),index=index),}
>>>fd = pd.DataFrame(data=s)
>>>print(fd)
   one  two  three  four
a    0    4      8    12
b    1    5      9    13
c    2    6     10    14
d    3    7     11    15
e    4    8     12    16

2.3 From a list of dicts or Series

>>>import pandas as pd
>>>l1 = [{'one':1,'two':2,'tree':3},
...    {'one':5,'two':6},
...    {'three':7,'four':8},
...    {'four':4},]
>>>fd = pd.DataFrame(data=l1)
>>>print(fd)
   one  two  tree  three  four
0  1.0  2.0   3.0    NaN   NaN
1  5.0  6.0   NaN    NaN   NaN
2  NaN  NaN   NaN    7.0   8.0
3  NaN  NaN   NaN    NaN   4.0

2.4 From a dict of dicts

>>>import pandas as pd
>>>d1 = {'one':{'a':1,'b':2,'c':3,'d':4},
...    'two':{'a':5,'b':7,'c':6,'d':8},
...    'three':{'a':11,'c':12},
...    'four':{'b':13,'c':14},}
>>>fd = pd.DataFrame(data=d1)
>>>print(fd)
   one  two  three  four
a    1    5   11.0   NaN
b    2    7    NaN  13.0
c    3    6   12.0  14.0
d    4    8    NaN   NaN

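3. Reading tables from files

3.1 Reader functions

Function	Description
read_csv()	Load delimited data from a file; the default delimiter is a comma.
read_table()	Load delimited data from a file; the default delimiter is a tab.
read_fwf()	Read fixed-width column data (no delimiter).
read_clipboard()	Read data from the clipboard.
read_excel()	Load data from an XLS or XLSX file.
read_hdf()	Load data from an HDF5 file.
read_html()	Load tables from an HTML document.
read_json()	Load data from a JSON string or file.
read_msgpack()	Read pandas data encoded in the MessagePack binary format (removed in pandas 1.0).
read_pickle()	Load an object stored in Python pickle format.
read_sas()	Read a SAS dataset stored in one of the SAS system's custom storage formats.
read_sql()	Read the result of a SQL query (via SQLAlchemy).
read_stata()	Read data in the Stata file format.
read_feather()	Read the Feather binary file format.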

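3.2 Common parameters

Parameter	Description
path	A string giving a filesystem location or URL, or a file-like object (the actual argument name is filepath_or_buffer).
sep or delimiter	Character sequence or regular expression used to split the fields in each row.
header	Row number to use as the column names. Defaults to 0 (the first row); set header=None if the file has no header row.
index_col	Column number(s) or name(s) to use as the row index. May be a single name/number or a list of them (hierarchical index).
names	List of column names for the result; combine with header=None to supply the header yourself.
skiprows	Number of rows to skip at the start of the file, or a list of row numbers (starting from 0) to skip.
na_values	Values to treat as NA.
comment	Character that marks the remainder of a line as a comment to be dropped (a single character).
parse_dates	Try to parse data as dates; defaults to False. If True, pandas tries to parse the index. A list of column numbers or names can be given instead. If an element of that list is itself a list or tuple, those columns are combined and then parsed as one date.
keep_date_col	When several columns are joined to parse a date, keep the joined columns. Defaults to False.
converters	Dict mapping column numbers/names to functions. For example, {"age": f} applies the function f to every value in the age column.
dayfirst	When parsing ambiguous dates, treat them as international format (day first, e.g. 7/6/2012 is June 7, 2012). Defaults to False.
date_parser	Function used to parse dates (deprecated since pandas 2.0 in favor of date_format).
nrows	Number of rows to read.
iterator	Return a TextFileReader for reading the file piece by piece.
chunksize	Size of each file chunk, in rows (for iteration).
skipfooter	Number of rows to ignore, counted from the end of the file (older versions spelled it skip_footer).
verbose	Print parser output information, such as the number of missing values in non-numeric columns.
encoding	Text encoding of the file, e.g. "utf-8" or "gbk".
squeeze	If the parsed data has only one column, return a Series (removed in pandas 2.0; call .squeeze("columns") on the result instead).
thousands	Thousands separator, e.g. "," or ".".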

>>>import pandas as pd
>>>import os
>>>path = os.path.join("d:\\","sample.et")
>>>fd = pd.read_table(path)
>>>print(fd)
  Unnamed: 0  one  two  three  four
0          a    1    2      3     4
1          b    5    6      7     8
2          c    9   10     11    12
3          d   13   14     15    16
4          e   17   18     19    20
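
Several of the parameters above can be tried without a file on disk. The sketch below is only an illustration: the text in `raw` is hypothetical sample data, and `io.StringIO` stands in for a real path.

```python
import io
import pandas as pd

# Hypothetical file contents: no header row, one comment line, "?" marks missing data.
raw = "# exported sample\na,1,2\nb,?,4\nc,5,6\n"

df = pd.read_csv(
    io.StringIO(raw),          # any file-like object works in place of a path
    sep=",",
    header=None,               # the data has no header row...
    names=["key", "x", "y"],   # ...so supply the column names
    index_col="key",           # use the "key" column as the row index
    na_values=["?"],           # treat "?" as NA
    comment="#",               # ignore everything after "#"
)
print(df)
```

Because "?" is converted to NaN, the x column is read as floats while y stays an integer column.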

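III. Accessing, adding, deleting, modifying, and querying data

1. Accessing data

1.1 Accessing data in a Series

Use Series[index] to access a value, much like a key lookup in a dict.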

>>>import pandas as pd
>>>s1 = pd.Series(["a","b","c"],index=["one","two","three"])
>>>print(s1["two"])
b

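1.2 Accessing data in a DataFrame

1) The loc indexer

Selects by row (index) labels and column labels. loc is used with square brackets rather than called, and label slices include both endpoints.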

>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.arange(20).reshape((4,5))
>>>index = list('ABCD')
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=index,columns=columns)
>>>v = df.loc['A':'B','One':'Two']
>>>print(v)
  One  Two
A    0    1
B    5    6

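2) The iloc indexer

Selects by integer position (row number and column number), starting from 0. Position slices exclude the end position.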

>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.arange(20).reshape((4,5))
>>>index = list('ABCD')
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=index,columns=columns)
>>>v = df.iloc[1:3,2:4]
>>>print(v)
 Three  Four
B      7     8
C     12    13

3) The at accessor

Selects a single value by label; usage is similar to loc.

>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.arange(20).reshape((4,5))
>>>index = list('ABCD')
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=index,columns=columns)
>>>v = df.at['A','Two']
>>>print(v)
1

4) The iat accessor

Selects a single value by position; usage is similar to iloc.

>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.arange(20).reshape((4,5))
>>>index = list('ABCD')
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=index,columns=columns)
>>>v = df.iat[2,4]
>>>print(v)
14
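
The four accessors differ mainly in labels versus positions and in how slices treat their end point. A small sketch on the same 4x5 table used above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape((4, 5)),
                  index=list("ABCD"),
                  columns=["One", "Two", "Three", "Four", "Five"])

by_label = df.loc["A":"B", "One":"Two"]   # label slices include the end label
by_position = df.iloc[0:2, 0:2]           # position slices exclude the end position

print(by_label.equals(by_position))       # both select the same 2x2 block
print(df.at["A", "Two"], df.iat[0, 1])    # the same cell, by label and by position
```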

1.3 Getting information about a table

1) Dimensions

df.shape

>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.arange(20).reshape((4,5))
>>>index = list('ABCD')
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=index,columns=columns)
>>>print(df.shape)
(4, 5)

2) Basic information

df.info()

>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.arange(20).reshape((4,5))
>>>index = list('ABCD')
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=index,columns=columns)
>>>print(df.info())
<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, A to D
Data columns (total 5 columns):
#   Column  Non-Null Count  Dtype
---  ------  --------------  -----
0   One     4 non-null      int32
1   Two     4 non-null      int32
2   Three   4 non-null      int32
3   Four    4 non-null      int32
4   Five    4 non-null      int32
dtypes: int32(5)
memory usage: 112.0+ bytes
None

3) Data types

df.dtypes: the dtype of every column
df[column].dtype: the dtype of a single column

>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.arange(20).reshape((4,5))
>>>index = list('ABCD')
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=index,columns=columns)
>>>print(df.dtypes,'\n') # dtypes of all columns
>>>print(df['Two'].dtype) # dtype of one column
One      int32
Two      int32
Three    int32
Four     int32
Five     int32
dtype: object 

int32

4) Checking for missing values

df.isnull()

>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.arange(20).reshape((4,5))
>>>index = list('ABCD')
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=index,columns=columns)
>>>print(df.isnull())
    One    Two  Three   Four   Five
A  False  False  False  False  False
B  False  False  False  False  False
C  False  False  False  False  False
D  False  False  False  False  False

5) Getting the unique values of a column

df[column].unique()

>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.arange(20).reshape((4,5))
>>>index = list('ABCD')
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=index,columns=columns)
>>>print(df['Three'].unique())
[ 2  7 12 17]

6) Getting all values

df.values

>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.arange(20).reshape((4,5))
>>>index = list('ABCD')
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=index,columns=columns)
>>>print(df.values)
[[ 0  1  2  3  4]
[ 5  6  7  8  9]
[10 11 12 13 14]
[15 16 17 18 19]]

7) Getting the column names

df.columns

>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.arange(20).reshape((4,5))
>>>index = list('ABCD')
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=index,columns=columns)
>>>print(df.columns)
Index(['One', 'Two', 'Three', 'Four', 'Five'], dtype='object')

8) Viewing the first or last rows

df.head(): the first rows (5 by default)
df.tail(): the last rows (5 by default)

>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.arange(100).reshape((20,5))
>>>index = list(range(1,21))
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=index,columns=columns)
>>>print(df.head(),'\n')
>>>print(df.tail())
  One  Two  Three  Four  Five
1    0    1      2     3     4
2    5    6      7     8     9
3   10   11     12    13    14
4   15   16     17    18    19
5   20   21     22    23    24 

   One  Two  Three  Four  Five
16   75   76     77    78    79
17   80   81     82    83    84
18   85   86     87    88    89
19   90   91     92    93    94
20   95   96     97    98    99

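2. Adding data

2.1 Adding data to a Series

Older pandas versions used s1.append(s2); append() was deprecated in pandas 1.4 and removed in 2.0, so pd.concat() is used here.
Only Series objects can be combined this way.
The original index labels are kept; pass ignore_index=True to renumber them from 0.

>>>import pandas as pd
>>>s1 = pd.Series(["a","b","c"],index=["one","two","three"])
>>>s2 = pd.Series(["d"],index=["four"])
>>>s3 = pd.concat([s1,s2])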
>>>print(s3)
one      a
two      b
three    c
four     d
dtype: object
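
The effect of the index option is easiest to see side by side. This sketch combines the same two Series with pd.concat, which is what current pandas offers for this job, once keeping the labels and once renumbering them.

```python
import pandas as pd

s1 = pd.Series(["a", "b", "c"], index=["one", "two", "three"])
s2 = pd.Series(["d"], index=["four"])

kept = pd.concat([s1, s2])                            # original labels are kept
renumbered = pd.concat([s1, s2], ignore_index=True)   # labels replaced by 0..n-1

print(list(kept.index))
print(list(renumbered.index))
```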

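2.2 Adding data to a DataFrame

Rows are added with pd.concat(); DataFrame.append() did the same job but was removed in pandas 2.0.
The new data can come from a Series, a dict, an array, and so on, once it is wrapped as a one-row DataFrame.
A Series added as a row needs a name to serve as its row label, unless you set ignore_index=True.

>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.arange(20).reshape((4,5))
>>>index = list('ABCD')
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=index,columns=columns)
>>>s1 = pd.Series([20,21,22,23,24],index=columns)
>>>df = pd.concat([df,s1.to_frame().T],ignore_index=True)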
>>>print(df)
   One  Two  Three  Four  Five
0    0    1      2     3     4
1    5    6      7     8     9
2   10   11     12    13    14
3   15   16     17    18    19
4   20   21     22    23    24

3. Deleting data

3.1 Deleting data from a Series

Use drop(index) to remove the value at the given index label.

>>>import pandas as pd
>>>s1 = pd.Series(["a","b","c"],index=["one","two","three"])
>>>s1 = s1.drop("one")
>>>print(s1)
two      b
three    c
dtype: object

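3.2 Deleting data from a DataFrame

1) Deleting columns

Use drop(labels, axis=1), or equivalently drop(columns=labels).
labels is the column label (or list of labels) to remove.
axis=1 means the operation applies to columns.
Setting inplace=True modifies the original table instead of returning a new one.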

>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.arange(20).reshape((4,5))
>>>index = list('ABCD')
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=index,columns=columns)
>>>df = df.drop('Two',axis=1)
>>>print(df)
  One  Three  Four  Five
A    0      2     3     4
B    5      7     8     9
C   10     12    13    14
D   15     17    18    19

2) Deleting rows

Use drop(index) to remove rows by index label.
axis defaults to 0 (rows).

>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.arange(20).reshape((4,5))
>>>index = list('ABCD')
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=index,columns=columns)
>>>df.drop(['A','B'],inplace=True)
>>>print(df)
  One  Two  Three  Four  Five
C   10   11     12    13    14
D   15   16     17    18    19
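
The difference between the default behavior and inplace=True is worth checking once by hand. In this sketch (variable names are illustrative), the first drop returns a new table and leaves df alone, while the second changes df itself.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape((4, 5)),
                  index=list("ABCD"),
                  columns=["One", "Two", "Three", "Four", "Five"])

without_two = df.drop(columns="Two")   # returns a new DataFrame
still_there = "Two" in df.columns      # the original is unchanged

df.drop(["A", "B"], inplace=True)      # modifies df itself and returns None

print(still_there, list(without_two.columns), list(df.index))
```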

4. Modifying data

4.1 Renaming

rename(index=..., columns=...) changes row or column labels.
Each argument is a dict mapping old names to new names.

>>>import pandas as pd
>>>s1 = pd.Series(["a","b","c"],index=["one","two","three"])
>>>s1.rename({"one":1},inplace=True)
>>>print(s1)
1        a
two      b
three    c
dtype: object

>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.arange(20).reshape((4,5))
>>>index = list('ABCD')
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=index,columns=columns)
>>>df.rename(columns={'One':'A','Two':'B','Three':'C','Four':'D','Five':'E'},index={'A':'One','B':'Two','C':'Three','D':'Four'},inplace=True)
>>>print(df)
        A   B   C   D   E
One     0   1   2   3   4
Two     5   6   7   8   9
Three  10  11  12  13  14
Four   15  16  17  18  19

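4.2 Modifying values

After selecting data, you can assign new values to it directly.
Assigning a string into an integer column, as in the DataFrame example, changes that column's dtype to object; recent pandas versions warn about this implicit conversion.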

>>>import pandas as pd
>>>s1 = pd.Series(["a","b","c"],index=["one","two","three"])
>>>s1["two"] = 2
>>>print(s1)
one      a
two      2
three    c
dtype: object

>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.arange(20).reshape((4,5))
>>>index = list('ABCD')
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=index,columns=columns)
>>>df.loc['A':'B','One':'Two'] = "new value"
>>>print(df)
         One        Two  Three  Four  Five
A  new value  new value      2     3     4
B  new value  new value      7     8     9
C         10         11     12    13    14
D         15         16     17    18    19

5. Querying data

5.1 Dict-style lookup

>>>import pandas as pd
>>>s1 = pd.Series(["a","b","c"],index=["one","two","three"])
>>>print(s1["two"])
b

>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.arange(20).reshape((4,5))
>>>index = list('ABCD')
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=index,columns=columns)
>>>print(df['Three']) # select a column
>>>print(f"{'-'*20}")
>>>print(df[3:]) # select rows with a position slice
A     2
B     7
C    12
D    17
Name: Three, dtype: int32
--------------------
   One  Two  Three  Four  Five
D   15   16     17    18    19

5.2 Lookup by location

>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.arange(20).reshape((4,5))
>>>index = list('ABCD')
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=index,columns=columns)
>>>print(df.loc['A':'B','One':'Two'])
   One  Two
A    0    1
B    5    6

IV. Data cleaning

1. Filling missing values with a given value

df.fillna(value=0)

>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.arange(20).reshape((4,5))
>>>index = list('ABCD')
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=index,columns=columns)
>>>df.loc['A':'B','One':'Two'] = None
>>>df.fillna(value=999,inplace=True)
>>>print(df)
     One    Two  Three  Four  Five
A  999.0  999.0      2     3     4
B  999.0  999.0      7     8     9
C   10.0   11.0     12    13    14
D   15.0   16.0     17    18    19
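
Before filling, it often helps to see how many values are missing, and fillna also accepts a dict so each column gets its own fill value. A minimal sketch with made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"One": [np.nan, 5.0, 10.0],
                   "Two": [1.0, np.nan, 11.0]},
                  index=list("ABC"))

missing_per_column = df.isnull().sum()             # number of NaN values in each column
filled = df.fillna(value={"One": 0, "Two": -1})    # a different fill value per column

print(missing_per_column)
print(filled)
```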

2. Stripping whitespace from strings

>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.arange(20).reshape((4,5))
>>>index = list('ABCD')
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=index,columns=columns)
>>>df.loc['A':'D','One':'Two'] = " with space "
>>>df['One'] = df['One'].map(str.strip) # strip leading and trailing whitespace, then store the result
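
Stripping returns a new column rather than changing the old one, so the result has to be assigned back. The sketch below uses the vectorized .str.strip() on a small hypothetical column so the effect is visible.

```python
import pandas as pd

df = pd.DataFrame({"One": [" with space ", "tab\t", "clean"]})

# .str.strip() removes leading and trailing whitespace from every string in the column.
df["One"] = df["One"].str.strip()

print(df["One"].tolist())
```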

3. Changing case

>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.array(["CONTENT"]*20).reshape((4,5))
>>>index = list('ABCD')
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=index,columns=columns)
>>>df['Two'] = df['Two'].str.lower()
>>>print(df)
       One      Two    Three     Four     Five
A  CONTENT  content  CONTENT  CONTENT  CONTENT
B  CONTENT  content  CONTENT  CONTENT  CONTENT
C  CONTENT  content  CONTENT  CONTENT  CONTENT
D  CONTENT  content  CONTENT  CONTENT  CONTENT

4. Changing the data type

Use astype(type) to convert the data type.

>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.arange(20).reshape((4,5))
>>>index = list('ABCD')
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=index,columns=columns)
>>>df = df.astype('float')
>>>print(df)
    One   Two  Three  Four  Five
A   0.0   1.0    2.0   3.0   4.0
B   5.0   6.0    7.0   8.0   9.0
C  10.0  11.0   12.0  13.0  14.0
D  15.0  16.0   17.0  18.0  19.0

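5. Removing duplicates

Use drop_duplicates(subset=None, keep='first', inplace=False) to remove duplicate rows.
keep='first' (the default) keeps the first occurrence; keep='last' keeps the last one.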

>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.random.randint(5, size=(4,5))
>>>index = list('ABCD')
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=index,columns=columns)
>>>print(df)
   One  Two  Three  Four  Five
A    1    0      2     1     4
B    3    2      4     0     4
C    1    4      2     4     0
D    3    2      3     1     1 
>>>df.drop_duplicates(['One','Two'],keep='first',inplace=True) # keep only the first row for each repeated (One, Two) pair
>>>print(df)
   One  Two  Three  Four  Five
A    1    0      2     1     4
B    3    2      4     0     4
C    1    4      2     4     0

6. Replacing values

Use replace() to replace values in the table.

>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.random.randint(5, size=(4,5))
>>>index = list('ABCD')
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=index,columns=columns)
>>>df.replace(2,'B',inplace=True) # replace every 2 with 'B'
>>>print(df)
  One Two Three Four  Five
A   1   0     0    0     1
B   B   B     3    B     1
C   3   1     B    B     4
D   0   3     B    3     4

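V. Data preprocessing

1. Combining tables

1) pd.merge(df1, df2, how="inner")

The how parameter sets the join type: "inner", "outer", "left", or "right"; the default is "inner".
When on is not given, the tables are joined on the columns they have in common.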

>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.arange(20).reshape((4,5))
>>>n2 = np.arange(20,40).reshape((4,5))
>>>index = list('ABCD')
>>>columns = ['One','Two','Three','Four','Five']
>>>df1 = pd.DataFrame(data=n1,index=index,columns=columns)
>>>df2 = pd.DataFrame(data=n2,index=index,columns=columns)
>>>df3 = pd.merge(df1,df2,how='outer')
>>>print(df3)
  One  Two  Three  Four  Five
0    0    1      2     3     4
1    5    6      7     8     9
2   10   11     12    13    14
3   15   16     17    18    19
4   20   21     22    23    24
5   25   26     27    28    29
6   30   31     32    33    34
7   35   36     37    38    39
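
The join types are easier to tell apart when the two tables share only some keys. In this sketch the tables, the key column, and its values are all hypothetical.

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "y": [20, 30, 40]})

inner = pd.merge(left, right, on="key", how="inner")      # keys present in both tables
outer = pd.merge(left, right, on="key", how="outer")      # keys present in either table
left_join = pd.merge(left, right, on="key", how="left")   # every key of the left table

print(inner)
print(outer)
print(left_join)
```

Rows with no partner in the other table get NaN in the columns that table contributes.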

2) pd.concat([df1, df2]) (formerly DataFrame.append(df))

Stacks two DataFrames vertically. DataFrame.append() was removed in pandas 2.0, so pd.concat() is used here. The original row labels are kept unless ignore_index=True is passed.

>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.arange(20).reshape((4,5))
>>>n2 = np.arange(20,40).reshape((4,5))
>>>index = list('ABCD')
>>>columns = ['One','Two','Three','Four','Five']
>>>df1 = pd.DataFrame(data=n1,index=index,columns=columns)
>>>df2 = pd.DataFrame(data=n2,index=index,columns=columns)
>>>df3 = pd.concat([df1,df2])
>>>print(df3)
   One  Two  Three  Four  Five
A    0    1      2     3     4
B    5    6      7     8     9
C   10   11     12    13    14
D   15   16     17    18    19
A   20   21     22    23    24
B   25   26     27    28    29
C   30   31     32    33    34
D   35   36     37    38    39

3) DataFrame.join(df)

Joins two DataFrames side by side, aligning rows on the index.

>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.arange(20).reshape((4,5))
>>>n2 = np.arange(20,40).reshape((4,5))
>>>index = list('ABCD')
>>>columns1 = ['One','Two','Three','Four','Five']
>>>columns2 = ['Six','Seven','Eight','Nine','Ten']
>>>df1 = pd.DataFrame(data=n1,index=index,columns=columns1)
>>>df2 = pd.DataFrame(data=n2,index=index,columns=columns2)
>>>df3 = df1.join(df2)
>>>print(df3)
  One  Two  Three  Four  Five  Six  Seven  Eight  Nine  Ten
A    0    1      2     3     4   20     21     22    23   24
B    5    6      7     8     9   25     26     27    28   29
C   10   11     12    13    14   30     31     32    33   34
D   15   16     17    18    19   35     36     37    38   39

4) pd.concat(objs, axis=0, join='outer')

Concatenates several DataFrames.

>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.arange(10).reshape((2,5))
>>>n2 = np.arange(10,20).reshape((2,5))
>>>n3 = np.arange(20,30).reshape((2,5))
>>>columns = ['One','Two','Three','Four','Five']
>>>df1 = pd.DataFrame(data=n1,index=["A","B"],columns=columns)
>>>df2 = pd.DataFrame(data=n2,index=["C","D"],columns=columns)
>>>df3 = pd.DataFrame(data=n3,index=["E","F"],columns=columns)
>>>df4 = pd.concat([df1,df2,df3])
>>>print(df4)
 One  Two  Three  Four  Five
A    0    1      2     3     4
B    5    6      7     8     9
C   10   11     12    13    14
D   15   16     17    18    19
E   20   21     22    23    24
F   25   26     27    28    29
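
The axis and join parameters in the signature above control the direction and how non-matching labels are treated. A small sketch with two made-up tables whose row labels only partly overlap:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(4).reshape((2, 2)),
                   index=["A", "B"], columns=["One", "Two"])
df2 = pd.DataFrame(np.arange(4, 8).reshape((2, 2)),
                   index=["B", "C"], columns=["Three", "Four"])

side_by_side = pd.concat([df1, df2], axis=1)                # outer: union of row labels
only_shared = pd.concat([df1, df2], axis=1, join="inner")   # inner: shared row labels only

print(side_by_side)
print(only_shared)
```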

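2. Setting the index

Use set_index('id') to turn a column into the row index; pass a list of columns to build a composite (multi-level) index.
Use reset_index('id') to turn an index level back into a column.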

>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.arange(10).reshape((2,5))
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=["A","B"],columns=columns)
>>>df = df.set_index('Four')
>>>print(df)
      One  Two  Three  Five
Four                       
3       0    1      2     4
8       5    6      7     9

>>>df = df.reset_index('Four')
>>>print(df)
   Four  One  Two  Three  Five
0     3    0    1      2     4
1     8    5    6      7     9
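
For a composite index, set_index takes a list of columns. The sketch below uses hypothetical year/city/value columns to build a two-level index, look up one row by its pair of keys, and restore the columns.

```python
import pandas as pd

df = pd.DataFrame({"year": [2020, 2020, 2021],
                   "city": ["x", "y", "x"],
                   "value": [1, 2, 3]})

indexed = df.set_index(["year", "city"])    # two columns become a two-level index
levels = indexed.index.nlevels
picked = indexed.loc[(2020, "y"), "value"]  # look up a row by its (year, city) pair

restored = indexed.reset_index()            # both levels become columns again

print(levels, picked, list(restored.columns))
```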

3. Sorting

1) Sorting by value

DataFrame.sort_values(by)

>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.random.randint(5, size=(4,5))
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=["A","B","C","D"],columns=columns)
>>>df.sort_values(by=["One"],inplace=True)
>>>print(df)
  One  Two  Three  Four  Five
D    2    0      3     3     3
A    3    0      2     3     2
C    3    2      3     2     1
B    4    2      2     3     3
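
sort_values also accepts several columns and a direction for each. A deterministic sketch with made-up numbers:

```python
import pandas as pd

df = pd.DataFrame({"One": [3, 4, 3, 2],
                   "Two": [0, 2, 2, 0]},
                  index=list("ABCD"))

# Sort by One in descending order, breaking ties by Two in ascending order.
ordered = df.sort_values(by=["One", "Two"], ascending=[False, True])

print(ordered)
```

Without inplace=True the original df keeps its order and a sorted copy is returned.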

2) Sorting by index

DataFrame.sort_index()

>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.arange(20).reshape((4,5))
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=[4,1,3,2],columns=columns)
>>>df.sort_index(inplace=True)
>>>print(df)
  One  Two  Three  Four  Five
1    5    6      7     8     9
2   15   16     17    18    19
3   10   11     12    13    14
4    0    1      2     3     4

4. Group labeling

1) Labeling by value

>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.arange(20).reshape((4,5))
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=list("ABCD"),columns=columns)
>>>df['group'] = np.where(df['Three'] > 10,'high','low')
>>>print(df)
  One  Two  Three  Four  Five group
A    0    1      2     3     4   low
B    5    6      7     8     9   low
C   10   11     12    13    14  high
D   15   16     17    18    19  high

2) Labeling by a compound condition

>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.random.randint(5, size=(4,5))
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=list("ABCD"),columns=columns)
>>>df.loc[(df['Three']==4)&(df['Four']<3),'sign']= "target"
>>>print(df)
  One  Two  Three  Four  Five    sign
A    0    1      2     0     3     NaN
B    2    4      4     0     1  target
C    4    1      0     2     4     NaN
D    3    1      4     0     0  target

3) Grouping a field into a new table

>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.random.randint(5, size=(4,5))
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=list("ABCD"),columns=columns)
>>>d1 = pd.DataFrame([(x,'high' if x > 3 else 'low') for x in df['Two']],index=df.index,columns=['value','type'])
>>>print(d1)
  value  type
A      1   low
B      4  high
C      3   low
D      4  high

VI. Filtering data

1. "And" filtering

>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.random.randint(5, size=(4,5))
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=list("ABCD"),columns=columns)
>>>df1 = df.loc[(df['One']>3)&(df['Four']<3),columns]
>>>print(df1)
   One  Two  Three  Four  Five
B    4    2      2     2     4

2. "Or" filtering

>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.random.randint(5, size=(4,5))
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=list("ABCD"),columns=columns)
>>>df1 = df.loc[(df['One']>3)|(df['Four']<3),columns]
>>>print(df1)
   One  Two  Three  Four  Five
B    1    4      4     0     2
C    3    1      4     2     2

3. "Not" filtering

>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.random.randint(5, size=(4,5))
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=list("ABCD"),columns=columns)
>>>df1 = df.loc[(df['One']!=3),columns]
>>>print(df1)
   One  Two  Three  Four  Five
A    1    2      1     2     0
B    4    2      4     0     2

4. DataFrame.query(expr, inplace=False, **kwargs)

query() filters rows with a boolean expression over the columns.
expr is the query string; prefix a name with '@' to refer to a Python variable.

>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.random.randint(5, size=(4,5))
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=list("ABCD"),columns=columns)
>>>num = 3
>>>df.query('One<@num and Two>@num',inplace=True)
>>>print(df)
   One  Two  Three  Four  Five
A    1    4      0     0     3
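
A query string is shorthand for the boolean masks used in the earlier filtering examples. This sketch, with fixed made-up data, checks that both forms select the same rows.

```python
import pandas as pd

df = pd.DataFrame({"One": [1, 4, 2, 0],
                   "Two": [4, 4, 1, 5]},
                  index=list("ABCD"))
num = 3

by_query = df.query("One < @num and Two > @num")          # '@num' reads the Python variable
by_mask = df.loc[(df["One"] < num) & (df["Two"] > num)]   # the equivalent boolean mask

print(by_query.equals(by_mask))
print(list(by_query.index))
```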

VII. Statistics

1. Counting values

Use count(axis) to count the non-null values in each column (axis=0) or each row (axis=1).

>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.random.randint(5, size=(4,5))
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=list("ABCD"),columns=columns)
>>>print(df,'\n')
>>>print(df.count(1),'\n')
>>>print(df.count(0))
   One  Two  Three  Four  Five
A    4    2      4     1     0
B    1    4      0     1     3
C    0    3      4     0     1
D    4    3      3     4     0 

A    5
B    5
C    5
D    5
dtype: int64 

One      4
Two      4
Three    4
Four     4
Five     4
dtype: int64
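
In the example above every count equals the table's length because nothing is missing. The sketch below adds a few NaN values (made-up data) to show that count() skips them, which is where it differs from len().

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"One": [1.0, np.nan, 3.0],
                   "Two": [np.nan, np.nan, 6.0]},
                  index=list("ABC"))

per_column = df.count(axis=0)   # non-null values in each column
per_row = df.count(axis=1)      # non-null values in each row

print(per_column.tolist(), per_row.tolist(), len(df))
```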

2. Aggregation

Use DataFrame.agg() to aggregate column by column.

>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.random.randint(5, size=(4,5))
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=list("ABCD"),columns=columns)
>>>np2 = df.agg([len,np.sum, np.mean]) # length, sum, and mean of each column
>>>np2
    One  Two  Three  Four  Five
len   4.0  4.0    4.0   4.0  4.00
sum   6.0  6.0   10.0   8.0  9.00
mean  1.5  1.5    2.5   2.0  2.25

3. Simple random sampling

Use DataFrame.sample(n) to draw a simple random sample of n rows.

>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.random.randint(5, size=(4,5))
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=list("ABCD"),columns=columns)
>>>df.sample(n=2)
   One  Two  Three  Four  Five
D    2    2      4     0     1
A    1    1      1     1     2

4. Weighted sampling

>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.random.randint(5, size=(4,5))
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=list("ABCD"),columns=columns)
>>>weights = [0,1,0,1] # sampling weights; rows with weight 0 are never drawn
>>>df.sample(n=2,weights=weights)
   One  Two  Three  Four  Five
B    2    2      4     3     1
D    0    2      1     0     0

5. Sampling with replacement

>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.random.randint(5, size=(4,5))
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=list("ABCD"),columns=columns)
>>>df.sample(n=2,replace=True) # replace=True puts each drawn row back, so it can be drawn again
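
Two practical details of sample() are sketched here: random_state makes a draw repeatable, and asking for more rows than the table has is only possible with replace=True. The seed value 0 is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape((4, 5)),
                  index=list("ABCD"),
                  columns=["One", "Two", "Three", "Four", "Five"])

first = df.sample(n=2, random_state=0)
again = df.sample(n=2, random_state=0)                  # same seed, same rows
bigger = df.sample(n=6, replace=True, random_state=0)   # more rows than df has

print(first.equals(again))
print(len(bigger), bigger.index.is_unique)
```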

6. Descriptive statistics

DataFrame.describe() returns summary statistics for the data.

>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.random.randint(5, size=(4,5))
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=list("ABCD"),columns=columns)
>>>df.describe().round(2).T
       count  mean   std  min   25%  50%   75%  max
One      4.0  1.75  1.50  1.0  1.00  1.0  1.75  4.0
Two      4.0  2.50  1.73  0.0  2.25  3.0  3.25  4.0
Three    4.0  2.50  1.29  1.0  1.75  2.5  3.25  4.0
Four     4.0  3.50  1.00  2.0  3.50  4.0  4.00  4.0
Five     4.0  3.25  0.50  3.0  3.00  3.0  3.25  4.0

7. Standard deviation

>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.random.randint(5, size=(4,5))
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=list("ABCD"),columns=columns)
>>>df.std() # compute the standard deviation of each column
One      1.825742
Two      1.258306
Three    1.414214
Four     0.957427
Five     1.414214
dtype: float64

8. Covariance

>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.random.randint(5, size=(4,5))
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=list("ABCD"),columns=columns)
>>>df.cov() # compute the covariance matrix
            One       Two     Three      Four      Five
One    3.583333  3.083333 -1.666667 -1.833333 -2.916667
Two    3.083333  2.916667 -2.333333 -1.166667 -3.083333
Three -1.666667 -2.333333  4.000000 -0.666667  3.000000
Four  -1.833333 -1.166667 -0.666667  1.666667  0.833333
Five  -2.916667 -3.083333  3.000000  0.833333  4.250000

9. Correlation analysis

>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.random.randint(5, size=(4,5))
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=list("ABCD"),columns=columns)
>>>df.corr()
            One       Two     Three      Four      Five
One    1.000000 -0.885615 -0.342997 -0.792118 -0.980196
Two   -0.885615  1.000000  0.258199  0.670820  0.948683
Three -0.342997  0.258199  1.000000 -0.288675  0.408248
Four  -0.792118  0.670820 -0.288675  1.000000  0.707107
Five  -0.980196  0.948683  0.408248  0.707107  1.000000
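
The last three statistics are related: the (Pearson) correlation that corr() computes by default is the covariance divided by the product of the two columns' standard deviations. The following sketch checks this on a small made-up table; the data and variable names are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"One": [1, 2, 3, 4],
                   "Two": [2, 1, 4, 3],
                   "Three": [4, 3, 2, 0]})

std = df.std()

# Divide each covariance by the product of the two columns' standard deviations.
manual_corr = df.cov() / np.outer(std, std)

print(np.allclose(manual_corr, df.corr()))
```

The diagonal of the covariance matrix holds each column's variance, which is why every column has a correlation of 1 with itself.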