大师兄的Python机器学习笔记:实现评估模型
大师兄的Python机器学习笔记:特征提取
一、关于Pandas
1. Pandas和Numpy
- Pandas基于NumPy数组,使数据预处理、清洗和分析工作更快更简单。
- Pandas专为处理表格和混杂数据设计,可以理解为Python中的Excel。
- NumPy更适合处理统一的数值数组数据。
- Pandas提供了两种类型的数据结构: DataFrame和Series。
import pandas as pd
2. DataFrame结构
- DataFrame是一个表格型的数据类型,可以把DataFrame理解为Excel的表。
- DataFrame是由Series组成的字典。
>>>import pandas as pd
>>>data ={"name":["pp","qq","doudou","douding","xiaobudian"],
"age":[10,1.5,0,5,7],
"gender":["m","m","m","f","f"]
}
>>>df = pd.DataFrame(data)
>>>print(df)
name age gender
0 pp 10.0 m
1 qq 1.5 m
2 doudou 0.0 m
3 douding 5.0 f
4 xiaobudian 7.0 f
3. Series结构
- Series是一种类似于一维数组的对象,它由一组数据以及一组与之相关的数据标签组成,即index和values两部分。
- 可以把Series理解为Excel表中的一列。
>>>import pandas as pd
>>>import numpy as np
>>>random_num = np.random.rand(10)
>>>s = pd.Series(random_num)
>>>print(s)
0 0.241130
1 0.911937
2 0.276555
3 0.570505
4 0.915634
5 0.214568
6 0.179911
7 0.113886
8 0.449848
9 0.025474
dtype: float64
二、创建表格
1. 创建Series
1.1 使用列表创建
>>>import pandas as pd
>>>s = pd.Series(["a","b","c","d","e"])
>>>print(s)
0 a
1 b
2 c
3 d
4 e
dtype: object
1.2 使用Ndarray创建
>>>import pandas as pd
>>>import numpy as np
>>>s = pd.Series(np.arange(5))
>>>print(s)
0 0
1 1
2 2
3 3
4 4
dtype: int32
1.3 使用字典创建
>>>import pandas as pd
>>>import numpy as np
>>>s = pd.Series({'a':1,'b':2,'c':3,'d':4,'e':5})
>>>print(s)
a 1
b 2
c 3
d 4
e 5
dtype: int64
1.4 使用列表生成索引
>>>import pandas as pd
>>>import numpy as np
>>>s = pd.Series(np.arange(5),index=['e','d','c','b','a'])
>>>print(s)
e 0
d 1
c 2
b 3
a 4
dtype: int32
2. 创建DataFrame
1.1 使用Ndarray创建
>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.arange(20).reshape((4,5))
>>>index = list('ABCD')
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=index,columns=columns)
>>>print(df)
One Two Three Four Five
A 0 1 2 3 4
B 5 6 7 8 9
C 10 11 12 13 14
D 15 16 17 18 19
1.2 使用Series创建
>>>import pandas as pd
>>>index=list("abcde")
>>>s = {'one':pd.Series(range(5),index=index),
>>> 'two':pd.Series(range(4,9),index=index),
>>> 'three':pd.Series(range(8,13),index=index),
>>> 'four':pd.Series(range(12,17),index=index),}
>>>fd = pd.DataFrame(data=s)
>>>print(fd)
one two three four
a 0 4 8 12
b 1 5 9 13
c 2 6 10 14
d 3 7 11 15
e 4 8 12 16
1.3 使用字典或Series组成的列表创建
>>>import pandas as pd
>>>l1 = [{'one':1,'two':2,'tree':3},
>>> {'one':5,'two':6},
>>> {'three':7,'four':8},
>>> {'four':4},]
>>>fd = pd.DataFrame(data=l1)
>>>print(fd)
one two tree three four
0 1.0 2.0 3.0 NaN NaN
1 5.0 6.0 NaN NaN NaN
2 NaN NaN NaN 7.0 8.0
3 NaN NaN NaN NaN 4.0
1.4 使用字典组成的字典创建
>>>import pandas as pd
>>>d1 = {'one':{'a':1,'b':2,'c':3,'d':4},
>>> 'two':{'a':5,'b':7,'c':6,'d':8},
>>> 'three':{'a':11,'c':12},
>>> 'four':{'b':13,'c':14},}
>>>fd = pd.DataFrame(data=d1)
>>>print(fd)
one two three four
a 1 5 11.0 NaN
b 2 7 NaN 13.0
c 3 6 12.0 14.0
d 4 8 NaN NaN
3. 从文件读取表格
3.1 相关函数
函数 | 说明 |
---|---|
read_csv() | 从文件加载数据,默认分隔符为逗号。 |
read_table() | 从文件加载数据,默认分隔符为制表符。 |
read_fwf() | 读取定宽列格式数据,无分隔符。 |
read_clipboard | 读取剪切板中的数据。 |
read_excel | 从XLS或XLSX文件中加载数据。 |
read_hdf | 从HDF5文件加载数据。 |
read_html | 从HTML文档加载表格。 |
read_json | 从JSON字符串加载数据。 |
read_msgpack | 二进制格式编码的pandas数据。 |
read_pickle | 从pickle对象读取数据。 |
read_sas | 读取存储于SAS系统自定义存储格式的SAS数据集 |
read_sql | 使用SQLAlchemy读取SQL查询结果 |
read_stata | 读取Stata文件格式的数据。 |
read_feather | 读取Feather二进制文件格式。 |
3.2 常用参数
参数 | 说明 |
---|---|
path | 表示文件系统位置、URL、文件型对象的字符串。 |
sep或delimiter | 用于对行中各字段进行拆分的字符序列或正则表达式。 |
header | 用作列名的行号。 默认为0(第一行),如果文件没有标题行就将header参数设置为None。 |
index_col | 用作行索引的列编号或列名。 可以是单个名称/数字或有多个名称/数字组成的列表(层次化索引)。 |
names | 用于结果的列名列表,结合header=None,可以通过names来设置标题行。 |
skiprows | 需要忽略的行数(从0开始),设置的行数将不会进行读取。 |
na_values | 设置需要将值替换成NA的值。 |
comment | 用于注释信息从行尾拆分出去的字符(一个或多个)。 |
parse_dates | 尝试将数据解析为日期,默认为False。 如果为True,则尝试解析所有列。 除此之外,参数可以指定需要解析的一组列号或列名。 如果列表的元素为列表或元组,就会将多个列组合到一起再进行日期解析工作。 |
keep_date_col | 如果连接多列解析日期,则保持参与连接的列。 默认为False。 |
converters | 由列号/列名跟函数之间的映射关系组成的字典。 如,{"age:",f}会对列索引为age列的所有值应用函数f。 |
dayfirst | 当解析有歧义的日期时,将其看做国际格式默认为False。 |
date_parser | 用于解析日期的函数。 |
nrows | 需要读取的行数。 |
iterator | 返回一个TextParser以便逐块读取文件。 |
chunksize | 文件块的大小(用于迭代)。 |
skip_footer | 需要忽略的行数(从文件末尾开始计算)。 |
verbose | 打印各种解析器输出信息,如“非数值列中的缺失值的数量”等。 |
encoding | 用于unicode的文本编码格式。例如,"utf-8"或"gbk"等文本的编码格式。 |
squeeze | 如果数据经过解析之后只有一列的时候,返回Series。 |
thousands | 千分位分隔符,如","或"."。 |
>>>import pandas as pd
>>>import os
>>>path = os.path.join("d:\\","sample.et")
>>>fd = pd.read_table(path)
>>>print(fd)
Unnamed: 0 one two three four
0 a 1 2 3 4
1 b 5 6 7 8
2 c 9 10 11 12
3 d 13 14 15 16
4 e 17 18 19 20
三、表格的访问和增删改查
1. 访问数据
1.1 Series访问数据
- 使用
Series[index]
的方式访问数据,类似字典的键值对。
>>>import pandas as pd
>>>s1 = pd.Series(["a","b","c"],index=["one","two","three"])
>>>print(s1["two"])
b
1.2 Dataframe访问数据
1) loc()函数
- 使用column名和index名进行定位
>>>import pandas as pd >>>import numpy as np >>>n1 = np.arange(20).reshape((4,5)) >>>index = list('ABCD') >>>columns = ['One','Two','Three','Four','Five'] >>>df = pd.DataFrame(data=n1,index=index,columns=columns) >>>v = df.loc['A':'B','One':'Two'] >>>print(v) One Two A 0 1 B 5 6
2) iloc()函数
- 绝对位置索引,使用行数和列数定位,起始索引为0。
>>>import pandas as pd >>>import numpy as np >>>n1 = np.arange(20).reshape((4,5)) >>>index = list('ABCD') >>>columns = ['One','Two','Three','Four','Five'] >>>df = pd.DataFrame(data=n1,index=index,columns=columns) >>>v = df.iloc[1:3,2:4] >>>print(v) Three Four B 7 8 C 12 13
3) at()函数
- 用来选择单个值的,用法类似于loc。
>>>import pandas as pd >>>import numpy as np >>>n1 = np.arange(20).reshape((4,5)) >>>index = list('ABCD') >>>columns = ['One','Two','Three','Four','Five'] >>>df = pd.DataFrame(data=n1,index=index,columns=columns) >>>v = df.at['A','Two'] >>>print(v) 1
4) iat()函数
- 用来选择单个值的,用法类似于iloc。
>>>import pandas as pd >>>import numpy as np >>>n1 = np.arange(20).reshape((4,5)) >>>index = list('ABCD') >>>columns = ['One','Two','Three','Four','Five'] >>>df = pd.DataFrame(data=n1,index=index,columns=columns) >>>v = df.iat[2,4] >>>print(v) 14
1.3 获得数据表信息
1) 维度
df.shape
>>>import pandas as pd >>>import numpy as np >>>n1 = np.arange(20).reshape((4,5)) >>>index = list('ABCD') >>>columns = ['One','Two','Three','Four','Five'] >>>df = pd.DataFrame(data=n1,index=index,columns=columns) >>>print(df.shape) (4, 5)
2) 基本信息
df.info()
>>>import pandas as pd >>>import numpy as np >>>n1 = np.arange(20).reshape((4,5)) >>>index = list('ABCD') >>>columns = ['One','Two','Three','Four','Five'] >>>df = pd.DataFrame(data=n1,index=index,columns=columns) >>>print(df.info()) <class 'pandas.core.frame.DataFrame'> Index: 4 entries, A to D Data columns (total 5 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 One 4 non-null int32 1 Two 4 non-null int32 2 Three 4 non-null int32 3 Four 4 non-null int32 4 Five 4 non-null int32 dtypes: int32(5) memory usage: 112.0+ bytes None
3) 数据格式
df.dtypes
所有格式df.dtype
某一列格式>>>import pandas as pd >>>import numpy as np >>>n1 = np.arange(20).reshape((4,5)) >>>index = list('ABCD') >>>columns = ['One','Two','Three','Four','Five'] >>>df = pd.DataFrame(data=n1,index=index,columns=columns) >>>print(df.dtypes,'\n') # 所有格式 >>>print(df['Two'].dtype) # 某一列格式 One int32 Two int32 Three int32 Four int32 Five int32 dtype: object int32
4) 判断是否为空
df.isnull()
>>>import pandas as pd >>>import numpy as np >>>n1 = np.arange(20).reshape((4,5)) >>>index = list('ABCD') >>>columns = ['One','Two','Three','Four','Five'] >>>df = pd.DataFrame(data=n1,index=index,columns=columns) >>>print(df.isnull()) One Two Three Four Five A False False False False False B False False False False False C False False False False False D False False False False False
5) 获得某一列的所有唯一值
df[index].unique()
>>>import pandas as pd >>>import numpy as np >>>n1 = np.arange(20).reshape((4,5)) >>>index = list('ABCD') >>>columns = ['One','Two','Three','Four','Five'] >>>df = pd.DataFrame(data=n1,index=index,columns=columns) >>>print(df['Three'].unique()) [ 2 7 12 17]
6) 获得所有值
df.values
>>>import pandas as pd >>>import numpy as np >>>n1 = np.arange(20).reshape((4,5)) >>>index = list('ABCD') >>>columns = ['One','Two','Three','Four','Five'] >>>df = pd.DataFrame(data=n1,index=index,columns=columns) >>>print(df.values) [[ 0 1 2 3 4] [ 5 6 7 8 9] [10 11 12 13 14] [15 16 17 18 19]]
7) 获得列名
df.columns
>>>import pandas as pd >>>import numpy as np >>>n1 = np.arange(20).reshape((4,5)) >>>index = list('ABCD') >>>columns = ['One','Two','Three','Four','Five'] >>>df = pd.DataFrame(data=n1,index=index,columns=columns) >>>print(df.columns) Index(['One', 'Two', 'Three', 'Four', 'Five'], dtype='object')
8) 查看头部数据/尾部数据
df.head()
头部数据df.tail()
尾部数据>>>import pandas as pd >>>import numpy as np >>>n1 = np.arange(100).reshape((20,5)) >>>index = list(range(1,21)) >>>columns = ['One','Two','Three','Four','Five'] >>>df = pd.DataFrame(data=n1,index=index,columns=columns) >>>print(df.head(),'\n') >>>print(df.tail()) One Two Three Four Five 1 0 1 2 3 4 2 5 6 7 8 9 3 10 11 12 13 14 4 15 16 17 18 19 5 20 21 22 23 24 One Two Three Four Five 16 75 76 77 78 79 17 80 81 82 83 84 18 85 86 87 88 89 19 90 91 92 93 94 20 95 96 97 98 99
2. 增加数据
2.1 Series增加数据
- 使用
append()
函数增加数据。 - 只可以增加Series元素。
- 如果不指定Index,则默认从0开始计算。
>>>import pandas as pd
>>>s1 = pd.Series(["a","b","c"],index=["one","two","three"])
>>>s2 = pd.Series(["d"],index=["four"])
>>>s3 = s1.append(s2)
>>>print(s3)
one a
two b
three c
four d
dtype: object
2.2 DataFrame增加数据
- 使用
append()
函数增加数据。 - 数据可以是Series、字典、数组等。
- 需要为添加的Series取个名字,或设置
ignore_index=True
。
>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.arange(20).reshape((4,5))
>>>index = list('ABCD')
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=index,columns=columns)
>>>s1 = pd.Series([20,21,22,23,24],index=columns)
>>>df = df.append(s1,ignore_index=True)
>>>print(df)
One Two Three Four Five
0 0 1 2 3 4
1 5 6 7 8 9
2 10 11 12 13 14
3 15 16 17 18 19
4 20 21 22 23 24
3. 删除数据
3.1 Series删除数据
- 使用
drop(index)
函数删除索引处的值。
>>>import pandas as pd
>>>s1 = pd.Series(["a","b","c"],index=["one","two","three"])
>>>s1 = s1.drop("one")
>>>print(s1)
two b
three c
dtype: object
3.2 DataFrame删除数据
1) 删除列
- 使用
drop(columns,axis=1)
函数删除。- columns为行的索引。
- axis为1时表示列操作。
- 如果设置inplace参数为True则在原表格操作。
>>>import pandas as pd >>>import numpy as np >>>n1 = np.arange(20).reshape((4,5)) >>>index = list('ABCD') >>>columns = ['One','Two','Three','Four','Five'] >>>df = pd.DataFrame(data=n1,index=index,columns=columns) >>>df = df.drop(columns='Two',axis=1) >>>print(df) One Three Four Five A 0 2 3 4 B 5 7 8 9 C 10 12 13 14 D 15 17 18 19
2) 删除行
- 使用
drop(index)
函数删除。- axis默认为0。
>>>import pandas as pd >>>import numpy as np >>>n1 = np.arange(20).reshape((4,5)) >>>index = list('ABCD') >>>columns = ['One','Two','Three','Four','Five'] >>>df = pd.DataFrame(data=n1,index=index,columns=columns) >>>df.drop(['A','B'],inplace=True) >>>print(df) One Two Three Four Five C 10 11 12 13 14 D 15 16 17 18 19
4. 修改数据
4.1 修改名称
-
rename(column,index)
函数可以修改行或列名。 - column和index是一个新旧名比对的字典。
>>>import pandas as pd
>>>s1 = pd.Series(["a","b","c"],index=["one","two","three"])
>>>s1.rename({"one":1},inplace=True)
>>>print(s1)
1 a
two b
three c
dtype: object
>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.arange(20).reshape((4,5))
>>>index = list('ABCD')
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=index,columns=columns)
>>>df.rename(columns={'One':'A','Two':'B','Three':'C','Four':'D','Five':'E'},index={'A':'One','B':'Two','C':'Three','D':'Four'},inplace=True)
>>>print(df)
A B C D E
One 0 1 2 3 4
Two 5 6 7 8 9
Three 10 11 12 13 14
Four 15 16 17 18 19
4.2 修改数据
- 访问数据后可直接修改。
>>>import pandas as pd
>>>s1 = pd.Series(["a","b","c"],index=["one","two","three"])
>>>s1["two"] = 2
>>>print(s1)
one a
two 2
three c
dtype: object
>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.arange(20).reshape((4,5))
>>>index = list('ABCD')
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=index,columns=columns)
>>>df.loc['A':'B','One':'Two'] = "new value"
>>>print(df)
One Two Three Four Five
A new value new value 2 3 4
B new value new value 7 8 9
C 10 11 12 13 14
D 15 16 17 18 19
5. 查询数据
5.1 使用字典的方式查询
>>>import pandas as pd
>>>s1 = pd.Series(["a","b","c"],index=["one","two","three"])
>>>print(s1["two"])
b
>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.arange(20).reshape((4,5))
>>>index = list('ABCD')
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=index,columns=columns)
>>>print(df['Three']) # 读取列
>>>print(f"{'-'*20}")
>>>print(df[3:]) # 读取行
A 2
B 7
C 12
D 17
Name: Three, dtype: int32
--------------------
One Two Three Four Five
D 15 16 17 18 19
5.2 使用定位的方式查询
>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.arange(20).reshape((4,5))
>>>index = list('ABCD')
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=index,columns=columns)
>>>print(df.loc['A':'B','One':'Two'])
One Two
A 0 1
B 5 6
四、数据清洗
1. 使用指定值填充空值。
df.fillna(value=0)
>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.arange(20).reshape((4,5))
>>>index = list('ABCD')
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=index,columns=columns)
>>>df.loc['A':'B','One':'Two'] = None
>>>df.fillna(value=999,inplace=True)
>>>print(df)
One Two Three Four Five
A 999.0 999.0 2 3 4
B 999.0 999.0 7 8 9
C 10.0 11.0 12 13 14
D 15.0 16.0 17 18 19
2. 清除字符空格:
>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.arange(20).reshape((4,5))
>>>index = list('ABCD')
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=index,columns=columns)
>>>df.loc['A':'D','One':'Two'] = " with space "
>>>df['One'].map(str.strip) # 去除前后空格
3. 大小写转换:
>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.array(["CONTENT"]*20).reshape((4,5))
>>>index = list('ABCD')
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=index,columns=columns)
>>>df['Two'] = df['Two'].str.lower()
>>>print(df)
One Two Three Four Five
A CONTENT content CONTENT CONTENT CONTENT
B CONTENT content CONTENT CONTENT CONTENT
C CONTENT content CONTENT CONTENT CONTENT
D CONTENT content CONTENT CONTENT CONTENT
4. 更改数据格式
- 使用
astype(type)
改变数据格式。
>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.arange(20).reshape((4,5))
>>>index = list('ABCD')
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=index,columns=columns)
>>>df = df.astype('float')
>>>print(df)
One Two Three Four Five
A 0.0 1.0 2.0 3.0 4.0
B 5.0 6.0 7.0 8.0 9.0
C 10.0 11.0 12.0 13.0 14.0
D 15.0 16.0 17.0 18.0 19.0
5. 去除重复值:
- 使用
drop_duplicates(self, keep='first', inplace=False)
去处重复的行。 - 参数
keep='last'
则保留最后的行。
>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.random.randint(5, size=(4,5))
>>>index = list('ABCD')
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=index,columns=columns)
>>>print(df)
One Two Three Four Five
A 1 0 2 1 4
B 3 2 4 0 4
C 1 4 2 4 0
D 3 2 3 1 1
>>>df.drop_duplicates(['One','Two'],keep='first',inplace=True) # 只保留One Two重复的第一组
>>>print(df)
One Two Three Four Five
A 1 0 2 1 4
B 3 2 4 0 4
C 1 4 2 4 0
6. 数据替换
- 使用
replace()
函数替换表格中的值。
>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.random.randint(5, size=(4,5))
>>>index = list('ABCD')
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=index,columns=columns)
>>>df.replace(2,'B',inplace=True) # 只保留One Two重复的第一组
>>>print(df)
One Two Three Four Five
A 1 0 0 0 1
B B B 3 B 1
C 3 1 B B 4
D 0 3 B 3 4
五、数据预处理
1. 数据表合并
1) DataFrame.merge(df,df1,how="inner")函数
how
参数表示合并的方式,有"inner"、"outer"、"left"、"right"四种方式,默认为"inner"。>>>import pandas as pd >>>import numpy as np >>>n1 = np.arange(20).reshape((4,5)) >>>n2 = np.arange(20,40).reshape((4,5)) >>>index = list('ABCD') >>>columns = ['One','Two','Three','Four','Five'] >>>df1 = pd.DataFrame(data=n1,index=index,columns=columns) >>>df2 = pd.DataFrame(data=n2,index=index,columns=columns) >>>df3 = pd.merge(df1,df2,how='outer') >>>print(df3) One Two Three Four Five 0 0 1 2 3 4 1 5 6 7 8 9 2 10 11 12 13 14 3 15 16 17 18 19 4 20 21 22 23 24 5 25 26 27 28 29 6 30 31 32 33 34 7 35 36 37 38 39
2) DataFrame.append(df)函数
- 将两个DataFrame上下拼接在一起。
>>>import pandas as pd >>>import numpy as np >>>n1 = np.arange(20).reshape((4,5)) >>>n2 = np.arange(20,40).reshape((4,5)) >>>index = list('ABCD') >>>columns = ['One','Two','Three','Four','Five'] >>>df1 = pd.DataFrame(data=n1,index=index,columns=columns) >>>df2 = pd.DataFrame(data=n2,index=index,columns=columns) >>>df3 = df1.append(df2) >>>print(df3) One Two Three Four Five 0 0 1 2 3 4 1 5 6 7 8 9 2 10 11 12 13 14 3 15 16 17 18 19 4 20 21 22 23 24 5 25 26 27 28 29 6 30 31 32 33 34 7 35 36 37 38 39
3) DataFrame.join(df)函数
- 将两个DataFrame左右拼接在一起。
>>>import pandas as pd >>>import numpy as np >>>n1 = np.arange(20).reshape((4,5)) >>>n2 = np.arange(20,40).reshape((4,5)) >>>index = list('ABCD') >>>columns1 = ['One','Two','Three','Four','Five'] >>>columns2 = ['Six','Seven','Eight','Nine','Ten'] >>>df1 = pd.DataFrame(data=n1,index=index,columns=columns1) >>>df2 = pd.DataFrame(data=n2,index=index,columns=columns2) >>>df3 = df1.join(df2) >>>print(df3) One Two Three Four Five Six Seven Eight Nine Ten A 0 1 2 3 4 20 21 22 23 24 B 5 6 7 8 9 25 26 27 28 29 C 10 11 12 13 14 30 31 32 33 34 D 15 16 17 18 19 35 36 37 38 39
4) pd.concat(objs, axis=0, join='outer')函数
- 将多个DataFrame拼接在一起。
>>>import pandas as pd >>>import numpy as np >>>n1 = np.arange(10).reshape((2,5)) >>>n2 = np.arange(10,20).reshape((2,5)) >>>n3 = np.arange(20,30).reshape((2,5)) >>>columns = ['One','Two','Three','Four','Five'] >>>df1 = pd.DataFrame(data=n1,index=["A","B"],columns=columns) >>>df2 = pd.DataFrame(data=n2,index=["C","D"],columns=columns) >>>df3 = pd.DataFrame(data=n3,index=["E","F"],columns=columns) >>>df4 = pd.concat([df1,df2,df3]) >>>print(df4) One Two Three Four Five A 0 1 2 3 4 B 5 6 7 8 9 C 10 11 12 13 14 D 15 16 17 18 19 E 20 21 22 23 24 F 25 26 27 28 29
2. 设置复合索引
- 使用
set_index('id')
将列设置为行的索引。 - 使用
reset_index('id')
将行索引恢复为列。
>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.arange(10).reshape((2,5))
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=["A","B"],columns=columns)
>>>df = df.set_index('Four')
>>>print(df)
One Two Three Five
Four
3 0 1 2 4
8 5 6 7 9
>>>df = df.reset_index('Four')
>>>print(df)
Four One Two Three Five
0 3 0 1 2 4
1 8 5 6 7 9
3. 排序
1) 按值排序
DataFrame.sort_values(by)
>>>import pandas as pd >>>import numpy as np >>>n1 = np.random.randint(5, size=(4,5)) >>>columns = ['One','Two','Three','Four','Five'] >>>df = pd.DataFrame(data=n1,index=["A","B","C","D"],columns=columns) >>>df.sort_values(by=["One"],inplace=True) >>>print(df) One Two Three Four Five D 2 0 3 3 3 A 3 0 2 3 2 C 3 2 3 2 1 B 4 2 2 3 3
2) 按索引排序
DataFrame.sort_index()
>>>import pandas as pd >>>import numpy as np >>>n1 = np.arange(20).reshape((4,5)) >>>columns = ['One','Two','Three','Four','Five'] >>>df = pd.DataFrame(data=n1,index=[4,1,3,2],columns=columns) >>>df.sort_index(inplace=True) >>>print(df) One Two Three Four Five 1 5 6 7 8 9 2 15 16 17 18 19 3 10 11 12 13 14 4 0 1 2 3 4
3. 分组标记
1. 根据值分组标记
>>>import pandas as pd >>>import numpy as np >>>n1 = np.arange(20).reshape((4,5)) >>>columns = ['One','Two','Three','Four','Five'] >>>df = pd.DataFrame(data=n1,index=list("ABCD"),columns=columns) >>>df['group'] = np.where(df['Three'] > 10,'high','low') >>>print(df) One Two Three Four Five group A 0 1 2 3 4 low B 5 6 7 8 9 low C 10 11 12 13 14 high D 15 16 17 18 19 high
2. 对复合条件进行分组标记
>>>import pandas as pd >>>import numpy as np >>>n1 = np.random.randint(5, size=(4,5)) >>>columns = ['One','Two','Three','Four','Five'] >>>df = pd.DataFrame(data=n1,index=list("ABCD"),columns=columns) >>>df.loc[(df['Three']==4)&(df['Four']<3),'sign']= "target" >>>print(df) One Two Three Four Five sign A 0 1 2 0 3 NaN B 2 4 4 0 1 target C 4 1 0 2 4 NaN D 3 1 4 0 0 target
3. 对字段分组并创建新表
>>>import pandas as pd >>>import numpy as np >>>n1 = np.random.randint(5, size=(4,5)) >>>columns = ['One','Two','Three','Four','Five'] >>>df = pd.DataFrame(data=n1,index=list("ABCD"),columns=columns) >>>d1 = pd.DataFrame(((x,(np.where(x > 3,'high','low'))) for x in >>>>df['Two']),index=df.index,columns=['value','type']) >>>print(d1) value type A 1 low B 4 high C 3 low D 4 high
六、数据筛选
1. “与”筛选
>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.random.randint(5, size=(4,5))
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=list("ABCD"),columns=columns)
>>>df1 = df.loc[(df['One']>3)&(df['Four']<3),columns]
>>>print(df1)
One Two Three Four Five
B 4 2 2 2 4
2. “或”筛选
>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.random.randint(5, size=(4,5))
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=list("ABCD"),columns=columns)
>>>df1 = df.loc[(df['One']>3)|(df['Four']<3),columns]
>>>print(df1)
One Two Three Four Five
B 1 4 4 0 2
C 3 1 4 2 2
3. “非”筛选
>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.random.randint(5, size=(4,5))
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=list("ABCD"),columns=columns)
>>>df1 = df.loc[(df['One']!=3),columns]
>>>print(df1)
One Two Three Four Five
A 1 2 1 2 0
B 4 2 4 0 2
4. DataFrame.query(expr, inplace = False, ** kwargs)函数
- query()函数可以使用布尔表达式查询列。
- expr为查询条件字符串,可以使用'@'引入变量。
>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.random.randint(5, size=(4,5))
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=list("ABCD"),columns=columns)
>>>num = 3
>>>df.query('One<@num and Two>@num',inplace=True)
>>>print(df)
One Two Three Four Five
A 1 4 0 0 3
七、数据统计
1. 统计数据长度
- 使用
count(axis)
统计列或行的长度。
>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.random.randint(5, size=(4,5))
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=list("ABCD"),columns=columns)
>>>print(df,'\n')
>>>print(df.count(1),'\n')
>>>print(df.count(0))
One Two Three Four Five
A 4 2 4 1 0
B 1 4 0 1 3
C 0 3 4 0 1
D 4 3 3 4 0
A 5
B 5
C 5
D 5
dtype: int64
One 4
Two 4
Three 4
Four 4
Five 4
dtype: int64
2. 聚合操作
- 使用
FramData.agg()
进行基于列的聚合操作。
>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.random.randint(5, size=(4,5))
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=list("ABCD"),columns=columns)
>>>np2 = df.agg([len,np.sum, np.mean]) # 计算长度,合以及平均数
>>>np2
One Two Three Four Five
len 4.0 4.0 4.0 4.0 4.00
sum 6.0 6.0 10.0 8.0 9.00
mean 1.5 1.5 2.5 2.0 2.25
3.简单的数据采样
- 使用
DataFrame.sample(n)
进行简单采样。
>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.random.randint(5, size=(4,5))
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=list("ABCD"),columns=columns)
>>>df.sample(n=2)
One Two Three Four Five
D 2 2 4 0 1
A 1 1 1 1 2
4.权重数据采样
>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.random.randint(5, size=(4,5))
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=list("ABCD"),columns=columns)
>>>weights = [0,1,0,1] # 设置权重
>>>df.sample(n=2,weights=weights)
One Two Three Four Five
B 2 2 4 3 1
D 0 2 1 0 0
5.采样数据放回
>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.random.randint(5, size=(4,5))
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=list("ABCD"),columns=columns)
>>>df.sample(n=2,replace=True) # 参数replace
6.描述性统计
-
DataFrame.describe()
获得数据的统计信息。
>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.random.randint(5, size=(4,5))
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=list("ABCD"),columns=columns)
>>>df.describe().round(2).T
count mean std min 25% 50% 75% max
One 4.0 1.75 1.50 1.0 1.00 1.0 1.75 4.0
Two 4.0 2.50 1.73 0.0 2.25 3.0 3.25 4.0
Three 4.0 2.50 1.29 1.0 1.75 2.5 3.25 4.0
Four 4.0 3.50 1.00 2.0 3.50 4.0 4.00 4.0
Five 4.0 3.25 0.50 3.0 3.00 3.0 3.25 4.0
7.计算标准差
>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.random.randint(5, size=(4,5))
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=list("ABCD"),columns=columns)
>>>df.std() # 计算标准差
One 1.825742
Two 1.258306
Three 1.414214
Four 0.957427
Five 1.414214
dtype: float64
8.计算协方差
import pandas as pd
import numpy as np
n1 = np.random.randint(5, size=(4,5))
columns = ['One','Two','Three','Four','Five']
df = pd.DataFrame(data=n1,index=list("ABCD"),columns=columns)
df.cov() # 计算协方差
One Two Three Four Five
One 3.583333 3.083333 -1.666667 -1.833333 -2.916667
Two 3.083333 2.916667 -2.333333 -1.166667 -3.083333
Three -1.666667 -2.333333 4.000000 -0.666667 3.000000
Four -1.833333 -1.166667 -0.666667 1.666667 0.833333
Five -2.916667 -3.083333 3.000000 0.833333 4.250000
9.相关性分析
>>>import pandas as pd
>>>import numpy as np
>>>n1 = np.random.randint(5, size=(4,5))
>>>columns = ['One','Two','Three','Four','Five']
>>>df = pd.DataFrame(data=n1,index=list("ABCD"),columns=columns)
>>>df.corr()
One Two Three Four Five
One 1.000000 -0.885615 -0.342997 -0.792118 -0.980196
Two -0.885615 1.000000 0.258199 0.670820 0.948683
Three -0.342997 0.258199 1.000000 -0.288675 0.408248
Four -0.792118 0.670820 -0.288675 1.000000 0.707107
Five -0.980196 0.948683 0.408248 0.707107 1.000000