数据库风格的 DataFrame 或命名 Series 的合并
pandas
具有功能齐全且高性能的连接操作,与 SQL
关系型数据库类似
这些方法的性能明显优于其他开源的工具,在某些情况下可能远远超过一个数量级。如 R
语言中的 base::merge.data.frame
。
其优良的性能源自精心设计的算法和 DataFrame
中数据的内部布局
熟悉 SQL
但不熟悉 pandas
的用户可能会对其与 SQL
的比较感兴趣
pandas
的 merge()
函数,提供了 DataFrame
或命名 Series
对象之间的所有标准数据库连接操作
参数列表
pd.merge(
left,
right,
how="inner",
on=None,
left_on=None,
right_on=None,
left_index=False,
right_index=False,
sort=True,
suffixes=("_x", "_y"),
copy=True,
indicator=False,
validate=None,
)
注意:
在 0.23.0
开始,on
, left_on
和 right_on
参数支持指定索引的级别,从 0.24.0
开始支持对命名 Series
的合并
merge
是 pandas
的顶层方法,但是也可以作为 DataFrame
对象的实例方法,调用的 DataFrame
对象被隐式的视为连接的左侧对象
相关的 join()
方法内部的实现是基于 merge
方法,主要用于索引和索引,列和索引的连接。
如果只是想通过索引来连接,可以考虑使用 join()
减少输入
1 merge 函数简要入门(关系代数)
我们可以将 DataFrame
看作是 SQL
的表,而熟悉 SQL
关系型数据框的人应该对下面的术语很熟悉
-
one-to-one
(一对一) -
many-to-one
(多对一) -
many-to-many
(多对多)
注意:在进行列与列的合并时,用于连接的 DataFrame
对象上的索引都会被丢弃
在 SQL
或标准关系代数中,如果一个连接键在两张表中出现一次以上,那么生成的表将具有相关数据的笛卡尔积。
下面这个例子中,连接的键是唯一的
In [39]: left = pd.DataFrame(
....: {
....: "key": ["K0", "K1", "K2", "K3"],
....: "A": ["A0", "A1", "A2", "A3"],
....: "B": ["B0", "B1", "B2", "B3"],
....: }
....: )
....:
In [40]: right = pd.DataFrame(
....: {
....: "key": ["K0", "K1", "K2", "K3"],
....: "C": ["C0", "C1", "C2", "C3"],
....: "D": ["D0", "D1", "D2", "D3"],
....: }
....: )
....:
In [41]: result = pd.merge(left, right, on="key")
下面是一个更加复杂的例子,具有多个连接的键。默认的连接方式是 how='inner'
,即指定的键的数据要同时出现在 left
和 right
对象中
In [42]: left = pd.DataFrame(
....: {
....: "key1": ["K0", "K0", "K1", "K2"],
....: "key2": ["K0", "K1", "K0", "K1"],
....: "A": ["A0", "A1", "A2", "A3"],
....: "B": ["B0", "B1", "B2", "B3"],
....: }
....: )
....:
In [43]: right = pd.DataFrame(
....: {
....: "key1": ["K0", "K1", "K1", "K2"],
....: "key2": ["K0", "K0", "K0", "K0"],
....: "C": ["C0", "C1", "C2", "C3"],
....: "D": ["D0", "D1", "D2", "D3"],
....: }
....: )
....:
In [44]: result = pd.merge(left, right, on=["key1", "key2"])
merge
的 how
参数决定了什么样的键应该包含在结果中。以下是连接的方式以及对应的 SQL
操作
In [45]: result = pd.merge(left, right, how="left", on=["key1", "key2"])
In [46]: result = pd.merge(left, right, how="right", on=["key1", "key2"])
In [47]: result = pd.merge(left, right, how="outer", on=["key1", "key2"])
In [48]: result = pd.merge(left, right, how="inner", on=["key1", "key2"])
如果 MultiIndex
的级别名称与 DataFrame
中的列名相对应,则可以合并一个 MultiIndex
的 Series
和 DataFrame
在合并之前,可以使用 Series.reset_index()
将 Series
转换为 DataFrame
In [49]: df = pd.DataFrame({"Let": ["A", "B", "C"], "Num": [1, 2, 3]})
In [50]: df
Out[50]:
Let Num
0 A 1
1 B 2
2 C 3
In [51]: ser = pd.Series(
....: ["a", "b", "c", "d", "e", "f"],
....: index=pd.MultiIndex.from_arrays(
....: [["A", "B", "C"] * 2, [1, 2, 3, 4, 5, 6]], names=["Let", "Num"]
....: ),
....: )
....:
In [52]: ser
Out[52]:
Let Num
A 1 a
B 2 b
C 3 c
A 4 d
B 5 e
C 6 f
dtype: object
In [53]: pd.merge(df, ser.reset_index(), on=["Let", "Num"])
Out[53]:
Let Num 0
0 A 1 a
1 B 2 b
2 C 3 c
下面是另一个在 DataFrame
中使用重复连接键的例子
In [54]: left = pd.DataFrame({"A": [1, 2], "B": [2, 2]})
In [55]: right = pd.DataFrame({"A": [4, 5, 6], "B": [2, 2, 2]})
In [56]: result = pd.merge(left, right, on="B", how="outer")
2 检查重复的键
用户可以使用 validate
参数自动检查其合并的键中是否有意外的重复项
在合并操作之前检查键的唯一性,可以防止内存溢出,也是确保用户数据结构符合预期的一种好方法
在下面的示例中,right
的 B
列存在重复值。因此不是一对一的合并,validate
参数中所指定的方式将引发异常
In [57]: left = pd.DataFrame({"A": [1, 2], "B": [1, 2]})
In [58]: right = pd.DataFrame({"A": [4, 5, 6], "B": [2, 2, 2]})
>>> result = pd.merge(left, right, on="B", how="outer", validate="one_to_one")
...
MergeError: Merge keys are not unique in right dataset; not a one-to-one merge
如果用户知道 right
中存在重复值,但又想确保 left
中没有重复,可以使用 validate='one_to_many'
参数来代替,这样就不会引发异常
In [59]: pd.merge(left, right, on="B", how="outer", validate="one_to_many")
Out[59]:
A_x B A_y
0 1 1 NaN
1 2 2 4.0
2 2 2 5.0
3 2 2 6.0
3 merge 指示器
merge
接受一个 indicator
参数,如果为 True
,则会在输出对象中添加一个名为 _merge
的分类型列
In [60]: df1 = pd.DataFrame({"col1": [0, 1], "col_left": ["a", "b"]})
In [61]: df2 = pd.DataFrame({"col1": [1, 2, 2], "col_right": [2, 2, 2]})
In [62]: pd.merge(df1, df2, on="col1", how="outer", indicator=True)
Out[62]:
col1 col_left col_right _merge
0 0 a NaN left_only
1 1 b 2.0 both
2 2 NaN 2.0 right_only
3 2 NaN 2.0 right_only
indicator
参数也接受字符串,在这种情况下,将使用传递的字符串的值作为 indicator
列的名称
In [63]: pd.merge(df1, df2, on="col1", how="outer", indicator="indicator_column")
Out[63]:
col1 col_left col_right indicator_column
0 0 a NaN left_only
1 1 b 2.0 both
2 2 NaN 2.0 right_only
3 2 NaN 2.0 right_only
4 merge 数据类型
对于下面的 DataFrame
对象
In [64]: left = pd.DataFrame({"key": [1], "v1": [10]})
In [65]: left
Out[65]:
key v1
0 1 10
In [66]: right = pd.DataFrame({"key": [1, 2], "v1": [20, 30]})
In [67]: right
Out[67]:
key v1
0 1 20
1 2 30
合并操作会保留连接的键的数据类型
In [68]: pd.merge(left, right, how="outer")
Out[68]:
key v1
0 1 10
1 1 20
2 2 30
In [69]: pd.merge(left, right, how="outer").dtypes
Out[69]:
key int64
v1 int64
dtype: object
当然,如果引入了缺失值,则数据类型会发生向上转换
In [70]: pd.merge(left, right, how="outer", on="key")
Out[70]:
key v1_x v1_y
0 1 10.0 20
1 2 NaN 30
In [71]: pd.merge(left, right, how="outer", on="key").dtypes
Out[71]:
key int64
v1_x float64
v1_y int64
dtype: object
合并操作也能够保留 category
类型。例如,有如下的 left
对象
In [72]: from pandas.api.types import CategoricalDtype
In [73]: X = pd.Series(np.random.choice(["foo", "bar"], size=(10,)))
In [74]: X = X.astype(CategoricalDtype(categories=["foo", "bar"]))
In [75]: left = pd.DataFrame(
....: {"X": X, "Y": np.random.choice(["one", "two", "three"], size=(10,))}
....: )
....:
In [76]: left
Out[76]:
X Y
0 bar one
1 foo one
2 foo three
3 bar three
4 foo one
5 bar one
6 bar three
7 bar three
8 bar three
9 foo three
In [77]: left.dtypes
Out[77]:
X category
Y object
dtype: object
和 right
对象
In [78]: right = pd.DataFrame(
....: {
....: "X": pd.Series(["foo", "bar"], dtype=CategoricalDtype(["foo", "bar"])),
....: "Z": [1, 2],
....: }
....: )
....:
In [79]: right
Out[79]:
X Z
0 foo 1
1 bar 2
In [80]: right.dtypes
Out[80]:
X category
Z int64
dtype: object
合并的结果为
In [81]: result = pd.merge(left, right, how="outer")
In [82]: result
Out[82]:
X Y Z
0 bar one 2
1 bar three 2
2 bar one 2
3 bar three 2
4 bar three 2
5 bar three 2
6 foo one 1
7 foo three 1
8 foo one 1
9 foo three 1
In [83]: result.dtypes
Out[83]:
X category
Y object
Z int64
dtype: object
5 join 连接
DataFrame.join()
是一个简便的方法,能够通过索引连接两个 DataFrame
对象
例如
In [84]: left = pd.DataFrame(
....: {"A": ["A0", "A1", "A2"], "B": ["B0", "B1", "B2"]}, index=["K0", "K1", "K2"]
....: )
....:
In [85]: right = pd.DataFrame(
....: {"C": ["C0", "C2", "C3"], "D": ["D0", "D2", "D3"]}, index=["K0", "K2", "K3"]
....: )
....:
In [86]: result = left.join(right)
In [87]: result = left.join(right, how="outer")
In [88]: result = left.join(right, how="inner")
当然,也可以使用 merge
实现相同的功能,但是需要写更多的代码
In [89]: result = pd.merge(left, right, left_index=True, right_index=True, how="outer")
In [90]: result = pd.merge(left, right, left_index=True, right_index=True, how="inner")
6 通过列和索引连接
jion()
接受一个 on
参数,用于指定该对象中用于连接的列名或列名列表与传入的 DataFrame
的索引进行连接
下面两个函数的功能是一样的
left.join(right, on=key_or_keys)
pd.merge(
left, right, left_on=key_or_keys, right_index=True, how="left", sort=False
)
显然,你可以选择自己认为的更简便的方式,对于多对一的连接,使用 join()
可能会更加方便
例如
In [91]: left = pd.DataFrame(
....: {
....: "A": ["A0", "A1", "A2", "A3"],
....: "B": ["B0", "B1", "B2", "B3"],
....: "key": ["K0", "K1", "K0", "K1"],
....: }
....: )
....:
In [92]: right = pd.DataFrame({"C": ["C0", "C1"], "D": ["D0", "D1"]}, index=["K0", "K1"])
In [93]: result = left.join(right, on="key")
In [94]: result = pd.merge(
....: left, right, left_on="key", right_index=True, how="left", sort=False
....: )
....:
如果要连接多个键,传递的 DataFrame
必须有一个 MultiIndex
In [95]: left = pd.DataFrame(
....: {
....: "A": ["A0", "A1", "A2", "A3"],
....: "B": ["B0", "B1", "B2", "B3"],
....: "key1": ["K0", "K0", "K1", "K2"],
....: "key2": ["K0", "K1", "K0", "K1"],
....: }
....: )
....:
In [96]: index = pd.MultiIndex.from_tuples(
....: [("K0", "K0"), ("K1", "K0"), ("K2", "K0"), ("K2", "K1")]
....: )
....:
In [97]: right = pd.DataFrame(
....: {"C": ["C0", "C1", "C2", "C3"], "D": ["D0", "D1", "D2", "D3"]}, index=index
....: )
....:
现在可以通过传递两个列名来连接
In [98]: result = left.join(right, on=["key1", "key2"])
DataFrame.join
默认执行的是 left join
,想要实现其他连接也很方便
例如,inner join
In [99]: result = left.join(right, on=["key1", "key2"], how="inner")
7 单索引和层次索引的连接
可以将单索引与层次索引的某一级别进行连接
In [100]: left = pd.DataFrame(
.....: {"A": ["A0", "A1", "A2"], "B": ["B0", "B1", "B2"]},
.....: index=pd.Index(["K0", "K1", "K2"], name="key"),
.....: )
.....:
In [101]: index = pd.MultiIndex.from_tuples(
.....: [("K0", "Y0"), ("K1", "Y1"), ("K2", "Y2"), ("K2", "Y3")],
.....: names=["key", "Y"],
.....: )
.....:
In [102]: right = pd.DataFrame(
.....: {"C": ["C0", "C1", "C2", "C3"], "D": ["D0", "D1", "D2", "D3"]},
.....: index=index,
.....: )
.....:
In [103]: result = left.join(right, how="inner")
下面的代码与上面是等价的,但是代码更多
In [104]: result = pd.merge(
.....: left.reset_index(), right.reset_index(), on=["key"], how="inner"
.....: ).set_index(["key","Y"])
.....:
8 两个 MultiIndex 的连接
使用该方式是有限制的,right
对象的索引必须是 left
对象索引的子集
In [105]: leftindex = pd.MultiIndex.from_product(
.....: [list("abc"), list("xy"), [1, 2]], names=["abc", "xy", "num"]
.....: )
.....:
In [106]: left = pd.DataFrame({"v1": range(12)}, index=leftindex)
In [107]: left
Out[107]:
v1
abc xy num
a x 1 0
2 1
y 1 2
2 3
b x 1 4
2 5
y 1 6
2 7
c x 1 8
2 9
y 1 10
2 11
In [108]: rightindex = pd.MultiIndex.from_product(
.....: [list("abc"), list("xy")], names=["abc", "xy"]
.....: )
.....:
In [109]: right = pd.DataFrame({"v2": [100 * i for i in range(1, 7)]}, index=rightindex)
In [110]: right
Out[110]:
v2
abc xy
a x 100
y 200
b x 300
y 400
c x 500
y 600
In [111]: left.join(right, on=["abc", "xy"], how="inner")
Out[111]:
v1 v2
abc xy num
a x 1 0 100
2 1 100
y 1 2 200
2 3 200
b x 1 4 300
2 5 300
y 1 6 400
2 7 400
c x 1 8 500
2 9 500
y 1 10 600
2 11 600
如果不满足该条件,则可以使用下面的代码连接两个具有层次索引的 DataFrame
In [112]: leftindex = pd.MultiIndex.from_tuples(
.....: [("K0", "X0"), ("K0", "X1"), ("K1", "X2")], names=["key", "X"]
.....: )
.....:
In [113]: left = pd.DataFrame(
.....: {"A": ["A0", "A1", "A2"], "B": ["B0", "B1", "B2"]}, index=leftindex
.....: )
.....:
In [114]: rightindex = pd.MultiIndex.from_tuples(
.....: [("K0", "Y0"), ("K1", "Y1"), ("K2", "Y2"), ("K2", "Y3")], names=["key", "Y"]
.....: )
.....:
In [115]: right = pd.DataFrame(
.....: {"C": ["C0", "C1", "C2", "C3"], "D": ["D0", "D1", "D2", "D3"]}, index=rightindex
.....: )
.....:
In [116]: result = pd.merge(
.....: left.reset_index(), right.reset_index(), on=["key"], how="inner"
.....: ).set_index(["key", "X", "Y"])
.....:
9 列名与索引的组合进行连接
on
、left_on
和 right_on
参数传递的字符串可以是列名或索引级别名称,这就可以在不重置索引的情况下,对索引级别和列名进行组合,合并两个 DataFrame
实例对象
In [117]: left_index = pd.Index(["K0", "K0", "K1", "K2"], name="key1")
In [118]: left = pd.DataFrame(
.....: {
.....: "A": ["A0", "A1", "A2", "A3"],
.....: "B": ["B0", "B1", "B2", "B3"],
.....: "key2": ["K0", "K1", "K0", "K1"],
.....: },
.....: index=left_index,
.....: )
.....:
In [119]: right_index = pd.Index(["K0", "K1", "K2", "K2"], name="key1")
In [120]: right = pd.DataFrame(
.....: {
.....: "C": ["C0", "C1", "C2", "C3"],
.....: "D": ["D0", "D1", "D2", "D3"],
.....: "key2": ["K0", "K0", "K0", "K1"],
.....: },
.....: index=right_index,
.....: )
.....:
In [121]: result = left.merge(right, on=["key1", "key2"])
注意:如果用于连接的字符串名称既匹配了索引级别名,又匹配了列名,将会引发一个警告,并有效使用列名
10 交叠的列
merge
的 suffix
参数接受一个字符串元组列表,作为合并结果中相同列名的后缀,以便于区别列来自哪个 DataFrame
In [122]: left = pd.DataFrame({"k": ["K0", "K1", "K2"], "v": [1, 2, 3]})
In [123]: right = pd.DataFrame({"k": ["K0", "K0", "K3"], "v": [4, 5, 6]})
In [124]: result = pd.merge(left, right, on="k")
In [125]: result = pd.merge(left, right, on="k", suffixes=("_l", "_r"))
而 DataFrame.join()
中的 lsuffix
和 rsuffix
参数也可以达到一样的效果
In [126]: left = left.set_index("k")
In [127]: right = right.set_index("k")
In [128]: result = left.join(right, lsuffix="_l", rsuffix="_r")
11 连接多个 DataFrame
也可以将 DataFrame
列表或元组传递给 join()
,将会根据它们的索引连接在一起
In [129]: right2 = pd.DataFrame({"v": [7, 8, 9]}, index=["K1", "K1", "K2"])
In [130]: result = left.join([right, right2])
12 合并相似的 Series 或 DataFrame
另一种常见的情况是,有两个具有类似索引的 Series
或 DataFrame
对象,并希望用一个对象的值来修复另一个对象中对应位置的值
例如
In [131]: df1 = pd.DataFrame(
.....: [[np.nan, 3.0, 5.0], [-4.6, np.nan, np.nan], [np.nan, 7.0, np.nan]]
.....: )
.....:
In [132]: df2 = pd.DataFrame([[-42.6, np.nan, -8.2], [-5.0, 1.6, 4]], index=[1, 2])
使用 combine_first()
来实现修复
In [133]: result = df1.combine_first(df2)
注意:这个方法只有当 left
对象中存在缺失值,才会用 right
对象中对应位置的值来替换
而相关的 update()
方法,会在原地修改并替换非缺失值
In [134]: df1.update(df2)