本节讲述对于两个数据集按照相同列的值进行合并。
首先定义原始数据:
import pandas as pdimport numpy as npdata0 = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'], 'A': ['A0', 'A1', 'A2', 'A3'], 'B': ['B0', 'B1', 'B2', 'B3'] })data1 = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'], 'C': ['C0', 'C1', 'C2', 'C3'], 'D': ['D0', 'D1', 'D2', 'D3'] })print("data0:")print(data0)print("data1:")print(data1)
输出为:
data0: A B key0 A0 B0 K01 A1 B1 K12 A2 B2 K23 A3 B3 K3data1: C D key0 C0 D0 K01 C1 D1 K12 C2 D2 K23 C3 D3 K3
啥也不做,直接合并:
print(pd.merge(data0, data1))
输出为:
A B key C D0 A0 B0 K0 C0 D01 A1 B1 K1 C1 D12 A2 B2 K2 C2 D23 A3 B3 K3 C3 D3
默认情况下的合并是根据两个数据集中共同的列拥有相同的值来进行合并的。
我们再举一个例子,大家可以观察下:
import pandas as pdimport numpy as npdata0 = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'], 'key2': ['K0', 'K1', 'K0', 'K1'], 'A': ['A0', 'A1', 'A2', 'A3'], 'B': ['B0', 'B1', 'B2', 'B3'] })data1 = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'], 'key2': ['K0', 'K0', 'K0', 'K0'], 'C': ['C0', 'C1', 'C2', 'C3'], 'D': ['D0', 'D1', 'D2', 'D3'] })print("data0:")print(data0)print("data1:")print(data1)print("合并后的数据为:")print(pd.merge(data0, data1))
输出为:
data0: A B key1 key20 A0 B0 K0 K01 A1 B1 K0 K12 A2 B2 K1 K03 A3 B3 K2 K1data1: C D key1 key20 C0 D0 K0 K01 C1 D1 K1 K02 C2 D2 K1 K03 C3 D3 K2 K0合并后的数据为: A B key1 key2 C D0 A0 B0 K0 K0 C0 D01 A2 B2 K1 K0 C1 D12 A2 B2 K1 K0 C2 D2
在merge参数中可以添加how的参数,这个参数默认为inner,可选值有:
left,right,outer,inner。对于how='outer'
不管key有没有一模一样,都把它们给复制下来,例如:print(pd.merge(data0, data1, how='outer'))
输出为:
A B key1 key2 C D0 A0 B0 K0 K0 C0 D01 A1 B1 K0 K1 NaN NaN2 A2 B2 K1 K0 C1 D13 A2 B2 K1 K0 C2 D24 A3 B3 K2 K1 NaN NaN5 NaN NaN K2 K0 C3 D3
如果我们设置how='right',则输出结果会以第二个数据集的key为准:
print(pd.merge(data0, data1, how='right'))
输出为:
A B key1 key2 C D0 A0 B0 K0 K0 C0 D01 A2 B2 K1 K0 C1 D12 A2 B2 K1 K0 C2 D23 NaN NaN K2 K0 C3 D3
indicator
indicator参数用来指示出当前记录的合并方式,例如:
print(pd.merge(data0, data1, indicator=True, how='outer'))
输出:
A B key1 key2 C D _merge0 A0 B0 K0 K0 C0 D0 both1 A1 B1 K0 K1 NaN NaN left_only2 A2 B2 K1 K0 C1 D1 both3 A2 B2 K1 K0 C2 D2 both4 A3 B3 K2 K1 NaN NaN left_only5 NaN NaN K2 K0 C3 D3 right_only
目前,indicator默认的列名为 _merge,如果你看着不爽,可以通过indicator="字段名"的方式来修改这个字段名。
按照index进行合并
前面是通过字段名来进行合并的,但有时我们可以把index看成是一个主键,这样就相当于根据主键进行合并数据,例如:
import pandas as pdimport numpy as npdata0 = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'], 'key2': ['K0', 'K1', 'K0', 'K1'], 'A': ['A0', 'A1', 'A2', 'A3'], 'B': ['B0', 'B1', 'B2', 'B3'] }, index=["T0", "T1", "T2", "T3"])data1 = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'], 'key2': ['K0', 'K0', 'K0', 'K0'], 'C': ['C0', 'C1', 'C2', 'C3'], 'D': ['D0', 'D1', 'D2', 'D3'] }, index=["T0", "T1", "T4", "T5"])print("data0:")print(data0)print("data1:")print(data1)print("合并后的数据为:")print(pd.merge(data0, data1, left_index=True, right_index=True))
输出为:
data0: A B key1 key2T0 A0 B0 K0 K0T1 A1 B1 K0 K1T2 A2 B2 K1 K0T3 A3 B3 K2 K1data1: C D key1 key2T0 C0 D0 K0 K0T1 C1 D1 K1 K0T4 C2 D2 K1 K0T5 C3 D3 K2 K0合并后的数据为: A B key1_x key2_x C D key1_y key2_yT0 A0 B0 K0 K0 C0 D0 K0 K0T1 A1 B1 K0 K1 C1 D1 K1 K0
这里需要同时设置left_index=True, right_index=True
相同列名添加后缀区分
如果我们不加任何后缀的情况下,系统会自动添加_x,_y之类的后缀进行区分,例如:
import pandas as pdimport numpy as npdata0 = pd.DataFrame({'k': ['K0', 'K1', 'K2'], 'age': [1, 2, 3]})data1 = pd.DataFrame({'k': ['K0', 'K0', 'K3'], 'age': [4, 5, 6]})print("data0:")print(data0)print("data1:")print(data1)print("合并后的数据为:")print(pd.merge(data0, data1, on='k'))
输出为:
data0: age k0 1 K01 2 K12 3 K2data1: age k0 4 K01 5 K02 6 K3合并后的数据为: age_x k age_y0 1 K0 41 1 K0 5
我们可以通过suffixes属性来修改默认的后缀名:
print(pd.merge(data0, data1, on='k', suffixes=['_boy', '_girl']))
输出为:
age_boy k age_girl0 1 K0 41 1 K0 5