联合与合并数据集
- 数据库风格的DataFrame连接
- 搜索索引合并
- 沿轴向连接
- 联合重叠数据
数据库风格的DataFrame连接
我们可以使用pandas中的merge函数,将两个DataFrame相连,先看多对一的列子:
import pandas as pd
df1 = pd.DataFrame({"key":list('bbacaab'),'data1':range(7)})
# key data1
# 0 b 0
# 1 b 1
# 2 a 2
# 3 c 3
# 4 a 4
# 5 a 5
# 6 b 6
df2 = pd.DataFrame({'key':list('abd'),'data2':range(3)})
# key data2
# 0 a 0
# 1 b 1
# 2 d 2
print(pd.merge(df1,df2))
# key data1 data2
# 0 b 0 1
# 1 b 1 1
# 2 b 6 1
# 3 a 2 0
# 4 a 4 0
# 5 a 5 0
对以上代码进行以下说明:
- 当没有指定按哪一列进行连接的情况下,会默认将重列列名作为连接的键。可以使用left_on、right_on(在需要连接的列名不同的情况下)和on(相同的情况下)来指定连接的列。
2.我们可以看到结果中缺少c和d的值,这是因为merge采取的是内连接(inner),结果中的键是两张表的交集,其他连接方式还有外连接(outer)、左连接(left)和右连接(right),其中外连接是键的并集。
print(pd.merge(df1,df2,how='outer'))
# key data1 data2
# 0 b 0.0 1.0
# 1 b 1.0 1.0
# 2 b 6.0 1.0
# 3 a 2.0 0.0
# 4 a 4.0 0.0
# 5 a 5.0 0.0
# 6 c 3.0 NaN
# 7 d NaN 2.0
print(pd.merge(df1,df2,how='left'))
# key data1 data2
# 0 b 0 1.0
# 1 b 1 1.0
# 2 a 2 0.0
# 3 c 3 NaN
# 4 a 4 0.0
# 5 a 5 0.0
# 6 b 6 1.0
print(pd.merge(df1,df2,how='right'))
# key data1 data2
# 0 b 0.0 1
# 1 b 1.0 1
# 2 b 6.0 1
# 3 a 2.0 0
# 4 a 4.0 0
# 5 a 5.0 0
# 6 d NaN 2
下面考虑多对多(两个DataFrame中一个键值对应着多个数值)的合并:
import pandas as pd
df1 = pd.DataFrame({'key':list('bbacab'),
'data1':range(6)})
# key data1
# 0 b 0
# 1 b 1
# 2 a 2
# 3 c 3
# 4 a 4
# 5 b 5
df2 = pd.DataFrame({'key':list('ababd'),
'data2':range(5)})
# key data2
# 0 a 0
# 1 b 1
# 2 a 2
# 3 b 3
# 4 d 4
print(pd.merge(df1,df2,on = 'key',how = 'outer'))
# key data1 data2
# 0 b 0 1.0
# 1 b 0 3.0
# 2 b 1 1.0
# 3 b 1 3.0
# 4 a 2 0.0
# 5 a 2 2.0
# 6 c 3 NaN
# 7 a 4 0.0
# 8 a 4 2.0
# 9 b 5 1.0
# 10 b 5 3.0
这里我们注意到,df1中有3个’b’行,df2中有2个’b’行,因此结果中有6个’b’行
使用多个键进行合并时,传入的是一个列名的列表:
import pandas as pd
left = pd.DataFrame({'key1':['foo','foo','bar'],
'key2':['one','two','one'],
'lval':[1,2,3]})
# key1 key2 lval
# 0 foo one 1
# 1 foo two 2
# 2 bar one 3
right = pd.DataFrame({'key1':['foo','foo','bar','bar'],
'key2':['one','one','one','two'],
'rval':[4,5,6,7]})
# key1 key2 rval
# 0 foo one 4
# 1 foo one 5
# 2 bar one 6
# 3 bar two 7
print(pd.merge(left,right,on = ['key1','key2'],how = 'outer'))
# key1 key2 lval rval
# 0 foo one 1.0 4.0
# 1 foo one 1.0 5.0
# 2 foo two 2.0 NaN
# 3 bar one 3.0 6.0
# 4 bar two NaN 7.0
在上面的例子中,倘若我们连接的键是’key1’而不是’key1’和’key2’,那么就会产生重名问题,这种情况下pandas会默认为其加上后缀:
print(pd.merge(left,right,on = 'key1',how = 'outer'))
# key1 key2_x lval key2_y rval
# 0 foo one 1 one 4
# 1 foo one 1 one 5
# 2 foo two 2 one 4
# 3 foo two 2 one 5
# 4 bar one 3 one 6
# 5 bar one 3 two 7
我们可以指定suffixes来指定标签在重名后添加的后缀:
print(pd.merge(left,right,on = 'key1',how = 'outer',suffixes=('_left','_right')))
# key1 key2_left lval key2_right rval
# 0 foo one 1 one 4
# 1 foo one 1 one 5
# 2 foo two 2 one 4
# 3 foo two 2 one 5
# 4 bar one 3 one 6
# 5 bar one 3 two 7
以下是merge函数的参数:
参数 | 描述 |
left | 合并时操作左边的DataFrame |
right | 合并时操作左边的DataFrame |
how | 连接的方式 |
on | 需要连接的列名。必须是两边的DataFrame对象都有的列名,并以left和right中的列名的交集作为连接键 |
left_on | left DataFrame中用作连接键的列 |
right_on | right DataFrame中用作连接键的列 |
sort | 通过连接键的字母顺序对合并的数据进行排序;默认为True |
suffixes | 在重叠的情况下,添加到列名后的字符串元组;默认是(_x,_y) |
copy | 如果为False,则在某些特殊情况下避免将数据复制到结果数据结果中,默认总是复制 |
indicator | 添加一个列_merge,指示每一行的来源,值将根据每行中的连接数据的来源分别为‘left_only’,‘right_only’,‘both’ |
搜索索引合并
我们可以将right_index和left_index的值指定为True来将索引作为合并的键:
import pandas as pd
left = pd.DataFrame({'key':list('abaabc'),
'value':range(6)})
# key value
# 0 a 0
# 1 b 1
# 2 a 2
# 3 a 3
# 4 b 4
# 5 c 5
right = pd.DataFrame({'group_val':[3.5,7]},
index = ['a','b'])
# group_val
# a 3.5
# b 7.0
print(pd.merge(left,right,left_on='key',right_index=True,how = 'outer'))
# key value group_val
# 0 a 0 3.5
# 2 a 2 3.5
# 3 a 3 3.5
# 1 b 1 7.0
# 4 b 4 7.0
# 5 c 5 NaN
当具有多层索引时,情况更加的复杂:
left = pd.DataFrame({'key1':['O','O','O','N','N'],
'key2':[2000,2001,2002,2001,2002],
'data':np.arange(5.)})
# key1 key2 data
# 0 O 2000 0.0
# 1 O 2001 1.0
# 2 O 2002 2.0
# 3 N 2001 3.0
# 4 N 2002 4.0
right = pd.DataFrame(np.arange(12).reshape((6,2)),
index = [['N','N','O','O','O','O'],
[2001,2000,2000,2000,2001,2002]],
columns = ['envent1','evnet2'])
# envent1 evnet2
# N 2001 0 1
# 2000 2 3
# O 2000 4 5
# 2000 6 7
# 2001 8 9
# 2002 10 11
print(pd.merge(left,right,left_on=['key1','key2'],right_index=True))
# key1 key2 data envent1 evnet2
# 0 O 2000 0.0 4 5
# 0 O 2000 0.0 6 7
# 1 O 2001 1.0 8 9
# 2 O 2002 2.0 10 11
# 3 N 2001 3.0 0 1
print(pd.merge(left,right,left_on=['key1','key2'],right_index=True,how='outer'))
# key1 key2 data envent1 evnet2
# 0 O 2000 0.0 4.0 5.0
# 0 O 2000 0.0 6.0 7.0
# 1 O 2001 1.0 8.0 9.0
# 2 O 2002 2.0 10.0 11.0
# 3 N 2001 3.0 0.0 1.0
# 4 N 2002 4.0 NaN NaN
# 4 N 2000 NaN 2.0 3.0
也可以使用两边的索引进行合并,用法与上面类似,将left_index和right_index都设置为True即可。对于索引之间的合并我们还可以使用join方法:
left2 = pd.DataFrame([[1.,2.],[3.,4.],[5.,6.]],
index=list('abc'),columns=['O','N'])
# O N
# a 1.0 2.0
# b 3.0 4.0
# c 5.0 6.0
right2 = pd.DataFrame([[7.,8.],[9.,10.],[11.,12.],[13,14]],
index = list('abcd'),columns=['M','A'])
# M A
# a 7.0 8.0
# b 9.0 10.0
# c 11.0 12.0
# d 13.0 14.0
print(pd.merge(left2,right2,right_index=True,left_index=True))
# O N M A
# a 1.0 2.0 7.0 8.0
# b 3.0 4.0 9.0 10.0
# c 5.0 6.0 11.0 12.0
print(left2.join(right2,how='outer'))
# O N M A
# a 1.0 2.0 7.0 8.0
# b 3.0 4.0 9.0 10.0
# c 5.0 6.0 11.0 12.0
# d NaN NaN 13.0 14.0
沿轴向连接
我们可以使用pandas中的cancat函数来进行轴向连接,默认是沿着axis=0生效的,此时会返回一个Series,如果axis=1则返回一个DataFrame:
s1 = pd.Series([0,1],index=list('ab'))
s2 = pd.Series([2,3,4],index=list('cde'))
s3 = pd.Series([5,6],index=list('fg'))
print(pd.concat([s1,s2,s3]))
# a 0
# b 1
# c 2
# d 3
# e 4
# f 5
# g 6
# dtype: int64
print(pd.concat([s1,s2,s3],axis = 1,sort=False))
# 0 1 2
# a 0.0 NaN NaN
# b 1.0 NaN NaN
# c NaN 2.0 NaN
# d NaN 3.0 NaN
# e NaN 4.0 NaN
# f NaN NaN 5.0
# g NaN NaN 6.0
如果我们在进行连接时想加上索引以进行区分,可以使用keys参数实现:
print(pd.concat([s1,s2,s3],keys=['one','two','three']))
# one a 0
# b 1
# two c 2
# d 3
# e 4
# three f 5
# g 6
# dtype: int64
print(pd.concat([s1,s2,s3],keys=['one','two','three'],axis=1,sort=False))
# one two three
# a 0.0 NaN NaN
# b 1.0 NaN NaN
# c NaN 2.0 NaN
# d NaN 3.0 NaN
# e NaN 4.0 NaN
# f NaN NaN 5.0
# g NaN NaN 6.0
对于DataFrame,concat的用法类似:
df1 = pd.DataFrame(np.arange(6).reshape(3,2),index=list('abc'),
columns=['one','two'])
# one two
# a 0 1
# b 2 3
# c 4 5
df2 = pd.DataFrame(5+np.arange(4).reshape(2,2),index=['a','b'],
columns=['three','four'])
# three four
# a 5 6
# b 7 8
print(pd.concat([df1,df2],axis=1,keys=['lv1','lv2'],sort=False))
# lv1 lv2
# one two three four
# a 0 1 5.0 6.0
# b 2 3 7.0 8.0
# c 4 5 NaN NaN
对于按行连接,有:
df1 = pd.DataFrame(np.random.randn(3,4),columns=list('abcd'))
# a b c d
# 0 -0.804995 0.943988 -1.182787 -0.137897
# 1 0.134306 1.961522 0.752514 -0.100846
# 2 -0.129814 0.036155 -0.449419 0.005875
df2 = pd.DataFrame(np.random.randn(2,3),columns=list('bda'))
# b d a
# 0 -1.404842 2.205611 0.597599
# 1 0.662636 -1.347026 0.069774
print(pd.concat([df1,df2],ignore_index=True,sort=False))
# a b c d
# 0 -0.804995 0.943988 -1.182787 -0.137897
# 1 0.134306 1.961522 0.752514 -0.100846
# 2 -0.129814 0.036155 -0.449419 0.005875
# 3 0.597599 -1.404842 NaN 2.205611
# 4 0.069774 0.662636 NaN -1.347026
其中ignore_index设为True则表示产生新的轴索引。
concat函数的参数见下表:
参数 | 描述 |
objs | 需要连接的pandas对象列表或者字典;必选参数 |
axis | 连接的轴向;默认是0(沿着行方向) |
join | 可以是’inner’或者’outer’ (默认为’outer’),用于指定连接方式是内连接或者外连接 |
join_axis | 用于指定其他n-1轴的特定索引,可以替代内/外连接的逻辑 |
keys | 与要连接的对象关联的值,沿着连接轴形成分层索引;可以是任意值的列表或者数组,也可以是元组的数组,也可以是数组的列表 |
levels | 在键值传递时,该参数用于指示多层索引的层级 |
names | 如果传入了keys或者levels参数,该参数用于多层索引的层级名称 |
verify_integrity | 检查连接对象中的新轴是否重复,如果是则引发异常,默认为False |
ignore_index | 不沿着连接轴保留索引,而是产生一段新的索引 |
联合重叠数据
使用combine_frist方法可以将对象中的缺失部分用传入对象中对应的部分填补:
a = pd.Series([np.nan,2.5,0.0,3.5,4.5,np.nan],
index=list('fedcba'))
# f NaN
# e 2.5
# d 0.0
# c 3.5
# b 4.5
# a NaN
# dtype: float64
b = pd.Series([0.,np.nan,2,np.nan,np.nan,5.],
index=list('abcdef'))
# a 0.0
# b NaN
# c 2.0
# d NaN
# e NaN
# f 5.0
print(b.combine_first(a))
# a 0.0
# b 4.5
# c 2.0
# d 0.0
# e 2.5
# f 5.0
# dtype: float64
在DataFrame对象中,combine_frist逐列做相同的操作:
df1 = pd.DataFrame({
'a':[1.,np.nan,5.,np.nan],
'b':[np.nan,2,np.nan,6.],
'c':range(2,18,4)
})
# a b c
# 0 1.0 NaN 2
# 1 NaN 2.0 6
# 2 5.0 NaN 10
# 3 NaN 6.0 14
df2 = pd.DataFrame({
'a':[5.,4.,np.nan,3.,7.],
'b':[np.nan,3.,4.,6.,8.]
})
# a b
# 0 5.0 NaN
# 1 4.0 3.0
# 2 NaN 4.0
# 3 3.0 6.0
# 4 7.0 8.0
print(df1.combine_first(df2))
# a b c
# 0 1.0 NaN 2.0
# 1 4.0 2.0 6.0
# 2 5.0 4.0 10.0
# 3 3.0 6.0 14.0
# 4 7.0 8.0 NaN