将熊猫数据框与关键重复项合并
我有2个数据框,两个数据框都有一个可能有重复的键列,但这些数据框大多具有相同的重复键。我想将这些数据帧合并到该键上,但是以这样的方式,当两个数据帧具有相同的重复项时,这些重复项将分别合并。另外,如果一个数据框比另一个数据框具有更多的重复键,我希望将其值填充为NaN。例如:
df1 = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K2', 'K2', 'K3'],
'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']},
columns=['key', 'A'])
df2 = pd.DataFrame({'B': ['B0', 'B1', 'B2', 'B3', 'B4', 'B5', 'B6'],
'key': ['K0', 'K1', 'K2', 'K2', 'K3', 'K3', 'K4']},
columns=['key', 'B'])
key A
0 K0 A0
1 K1 A1
2 K2 A2
3 K2 A3
4 K2 A4
5 K3 A5
key B
0 K0 B0
1 K1 B1
2 K2 B2
3 K2 B3
4 K3 B4
5 K3 B5
6 K4 B6
我正在尝试获得以下输出
key A B
0 K0 A0 B0
1 K1 A1 B1
2 K2 A2 B2
3 K2 A3 B3
6 K2 A4 NaN
8 K3 A5 B4
9 K3 NaN B5
10 K4 NaN B6
因此,基本上,我想将重复的K2键视为K2_1,K2_2 …,然后在数据帧上进行how =’outer’合并。有什么想法我可以做到这一点吗?
-
再快一点
%%cython # using cython in jupyter notebook # in another cell run `%load_ext Cython` from collections import defaultdict import numpy as np def cg(x): cnt = defaultdict(lambda: 0) for j in x.tolist(): cnt[j] += 1 yield cnt[j] def fastcount(x): return [i for i in cg(x)] df1['cc'] = fastcount(df1.key.values) df2['cc'] = fastcount(df2.key.values) df1.merge(df2, how='outer').drop('cc', 1)
更快的答案; 不可扩展
def fastcount(x): unq, inv = np.unique(x, return_inverse=1) m = np.arange(len(unq))[:, None] == inv return (m.cumsum(1) * m).sum(0) df1['cc'] = fastcount(df1.key.values) df2['cc'] = fastcount(df2.key.values) df1.merge(df2, how='outer').drop('cc', 1)
旧答案
df1['cc'] = df1.groupby('key').cumcount() df2['cc'] = df2.groupby('key').cumcount() df1.merge(df2, how='outer').drop('cc', 1)