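How to merge pandas DataFrames on a string-contains match?
I have 2 dataframes that I would like to merge on a common column. However, the values in the column I want to merge on are not identical strings; rather, a string from one is contained in a string from the other, like so: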
import pandas as pd
df1 = pd.DataFrame({'column_a':['John','Michael','Dan','George', 'Adam'], 'column_common':['code','other','ome','no match','word']})
df2 = pd.DataFrame({'column_b':['Smith','Cohen','Moore','K', 'Faber'], 'column_common':['some string','other string','some code','this code','word']})
I would like the result of df1.merge(df2, ...) to be the following:
column_a | column_b
----------------------
John | Moore <- merged on 'code' contained in 'some code'
Michael | Cohen <- merged on 'other' contained in 'other string'
Dan | Smith <- merged on 'ome' contained in 'some string'
George | n/a
Adam | Faber <- merged on 'word' contained in 'word'
New answer
Here is one approach based on pandas / numpy.
rhs = (df1.column_common
          .apply(lambda x: df2[df2.column_common.str.find(x).ge(0)]['column_b'])
          .bfill(axis=1)
          .iloc[:, 0])

(pd.concat([df1.column_a, rhs], axis=1, ignore_index=True)
   .rename(columns={0: 'column_a', 1: 'column_b'}))

  column_a column_b
0     John    Moore
1  Michael    Cohen
2      Dan    Smith
3   George      NaN
4     Adam    Faber
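To make the lookup inside the lambda easier to follow, here is a minimal sketch of that step for a single key. The variable names key, mask and matched are illustrative only and do not appear in the answer. Note that str.find does a literal substring search, whereas str.contains treats its pattern as a regular expression by default.

```python
import pandas as pd

df2 = pd.DataFrame({
    'column_b': ['Smith', 'Cohen', 'Moore', 'K', 'Faber'],
    'column_common': ['some string', 'other string', 'some code', 'this code', 'word'],
})

key = 'code'  # one value taken from df1.column_common

# str.find returns the position of the substring, or -1 if it is absent,
# so .ge(0) gives a boolean mask of the df2 rows that contain the key.
mask = df2.column_common.str.find(key).ge(0)

# All column_b values whose column_common contains the key.
matched = df2.loc[mask, 'column_b']
print(matched)
```

When a key matches more than one row, as 'code' does here, the answer keeps only one of the candidates per row of df1.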
Old answer
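Here is a solution with inner-join behaviour, in that it does not keep column_a values that do not match any column_b value. It is slower than the numpy / pandas solution above because it uses two nested iterrows loops to build a Python list.

# Collect (column_a, column_b) for every pair where df1's string is contained in df2's
tups = [(a1, a2) for i, (a1, b1) in df1.iterrows()
                 for j, (a2, b2) in df2.iterrows()
        if b1 in b2]

(pd.DataFrame(tups, columns=['column_a', 'column_b'])
   .drop_duplicates('column_a')
   .reset_index(drop=True))

  column_a column_b
0     John    Moore
1  Michael    Cohen
2      Dan    Smith
3     Adam    Faber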
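As a possible alternative to the nested loops, the same containment match can be sketched with a cross join. This is only a sketch under assumptions: it needs pandas 1.2 or later for how='cross', and the names pairs, mask, matches and result are illustrative, not part of the answers above. It keeps unmatched column_a rows as NaN and, when several df2 rows match, keeps the first one in df2 order.

```python
import pandas as pd

df1 = pd.DataFrame({
    'column_a': ['John', 'Michael', 'Dan', 'George', 'Adam'],
    'column_common': ['code', 'other', 'ome', 'no match', 'word'],
})
df2 = pd.DataFrame({
    'column_b': ['Smith', 'Cohen', 'Moore', 'K', 'Faber'],
    'column_common': ['some string', 'other string', 'some code', 'this code', 'word'],
})

# Pair every row of df1 with every row of df2 (assumes pandas >= 1.2).
# The shared column name gets the suffixes '_1' and '_2'.
pairs = df1.merge(df2, how='cross', suffixes=('_1', '_2'))

# Keep only the pairs where df1's string is a substring of df2's string.
mask = [a in b for a, b in zip(pairs['column_common_1'], pairs['column_common_2'])]
matches = (pairs.loc[mask, ['column_a', 'column_b']]
                .drop_duplicates('column_a'))

# Left-merge back onto df1 so rows without any match are kept with NaN.
result = df1[['column_a']].merge(matches, on='column_a', how='left')
print(result)
```

The cross join materialises len(df1) * len(df2) rows, so this trades memory for simplicity and is only reasonable for moderately sized frames.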