不平等加入了熊猫?
我通常使用Dataframe.merge组合熊猫中的数据框。据我了解,这仅适用于平等联接。使用其他类型的检查(例如,不等式)来联接两个数据框的惯用方式是什么?
-
熊猫合并()允许
outer
,left
,right
连接(而不仅仅是inner
连接)两个数据帧之间,这样你就可以返回匹配的记录。此外,merge()
甚至可以泛化为返回交叉联接(两个数据帧之间的所有组合匹配),并且随后进行过滤可以返回不匹配的记录。还有,还有isin() pandas方法。考虑下面的演示。下面是我们喜欢的两种数据框架, 计算机语言
。如图所示,第一数据帧是第二数据帧的子集。外部联接返回的记录都包含NaN
不匹配的列,以后可以将其过滤掉。交叉联接返回可以过滤的完整完整行,并isin()
在列中搜索值:import pandas as pd df1 = pd.DataFrame({'Languages': ['C++', 'C', 'Java', 'C#', 'Python', 'PHP'], 'Uses': ['computing', 'computing', 'application', 'application', 'application', 'web'], 'Type': ['Proprietary', 'Proprietary', 'Proprietary', 'Proprietary', 'Open-Source', 'Open-Source']}) df2 = pd.DataFrame({'Languages': ['C++', 'C', 'Java', 'C#', 'Python', 'PHP', 'Perl', 'R', 'Ruby', 'VB.NET', 'Javascript', 'Matlab'], 'Uses': ['computing', 'computing', 'application', 'application', 'application', 'web', 'application', 'computing', 'web', 'application', 'web', 'computing'], 'Type': ['Proprietary', 'Proprietary', 'Proprietary', 'Proprietary', 'Open-Source', 'Open-Source', 'Open-Source', 'Open-Source', 'Open-Source', 'Proprietary', 'Open-Source', 'Proprietary']}) # OUTER JOIN mergedf = pd.merge(df1, df2, on=['Languages'], how='outer') # FILTER OUT LANGUAGES IN SMALLER THAT IS NULL mergedf = mergedf[pd.isnull(mergedf['Type_x'])][['Languages', 'Uses_y', 'Type_y']] # Languages Uses_y Type_y #6 Perl application Open-Source #7 R computing Open-Source #8 Ruby web Open-Source #9 VB.NET application Proprietary #10 Javascript web Open-Source #11 Matlab computing Proprietary # ISIN COMPARISON, RETURNING RECORDS IN LARGER NOT IN SMALLER unequaldf = df2[~df2.Languages.isin(df1['Languages'])] # Languages Type Uses #6 Perl Open-Source application #7 R Open-Source computing #8 Ruby Open-Source web #9 VB.NET Proprietary application #10 Javascript Open-Source web #11 Matlab Proprietary computing # CROSS JOIN df1['key'] = 1 # (REQUIRES A JOIN KEY OF SAME VALUE) df2['key'] = 1 crossjoindf = pd.merge(df1, df2, on=['key']) # FILTER FOR LANGUAGES IN LARGER NOT IN SMALLER (ALSO USING ISIN) crossjoindf = crossjoindf[~crossjoindf['Languages_y'].isin(crossjoindf['Languages_x'])]\ [['Languages_y', 'Uses_y', 'Type_y']].drop_duplicates() # Languages_y Uses_y Type_y #6 Perl application Open-Source #7 R computing Open-Source #8 Ruby web Open-Source #9 VB.NET application Proprietary #10 Javascript web Open-Source #11 Matlab computing Proprietary
诚然,交叉连接在这里可能是多余且冗长的,但是如果您无与伦比的需求需要跨数据帧进行排列,那么它会很方便。