不平等加入了熊猫?

发布于 2021-01-29 16:07:16

我通常使用Dataframe.merge组合熊猫中的数据框。据我了解,这仅适用于平等联接。使用其他类型的检查(例如,不等式)来联接两个数据框的惯用方式是什么?

关注者
0
被浏览
45
1 个回答
  • 面试哥
    面试哥 2021-01-29
    为面试而生,有面试问题,就找面试哥。

    熊猫合并()允许outerleftright连接(而不仅仅是inner连接)两个数据帧之间,这样你就可以返回匹配的记录。此外,merge()甚至可以泛化为返回交叉联接(两个数据帧之间的所有组合匹配),并且随后进行过滤可以返回不匹配的记录。还有,还有isin() pandas方法。

    考虑下面的演示。下面是我们喜欢的两种数据框架, 计算机语言
    。如图所示,第一数据帧是第二数据帧的子集。外部联接返回的记录都包含NaN不匹配的列,以后可以将其过滤掉。交叉联接返回可以过滤的完整完整行,并isin()在列中搜索值:

    import pandas as pd
    
    df1 = pd.DataFrame({'Languages': ['C++', 'C', 'Java', 'C#', 'Python', 'PHP'],
                        'Uses': ['computing', 'computing', 'application', 'application', 'application', 'web'], 
                        'Type': ['Proprietary', 'Proprietary', 'Proprietary', 'Proprietary', 'Open-Source', 'Open-Source']})
    
    df2 = pd.DataFrame({'Languages': ['C++', 'C', 'Java', 'C#', 'Python', 'PHP',
                                     'Perl', 'R', 'Ruby', 'VB.NET', 'Javascript', 'Matlab'],
                        'Uses': ['computing', 'computing', 'application', 'application', 'application', 'web',
                                'application', 'computing', 'web', 'application', 'web', 'computing'],
                        'Type': ['Proprietary', 'Proprietary', 'Proprietary', 'Proprietary', 'Open-Source',
                                'Open-Source', 'Open-Source', 'Open-Source', 'Open-Source', 'Proprietary',
                                'Open-Source', 'Proprietary']})
    
    # OUTER JOIN 
    mergedf = pd.merge(df1, df2, on=['Languages'], how='outer')
    # FILTER OUT LANGUAGES IN SMALLER THAT IS NULL
    mergedf = mergedf[pd.isnull(mergedf['Type_x'])][['Languages', 'Uses_y', 'Type_y']]
    
    #     Languages       Uses_y       Type_y
    #6         Perl  application  Open-Source
    #7            R    computing  Open-Source
    #8         Ruby          web  Open-Source
    #9       VB.NET  application  Proprietary
    #10  Javascript          web  Open-Source
    #11      Matlab    computing  Proprietary
    
    
    # ISIN COMPARISON, RETURNING RECORDS IN LARGER NOT IN SMALLER
    unequaldf = df2[~df2.Languages.isin(df1['Languages'])]
    
    #     Languages         Type         Uses
    #6         Perl  Open-Source  application
    #7            R  Open-Source    computing
    #8         Ruby  Open-Source          web
    #9       VB.NET  Proprietary  application
    #10  Javascript  Open-Source          web
    #11      Matlab  Proprietary    computing
    
    
    # CROSS JOIN 
    df1['key'] = 1                 # (REQUIRES A JOIN KEY OF SAME VALUE)
    df2['key'] = 1                    
    crossjoindf = pd.merge(df1, df2, on=['key'])
    # FILTER FOR LANGUAGES IN LARGER NOT IN SMALLER (ALSO USING ISIN)
    crossjoindf = crossjoindf[~crossjoindf['Languages_y'].isin(crossjoindf['Languages_x'])]\
                        [['Languages_y', 'Uses_y', 'Type_y']].drop_duplicates()
    
    #   Languages_y       Uses_y       Type_y
    #6         Perl  application  Open-Source
    #7            R    computing  Open-Source
    #8         Ruby          web  Open-Source
    #9       VB.NET  application  Proprietary
    #10  Javascript          web  Open-Source
    #11      Matlab    computing  Proprietary
    

    诚然,交叉连接在这里可能是多余且冗长的,但是如果您无与伦比的需求需要跨数据帧进行排列,那么它会很方便。



知识点
面圈网VIP题库

面圈网VIP题库全新上线,海量真题题库资源。 90大类考试,超10万份考试真题开放下载啦

去下载看看