Python:如何通过保留第一个数据框的信息来合并列上的两个数据框?

发布于 2021-02-02 23:21:48

我有两个数据框df1和df2。df1包含人的年龄信息,而df2包含人的性别信息。并非所有人都在里面df1或里面df2

df1
     Name   Age 
0     Tom    34
1     Sara   18
2     Eva    44
3     Jack   27
4     Laura  30

df2
     Name      Sex 
0     Tom       M
1     Paul      M
2     Eva       F
3     Jack      M
4     Michelle  F

我想有人民的性别的信息df1和设置NaN,如果我没有在这个信息df2。我尝试这样做,df1 = pd.merge(df1, df2, on = 'Name', how = 'outer')但是我保留了一些df2我不想要的信息。


df1
     Name   Age     Sex
0     Tom    34      M
1     Sara   18     NaN
2     Eva    44      F
3     Jack   27      M
4     Laura  30     NaN
关注者
0
被浏览
154
1 个回答
  • 面试哥
    面试哥 2021-02-02
    为面试而生,有面试问题,就找面试哥。
    Sample:
    
    df1 = pd.DataFrame({'Name': ['Tom', 'Sara', 'Eva', 'Jack', 'Laura'], 
                        'Age': [34, 18, 44, 27, 30]})
    
    #print (df1)
    df3 = df1.copy()
    
    df2 = pd.DataFrame({'Name': ['Tom', 'Paul', 'Eva', 'Jack', 'Michelle'], 
                        'Sex': ['M', 'M', 'F', 'M', 'F']})
    #print (df2)
    

    使用map由Series创建人set_index

    df1['Sex'] = df1['Name'].map(df2.set_index('Name')['Sex'])
    print (df1)
        Name  Age  Sex
    0    Tom   34    M
    1   Sara   18  NaN
    2    Eva   44    F
    3   Jack   27    M
    4  Laura   30  NaN
    

    merge左连接的替代解决方案:

    df = df3.merge(df2[['Name','Sex']], on='Name', how='left')
    print (df)
        Name  Age  Sex
    0    Tom   34    M
    1   Sara   18  NaN
    2    Eva   44    F
    3   Jack   27    M
    4  Laura   30  NaN
    

    如果需要通过多列映射(例如Year和Code),则需要merge左连接:

    df1 = pd.DataFrame({'Name': ['Tom', 'Sara', 'Eva', 'Jack', 'Laura'], 
                        'Year':[2000,2003,2003,2004,2007],
                        'Code':[1,2,3,4,4],
                        'Age': [34, 18, 44, 27, 30]})
    
    print (df1)
        Name  Year  Code  Age
    0    Tom  2000     1   34
    1   Sara  2003     2   18
    2    Eva  2003     3   44
    3   Jack  2004     4   27
    4  Laura  2007     4   30
    
    df2 = pd.DataFrame({'Name': ['Tom', 'Paul', 'Eva', 'Jack', 'Michelle'], 
                        'Sex': ['M', 'M', 'F', 'M', 'F'],
                        'Year':[2001,2003,2003,2004,2007],
                        'Code':[1,2,3,5,3],
                        'Val':[21,34,23,44,67]})
    print (df2)
           Name Sex  Year  Code  Val
    0       Tom   M  2001     1   21
    1      Paul   M  2003     2   34
    2       Eva   F  2003     3   23
    3      Jack   M  2004     5   44
    4  Michelle   F  2007     3   67
    
    #merge by all columns
    df = df1.merge(df2, on=['Year','Code'], how='left')
    print (df)
      Name_x  Year  Code  Age Name_y  Sex   Val
    0    Tom  2000     1   34    NaN  NaN   NaN
    1   Sara  2003     2   18   Paul    M  34.0
    2    Eva  2003     3   44    Eva    F  23.0
    3   Jack  2004     4   27    NaN  NaN   NaN
    4  Laura  2007     4   30    NaN  NaN   NaN
    
    #specified columns - columns for join (Year, Code) need always + appended columns (Val)
    df = df1.merge(df2[['Year','Code', 'Val']], on=['Year','Code'], how='left')
    print (df)
        Name  Year  Code  Age   Val
    0    Tom  2000     1   34   NaN
    1   Sara  2003     2   18  34.0
    2    Eva  2003     3   44  23.0
    3   Jack  2004     4   27   NaN
    4  Laura  2007     4   30   NaN
    

    如果获取错误map意味着按连接列重复,则在这里Name:

    df1 = pd.DataFrame({'Name': ['Tom', 'Sara', 'Eva', 'Jack', 'Laura'], 
                        'Age': [34, 18, 44, 27, 30]})
    
    print (df1)
        Name  Age
    0    Tom   34
    1   Sara   18
    2    Eva   44
    3   Jack   27
    4  Laura   30
    
    df3, df4 = df1.copy(), df1.copy()
    
    df2 = pd.DataFrame({'Name': ['Tom', 'Tom', 'Eva', 'Jack', 'Michelle'], 
                        'Val': [1,2,3,4,5]})
    print (df2)
           Name  Val
    0       Tom    1 <-duplicated name Tom
    1       Tom    2 <-duplicated name Tom
    2       Eva    3
    3      Jack    4
    4  Michelle    5
    
    s = df2.set_index('Name')['Val']
    df1['New'] = df1['Name'].map(s)
    print (df1)
    

    InvalidIndexError:重新索引仅对唯一值的Index对象有效

    解决方案通过删除重复项DataFrame.drop_duplicates,或dict在最后一次重复匹配中使用map by :

    #default keep first value
    s = df2.drop_duplicates('Name').set_index('Name')['Val']
    print (s)
    Name
    Tom         1
    Eva         3
    Jack        4
    Michelle    5
    Name: Val, dtype: int64
    
    df1['New'] = df1['Name'].map(s)
    print (df1)
        Name  Age  New
    0    Tom   34  1.0
    1   Sara   18  NaN
    2    Eva   44  3.0
    3   Jack   27  4.0
    4  Laura   30  NaN
    
    #add parameter for keep last value 
    s = df2.drop_duplicates('Name', keep='last').set_index('Name')['Val']
    print (s)
    Name
    Tom         2
    Eva         3
    Jack        4
    Michelle    5
    Name: Val, dtype: int64
    
    df3['New'] = df3['Name'].map(s)
    print (df3)
        Name  Age  New
    0    Tom   34  2.0
    1   Sara   18  NaN
    2    Eva   44  3.0
    3   Jack   27  4.0
    4  Laura   30  NaN
    
    #map by dictionary
    d = dict(zip(df2['Name'], df2['Val']))
    print (d)
    {'Tom': 2, 'Eva': 3, 'Jack': 4, 'Michelle': 5}
    
    df4['New'] = df4['Name'].map(d)
    print (df4)
        Name  Age  New
    0    Tom   34  2.0
    1   Sara   18  NaN
    2    Eva   44  3.0
    3   Jack   27  4.0
    4  Laura   30  NaN
    


知识点
面圈网VIP题库

面圈网VIP题库全新上线,海量真题题库资源。 90大类考试,超10万份考试真题开放下载啦

去下载看看