在pandas数据框中分组重复的列ID

发布于 2021-01-29 19:34:08

现在有很多类似的问题,但是大多数回答了如何删除重复的列。但是,我想知道如何制作一个元组列表,其中每个元组都包含重复列的列名。我假设每一列都有一个唯一的名称。为了进一步说明我的问题:

df = pd.DataFrame({'A': [1, 2, 3, 4, 5],'B': [2, 4, 2, 1, 9],
                   'C': [1, 2, 3, 4, 5],'D': [2, 4, 2, 1, 9],
                   'E': [3, 4, 2, 1, 2],'F': [1, 1, 1, 1, 1]},
                   index = ['a1', 'a2', 'a3', 'a4', 'a5'])

然后我想要输出:

[('A', 'C'), ('B', 'D')]

如果今天您感觉很好,则将相同的问题扩展到行。如何获取元组列表,其中每个元组都包含重复的行。

关注者
0
被浏览
104
1 个回答
  • 面试哥
    面试哥 2021-01-29
    为面试而生,有面试问题,就找面试哥。

    这是NumPy的一种方法-

    def group_duplicate_cols(df):
        a = df.values
        sidx = np.lexsort(a)
        b = a[:,sidx]
    
        m = np.concatenate(([False], (b[:,1:] == b[:,:-1]).all(0), [False] ))
        idx = np.flatnonzero(m[1:] != m[:-1])
        C = df.columns[sidx].tolist()
        return [C[i:j] for i,j in zip(idx[::2],idx[1::2]+1)]
    

    样品运行-

    In [100]: df
    Out[100]: 
        A  B  C  D  E  F
    a1  1  2  1  2  3  1
    a2  2  4  2  4  4  1
    a3  3  2  3  2  2  1
    a4  4  1  4  1  1  1
    a5  5  9  5  9  2  1
    
    In [101]: group_duplicate_cols(df)
    Out[101]: [['A', 'C'], ['B', 'D']]
    
    # Let's add one more duplicate into group containing 'A'
    In [102]: df.F = df.A
    
    In [103]: group_duplicate_cols(df)
    Out[103]: [['A', 'C', 'F'], ['B', 'D']]
    

    进行转换即可,但是对于rows(index),我们只需要沿另一条轴切换操作,就像这样-

    def group_duplicate_rows(df):
        a = df.values
        sidx = np.lexsort(a.T)
        b = a[sidx]
    
        m = np.concatenate(([False], (b[1:] == b[:-1]).all(1), [False] ))
        idx = np.flatnonzero(m[1:] != m[:-1])
        C = df.index[sidx].tolist()
        return [C[i:j] for i,j in zip(idx[::2],idx[1::2]+1)]
    

    样品运行-

    In [260]: df2
    Out[260]: 
       a1  a2  a3  a4  a5
    A   3   5   3   4   5
    B   1   1   1   1   1
    C   3   5   3   4   5
    D   2   9   2   1   9
    E   2   2   2   1   2
    F   1   1   1   1   1
    
    In [261]: group_duplicate_rows(df2)
    Out[261]: [['B', 'F'], ['A', 'C']]
    

    标杆管理

    方法-

    # @John Galt's soln-1
    from itertools import combinations
    def combinations_app(df):
        return[x for x in combinations(df.columns, 2) if (df[x[0]] == df[x[-1]]).all()]
    
    # @Abdou's soln
    def pandas_groupby_app(df):
        return [tuple(d.index) for _,d in df.T.groupby(list(df.T.columns)) if len(d) > 1]
    
    # @COLDSPEED's soln
    def triu_app(df):
        c = df.columns.tolist()
        i, j = np.triu_indices(len(c), 1)
        x = [(c[_i], c[_j]) for _i, _j in zip(i, j) if (df[c[_i]] == df[c[_j]]).all()]
        return x
    
    # @cmaher's soln
    def lambda_set_app(df):
        return list(filter(lambda x: len(x) > 1, list(set([tuple([x for x in df.columns if all(df[x] == df[y])]) for y in df.columns]))))
    

    注意:@John Galt's soln-2未包括在内,因为输入的大小(8000,500)会与针对该输入的提议不符broadcasting

    时间-

    In [179]: # Setup inputs with sizes as mentioned in the question
         ...: df = pd.DataFrame(np.random.randint(0,10,(8000,500)))
         ...: df.columns = ['C'+str(i) for i in range(df.shape[1])]
         ...: idx0 = np.random.choice(df.shape[1], df.shape[1]//2,replace=0)
         ...: idx1 = np.random.choice(df.shape[1], df.shape[1]//2,replace=0)
         ...: df.iloc[:,idx0] = df.iloc[:,idx1].values
         ...:
    
    # @John Galt's soln-1
    In [180]: %timeit combinations_app(df)
    1 loops, best of 3: 24.6 s per loop
    
    # @Abdou's soln
    In [181]: %timeit pandas_groupby_app(df)
    1 loops, best of 3: 3.81 s per loop
    
    # @COLDSPEED's soln
    In [182]: %timeit triu_app(df)
    1 loops, best of 3: 25.5 s per loop
    
    # @cmaher's soln
    In [183]: %timeit lambda_set_app(df)
    1 loops, best of 3: 27.1 s per loop
    
    # Proposed in this post
    In [184]: %timeit group_duplicate_cols(df)
    10 loops, best of 3: 188 ms per loop
    

    借助NumPy的视图功能实现超级助推

    借助NumPy的视图功能,我们可以将每组元素视为一个dtype,这样可以进一步提高性能,如下所示-

    def view1D(a): # a is array
        a = np.ascontiguousarray(a)
        void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
        return a.view(void_dt).ravel()
    
    def group_duplicate_cols_v2(df):
        a = df.values
        sidx = view1D(a.T).argsort()
        b = a[:,sidx]
    
        m = np.concatenate(([False], (b[:,1:] == b[:,:-1]).all(0), [False] ))
        idx = np.flatnonzero(m[1:] != m[:-1])
        C = df.columns[sidx].tolist()
        return [C[i:j] for i,j in zip(idx[::2],idx[1::2]+1)]
    

    时间-

    In [322]: %timeit group_duplicate_cols(df)
    10 loops, best of 3: 185 ms per loop
    
    In [323]: %timeit group_duplicate_cols_v2(df)
    10 loops, best of 3: 69.3 ms per loop
    

    只是疯狂的加速!



知识点
面圈网VIP题库

面圈网VIP题库全新上线,海量真题题库资源。 90大类考试,超10万份考试真题开放下载啦

去下载看看