Why is pandas apply with a lambda slower than a loop here?

Posted on 2021-01-29 15:02:44

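I have a pandas DataFrame that I want to filter based on whether certain conditions are met. I ran a loop and an .apply() version and timed both with %%timeit. The dataset has about 45,000 rows. The code snippet for the loop is: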

%%timeit
qualified_actions = []
for row in all_actions.index:
    if all_actions.ix[row,'Lower'] <= all_actions.ix[row, 'Mid'] <= all_actions.ix[row,'Upper']:
        qualified_actions.append(True)
    else:
        qualified_actions.append(False)

1.44 s ± 3.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

And the .apply() version is:

%%timeit
qualified_actions = all_actions.apply(lambda row: row['Lower'] <= row['Mid'] <= row['Upper'], axis=1)

6.71 s ± 54.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I thought .apply() was supposed to be faster than looping over a pandas DataFrame. Can someone explain why it is slower in this case?

1 Answer
  • 面试哥
    面试哥 2021-01-29

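    apply uses loops under the hood, so if you need better performance, the best and fastest options are vectorized alternatives. With axis=1, apply also builds a Series object for every row, which adds overhead on top of the loop itself.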
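
To make the per-row cost concrete, here is a minimal sketch that records what `apply` hands to the function when `axis=1`. The three-column toy frame and the `check` helper are assumptions made up for illustration; they are not part of the original question.

```python
import pandas as pd

# Toy data (hypothetical): two rows with the same column names as the question.
df = pd.DataFrame({"Lower": [1, 5], "Mid": [2, 4], "Upper": [3, 6]})

seen_types = []  # type name of every object passed to the function


def check(row):
    # Record the type of the argument, then apply the same chained comparison
    # that the lambda in the question uses.
    seen_types.append(type(row).__name__)
    return row["Lower"] <= row["Mid"] <= row["Upper"]


result = df.apply(check, axis=1)

print(result.tolist())
print(set(seen_types))
```

Each call receives a full `Series`, so a 45,000-row frame means at least 45,000 Series constructions plus 45,000 Python function calls.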

    A vectorized solution with no loops, just two chained conditions:

    m1 = all_actions['Lower'] <= all_actions['Mid']
    m2 = all_actions['Mid'] <= all_actions['Upper']
    qualified_actions = m1 & m2
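
As a sanity check, the sketch below compares the two-mask approach with an explicit loop on a small random frame. It is only an illustration under stated assumptions: the frame `df`, its size, and the seed are made up, and the loop uses `.loc` because `.ix` was removed in pandas 1.0. The masks are combined with `&` because a chained comparison such as `a <= b <= c` on whole Series raises a `ValueError` (the truth value of a Series is ambiguous).

```python
import numpy as np
import pandas as pd

# Hypothetical test data: 1000 rows of integers in [0, 50).
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.integers(0, 50, size=(1000, 3)), columns=["Lower", "Mid", "Upper"]
)

# Row-by-row version, using .loc for label-based scalar access.
loop_result = [
    bool(df.loc[i, "Lower"] <= df.loc[i, "Mid"] <= df.loc[i, "Upper"])
    for i in df.index
]

# Vectorized version: two element-wise comparisons combined with &.
mask = (df["Lower"] <= df["Mid"]) & (df["Mid"] <= df["Upper"])

print(mask.tolist() == loop_result)
```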
    

    Thanks to Jon Clements for another solution:

    all_actions.Mid.between(all_actions.Lower, all_actions.Upper)
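
A small sketch of how `between` behaves when the bounds are themselves Series: the comparison is done row by row, and with the default arguments both bounds are included, which matches the `<=` checks in the question. The three-row frame is made up for illustration.

```python
import pandas as pd

# Hypothetical data: the last row has Lower == Mid == Upper to exercise the bounds.
df = pd.DataFrame({"Lower": [1, 5, 2], "Mid": [2, 4, 2], "Upper": [3, 6, 2]})

# Row-wise check Lower <= Mid <= Upper, with both bounds included by default.
result = df["Mid"].between(df["Lower"], df["Upper"])

print(result.tolist())  # [True, False, True]
```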
    

    Timings:

    import numpy as np
    import pandas as pd

    np.random.seed(2017)
    N = 45000
    all_actions = pd.DataFrame(np.random.randint(50, size=(N, 3)), columns=['Lower', 'Mid', 'Upper'])
    
    #print (all_actions)
    

    In [85]: %%timeit
        ...: qualified_actions = []
        ...: for row in all_actions.index:
        ...:     if all_actions.ix[row,'Lower'] <= all_actions.ix[row, 'Mid'] <= all_actions.ix[row,'Upper']:
        ...:         qualified_actions.append(True)
        ...:     else:
        ...:         qualified_actions.append(False)
        ...: 
        ...: 
    __main__:259: DeprecationWarning: 
    .ix is deprecated. Please use
    .loc for label based indexing or
    .iloc for positional indexing
    
    See the documentation here:
    http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
    1 loop, best of 3: 579 ms per loop
    
    In [86]: %%timeit
        ...: (all_actions.apply(lambda row: row['Lower'] <= row['Mid'] <= row['Upper'], axis=1))
        ...: 
    1 loop, best of 3: 1.17 s per loop
    
    In [87]: %%timeit
        ...: ((all_actions['Lower'] <= all_actions['Mid']) & (all_actions['Mid'] <= all_actions['Upper']))
        ...: 
    1000 loops, best of 3: 509 µs per loop
    
    
    In [90]: %%timeit
        ...: (all_actions.Mid.between(all_actions.Lower, all_actions.Upper))
        ...: 
    1000 loops, best of 3: 520 µs per loop
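
The timings above come from an IPython session on a pandas version that still had `.ix`. To repeat the comparison as a plain script on a current pandas, something like the following sketch could be used. It is an assumption-laden adaptation rather than the original benchmark: the loop uses `.loc` instead of `.ix`, the function names are made up, `N` is reduced so the script finishes quickly, and the standard-library `timeit` module replaces `%%timeit`. Absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed(2017)
N = 5000  # smaller than the 45000 rows used above so the sketch runs quickly
all_actions = pd.DataFrame(
    np.random.randint(50, size=(N, 3)), columns=["Lower", "Mid", "Upper"]
)


def with_loop():
    # Explicit Python loop with label-based scalar access (.loc replaces .ix).
    out = []
    for row in all_actions.index:
        ok = (
            all_actions.loc[row, "Lower"]
            <= all_actions.loc[row, "Mid"]
            <= all_actions.loc[row, "Upper"]
        )
        out.append(bool(ok))
    return out


def with_apply():
    # Row-wise apply: the lambda is called once per row with a Series.
    return all_actions.apply(
        lambda row: row["Lower"] <= row["Mid"] <= row["Upper"], axis=1
    )


def with_masks():
    # Two vectorized comparisons combined element-wise.
    return (all_actions["Lower"] <= all_actions["Mid"]) & (
        all_actions["Mid"] <= all_actions["Upper"]
    )


def with_between():
    # Series.between with Series bounds, both ends included by default.
    return all_actions["Mid"].between(all_actions["Lower"], all_actions["Upper"])


candidates = {
    "loop": with_loop,
    "apply": with_apply,
    "masks": with_masks,
    "between": with_between,
}

# Best of 3 runs, one call per run.
timings = {
    name: min(timeit.repeat(fn, number=1, repeat=3))
    for name, fn in candidates.items()
}

for name, seconds in timings.items():
    print(f"{name:>8}: {seconds * 1e3:.3f} ms")
```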
    

