Why are NumPy functions so slow on pandas Series/DataFrames?

Posted on 2021-01-29 15:15:04

Consider a small MWE, taken from another question:

DateTime                Data
2017-11-21 18:54:31     1
2017-11-22 02:26:48     2
2017-11-22 10:19:44     3
2017-11-22 15:11:28     6
2017-11-22 23:21:58     7
2017-11-28 14:28:28    28
2017-11-28 14:36:40     0
2017-11-28 14:59:48     1

The goal is to clip all values to an upper bound of 1. My answer uses np.clip, which works nicely.

np.clip(df.Data, a_min=None, a_max=1)
array([1, 1, 1, 1, 1, 1, 0, 1])

Or,

np.clip(df.Data.values, a_min=None, a_max=1)
array([1, 1, 1, 1, 1, 1, 0, 1])

Both return the same answer. My question is about the relative performance of these two approaches. Consider:

df = pd.concat([df]*1000).reset_index(drop=True)

%timeit np.clip(df.Data, a_min=None, a_max=1)
1000 loops, best of 3: 270 µs per loop

%timeit np.clip(df.Data.values, a_min=None, a_max=1)
10000 loops, best of 3: 23.4 µs per loop

Why is there such a huge difference between the two, just from calling values in the latter? In other words...

Why are NumPy functions so slow on pandas objects?

1 answer
  • 面试哥 2021-01-29

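    Yes, it seems that np.clip is a lot slower on a pandas.Series than on numpy.ndarrays. That is correct, but (at least asymptotically) it is not that bad. 8000 elements is still in the regime where constant factors are the major contributors to the runtime. I think this is a very important aspect of the question, so I am visualizing it (borrowing from another answer):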

    # Setup
    
    import pandas as pd
    import numpy as np
    
    def on_series(s):
        return np.clip(s, a_min=None, a_max=1)
    
    def on_values_of_series(s):
        return np.clip(s.values, a_min=None, a_max=1)
    
    # Timing setup
    timings = {on_series: [], on_values_of_series: []}
    sizes = [2**i for i in range(1, 26, 2)]
    
    # Timing
    for size in sizes:
        func_input = pd.Series(np.random.randint(0, 30, size=size))
        for func in timings:
            res = %timeit -o func(func_input)
            timings[func].append(res)
    
    %matplotlib notebook
    
    import matplotlib.pyplot as plt
    import numpy as np
    
    fig, (ax1, ax2) = plt.subplots(1, 2)
    
    for func in timings:
        ax1.plot(sizes, 
                 [time.best for time in timings[func]], 
                 label=str(func.__name__))
    ax1.set_xscale('log')
    ax1.set_yscale('log')
    ax1.set_xlabel('size')
    ax1.set_ylabel('time [seconds]')
    ax1.grid(which='both')
    ax1.legend()
    
    baseline = on_values_of_series # choose one function as baseline
    for func in timings:
        ax2.plot(sizes, 
                 [time.best / ref.best for time, ref in zip(timings[func], timings[baseline])], 
                 label=str(func.__name__))
    ax2.set_yscale('log')
    ax2.set_xscale('log')
    ax2.set_xlabel('size')
    ax2.set_ylabel('time relative to {}'.format(baseline.__name__))
    ax2.grid(which='both')
    ax2.legend()
    
    plt.tight_layout()
    

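    [Figure: left panel, time in seconds versus size for on_series and on_values_of_series; right panel, time relative to on_values_of_series versus size]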

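    It is a log-log plot because I think this shows the important features more clearly. For example, it shows that np.clip on a numpy.ndarray is faster, but it also has a much smaller constant factor in that case. The difference for large arrays is only ~3! That is still a big difference, but far less than the difference on small arrays.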

    However, that still does not answer the question of where the time difference comes from.

    The solution is actually quite simple: np.clip delegates to the clip method of its first argument:

    >>> np.clip??
    Source:   
    def clip(a, a_min, a_max, out=None):
        """
        ...
        """
        return _wrapfunc(a, 'clip', a_min, a_max, out=out)
    
    >>> np.core.fromnumeric._wrapfunc??
    Source:   
    def _wrapfunc(obj, method, *args, **kwds):
        try:
            return getattr(obj, method)(*args, **kwds)
        # ...
        except (AttributeError, TypeError):
            return _wrapit(obj, method, *args, **kwds)
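
    To make the delegation concrete, here is a minimal pure-Python sketch of the same dispatch pattern. PlainArray, LabeledArray, wrapfunc and this clip function are hypothetical stand-ins written for illustration; they are not NumPy or pandas code. The sketch only assumes what the quoted source shows: the free function looks up a method by name on its first argument and calls it.

```python
class PlainArray:
    """Hypothetical container whose clip does nothing but clip."""

    def __init__(self, data):
        self.data = list(data)

    def clip(self, a_min, a_max, out=None):
        lo = float("-inf") if a_min is None else a_min
        hi = float("inf") if a_max is None else a_max
        return [min(max(x, lo), hi) for x in self.data]


class LabeledArray:
    """Hypothetical wrapper whose clip validates bounds and rebuilds labels."""

    def __init__(self, data, index=None):
        self.data = list(data)
        self.index = list(index) if index is not None else list(range(len(self.data)))

    def clip(self, a_min, a_max, out=None):
        # Extra work before clipping: reject NaN bounds (NaN != NaN).
        for bound in (a_min, a_max):
            if bound is not None and bound != bound:
                raise ValueError("Cannot use an NA value as a clip threshold")
        clipped = PlainArray(self.data).clip(a_min, a_max)
        # Extra work after clipping: build a new labeled object.
        return LabeledArray(clipped, self.index)


def wrapfunc(obj, method, *args, **kwds):
    """Simplified stand-in for the dispatch helper quoted above."""
    try:
        return getattr(obj, method)(*args, **kwds)
    except (AttributeError, TypeError):
        # Fallback for plain iterables that have no such method.
        return getattr(PlainArray(obj), method)(*args, **kwds)


def clip(a, a_min, a_max, out=None):
    # The free function does no clipping itself; it only dispatches.
    return wrapfunc(a, "clip", a_min, a_max, out=out)


data = [1, 2, 3, 6, 7, 28, 0, 1]
plain = clip(PlainArray(data), None, 1)      # runs PlainArray.clip
labeled = clip(LabeledArray(data), None, 1)  # runs LabeledArray.clip
print(plain)
print(labeled.data)
```

    The same call to clip runs two different bodies of code depending on the type of its first argument, so its cost is whatever that type's own method costs.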
    

    The getattr line of the _wrapfunc function is the important line here, because np.ndarray.clip and pd.Series.clip are different methods. Yes, completely different methods:

    >>> np.ndarray.clip
    <method 'clip' of 'numpy.ndarray' objects>
    >>> pd.Series.clip
    <function pandas.core.generic.NDFrame.clip>
    

    Unfortunately, np.ndarray.clip is a C function, so it is hard to profile. However, pd.Series.clip is a regular Python function, so it is easy to profile. Let's use a Series of 5000 integers here:

    s = pd.Series(np.random.randint(0, 100, 5000))
    

    For np.clip on the values I get the following line profile:

    %load_ext line_profiler
    %lprun -f np.clip -f np.core.fromnumeric._wrapfunc np.clip(s.values, a_min=None, a_max=1)
    
    Timer unit: 4.10256e-07 s
    
    Total time: 2.25641e-05 s
    File: numpy\core\fromnumeric.py
    Function: clip at line 1673
    
    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
      1673                                           def clip(a, a_min, a_max, out=None):
      1674                                               """
      ...
      1726                                               """
      1727         1           55     55.0    100.0      return _wrapfunc(a, 'clip', a_min, a_max, out=out)
    
    Total time: 1.51795e-05 s
    File: numpy\core\fromnumeric.py
    Function: _wrapfunc at line 55
    
    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
        55                                           def _wrapfunc(obj, method, *args, **kwds):
        56         1            2      2.0      5.4      try:
        57         1           35     35.0     94.6          return getattr(obj, method)(*args, **kwds)
        58                                           
        59                                               # An AttributeError occurs if the object does not have
        60                                               # such a method in its class.
        61                                           
        62                                               # A TypeError occurs if the object does have such a method
        63                                               # in its class, but its signature is not identical to that
        64                                               # of NumPy's. This situation has occurred in the case of
        65                                               # a downstream library like 'pandas'.
        66                                               except (AttributeError, TypeError):
        67                                                   return _wrapit(obj, method, *args, **kwds)
    

    But for np.clip on the Series I get a totally different profiling result:

    %lprun -f np.clip -f np.core.fromnumeric._wrapfunc -f pd.Series.clip -f pd.Series._clip_with_scalar np.clip(s, a_min=None, a_max=1)
    
    Timer unit: 4.10256e-07 s
    
    Total time: 0.000823794 s
    File: numpy\core\fromnumeric.py
    Function: clip at line 1673
    
    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
      1673                                           def clip(a, a_min, a_max, out=None):
      1674                                               """
      ...
      1726                                               """
      1727         1         2008   2008.0    100.0      return _wrapfunc(a, 'clip', a_min, a_max, out=out)
    
    Total time: 0.00081846 s
    File: numpy\core\fromnumeric.py
    Function: _wrapfunc at line 55
    
    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
        55                                           def _wrapfunc(obj, method, *args, **kwds):
        56         1            2      2.0      0.1      try:
        57         1         1993   1993.0     99.9          return getattr(obj, method)(*args, **kwds)
        58                                           
        59                                               # An AttributeError occurs if the object does not have
        60                                               # such a method in its class.
        61                                           
        62                                               # A TypeError occurs if the object does have such a method
        63                                               # in its class, but its signature is not identical to that
        64                                               # of NumPy's. This situation has occurred in the case of
        65                                               # a downstream library like 'pandas'.
        66                                               except (AttributeError, TypeError):
        67                                                   return _wrapit(obj, method, *args, **kwds)
    
    Total time: 0.000804922 s
    File: pandas\core\generic.py
    Function: clip at line 4969
    
    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
      4969                                               def clip(self, lower=None, upper=None, axis=None, inplace=False,
      4970                                                        *args, **kwargs):
      4971                                                   """
      ...
      5021                                                   """
      5022         1           12     12.0      0.6          if isinstance(self, ABCPanel):
      5023                                                       raise NotImplementedError("clip is not supported yet for panels")
      5024                                           
      5025         1           10     10.0      0.5          inplace = validate_bool_kwarg(inplace, 'inplace')
      5026                                           
      5027         1           69     69.0      3.5          axis = nv.validate_clip_with_axis(axis, args, kwargs)
      5028                                           
      5029                                                   # GH 17276
      5030                                                   # numpy doesn't like NaN as a clip value
      5031                                                   # so ignore
      5032         1          158    158.0      8.1          if np.any(pd.isnull(lower)):
      5033         1            3      3.0      0.2              lower = None
      5034         1           26     26.0      1.3          if np.any(pd.isnull(upper)):
      5035                                                       upper = None
      5036                                           
      5037                                                   # GH 2747 (arguments were reversed)
      5038         1            1      1.0      0.1          if lower is not None and upper is not None:
      5039                                                       if is_scalar(lower) and is_scalar(upper):
      5040                                                           lower, upper = min(lower, upper), max(lower, upper)
      5041                                           
      5042                                                   # fast-path for scalars
      5043         1            1      1.0      0.1          if ((lower is None or (is_scalar(lower) and is_number(lower))) and
      5044         1           28     28.0      1.4                  (upper is None or (is_scalar(upper) and is_number(upper)))):
      5045         1         1654   1654.0     84.3              return self._clip_with_scalar(lower, upper, inplace=inplace)
      5046                                           
      5047                                                   result = self
      5048                                                   if lower is not None:
      5049                                                       result = result.clip_lower(lower, axis, inplace=inplace)
      5050                                                   if upper is not None:
      5051                                                       if inplace:
      5052                                                           result = self
      5053                                                       result = result.clip_upper(upper, axis, inplace=inplace)
      5054                                           
      5055                                                   return result
    
    Total time: 0.000662153 s
    File: pandas\core\generic.py
    Function: _clip_with_scalar at line 4920
    
    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
      4920                                               def _clip_with_scalar(self, lower, upper, inplace=False):
      4921         1            2      2.0      0.1          if ((lower is not None and np.any(isna(lower))) or
      4922         1           25     25.0      1.5                  (upper is not None and np.any(isna(upper)))):
      4923                                                       raise ValueError("Cannot use an NA value as a clip threshold")
      4924                                           
      4925         1           22     22.0      1.4          result = self.values
      4926         1          571    571.0     35.4          mask = isna(result)
      4927                                           
      4928         1           95     95.0      5.9          with np.errstate(all='ignore'):
      4929         1            1      1.0      0.1              if upper is not None:
      4930         1          141    141.0      8.7                  result = np.where(result >= upper, upper, result)
      4931         1           33     33.0      2.0              if lower is not None:
      4932                                                           result = np.where(result <= lower, lower, result)
      4933         1           73     73.0      4.5          if np.any(mask):
      4934                                                       result[mask] = np.nan
      4935                                           
      4936         1           90     90.0      5.6          axes_dict = self._construct_axes_dict()
      4937         1          558    558.0     34.6          result = self._constructor(result, **axes_dict).__finalize__(self)
      4938                                           
      4939         1            2      2.0      0.1          if inplace:
      4940                                                       self._update_inplace(result)
      4941                                                   else:
      4942         1            1      1.0      0.1              return result
    

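    I stopped going into the subroutines at that point, because it already highlights where pd.Series.clip does much more work than np.ndarray.clip. Just compare the total time of the np.clip call on the values (55 timer units) to one of the first checks in the pandas.Series.clip method, the if np.any(pd.isnull(lower)) line (158 timer units). At that point the pandas method has not even started clipping, and it has already taken 3 times longer.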
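
    As a quick check of that comparison, the timer units can be converted to seconds and compared directly. The snippet below is only arithmetic on numbers copied from the two 5000-element profiles above; the variable names are mine.

```python
# Timer unit and hit times copied from the 5000-element profiles above.
TIMER_UNIT = 4.10256e-07  # seconds per timer unit


def to_seconds(units):
    """Convert line_profiler timer units to seconds."""
    return units * TIMER_UNIT


values_call = 55    # np.clip(s.values, ...), whole call
series_call = 2008  # np.clip(s, ...), whole call
isnull_check = 158  # `if np.any(pd.isnull(lower))` inside Series.clip

print(f"ndarray path: {to_seconds(values_call):.3e} s")
print(f"Series path:  {to_seconds(series_call):.3e} s")
print(f"Series / ndarray:       {series_call / values_call:.1f}x")
print(f"isnull check / ndarray: {isnull_check / values_call:.1f}x")
```

    The single isnull check costs about 2.9 times the whole ndarray call, and the full Series call about 36.5 times. These are single-run profiler figures, so they indicate the scale rather than give a precise benchmark.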

    However, several of these "overheads" become insignificant when the array is big:

    s = pd.Series(np.random.randint(0, 100, 1000000))
    
    %lprun -f np.clip -f np.core.fromnumeric._wrapfunc -f pd.Series.clip -f pd.Series._clip_with_scalar np.clip(s, a_min=None, a_max=1)
    
    Timer unit: 4.10256e-07 s
    
    Total time: 0.00593476 s
    File: numpy\core\fromnumeric.py
    Function: clip at line 1673
    
    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
      1673                                           def clip(a, a_min, a_max, out=None):
      1674                                               """
      ...
      1726                                               """
      1727         1        14466  14466.0    100.0      return _wrapfunc(a, 'clip', a_min, a_max, out=out)
    
    Total time: 0.00592779 s
    File: numpy\core\fromnumeric.py
    Function: _wrapfunc at line 55
    
    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
        55                                           def _wrapfunc(obj, method, *args, **kwds):
        56         1            1      1.0      0.0      try:
        57         1        14448  14448.0    100.0          return getattr(obj, method)(*args, **kwds)
        58                                           
        59                                               # An AttributeError occurs if the object does not have
        60                                               # such a method in its class.
        61                                           
        62                                               # A TypeError occurs if the object does have such a method
        63                                               # in its class, but its signature is not identical to that
        64                                               # of NumPy's. This situation has occurred in the case of
        65                                               # a downstream library like 'pandas'.
        66                                               except (AttributeError, TypeError):
        67                                                   return _wrapit(obj, method, *args, **kwds)
    
    Total time: 0.00591302 s
    File: pandas\core\generic.py
    Function: clip at line 4969
    
    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
      4969                                               def clip(self, lower=None, upper=None, axis=None, inplace=False,
      4970                                                        *args, **kwargs):
      4971                                                   """
      ...
      5021                                                   """
      5022         1           17     17.0      0.1          if isinstance(self, ABCPanel):
      5023                                                       raise NotImplementedError("clip is not supported yet for panels")
      5024                                           
      5025         1           14     14.0      0.1          inplace = validate_bool_kwarg(inplace, 'inplace')
      5026                                           
      5027         1           97     97.0      0.7          axis = nv.validate_clip_with_axis(axis, args, kwargs)
      5028                                           
      5029                                                   # GH 17276
      5030                                                   # numpy doesn't like NaN as a clip value
      5031                                                   # so ignore
      5032         1          125    125.0      0.9          if np.any(pd.isnull(lower)):
      5033         1            2      2.0      0.0              lower = None
      5034         1           30     30.0      0.2          if np.any(pd.isnull(upper)):
      5035                                                       upper = None
      5036                                           
      5037                                                   # GH 2747 (arguments were reversed)
      5038         1            2      2.0      0.0          if lower is not None and upper is not None:
      5039                                                       if is_scalar(lower) and is_scalar(upper):
      5040                                                           lower, upper = min(lower, upper), max(lower, upper)
      5041                                           
      5042                                                   # fast-path for scalars
      5043         1            2      2.0      0.0          if ((lower is None or (is_scalar(lower) and is_number(lower))) and
      5044         1           32     32.0      0.2                  (upper is None or (is_scalar(upper) and is_number(upper)))):
      5045         1        14092  14092.0     97.8              return self._clip_with_scalar(lower, upper, inplace=inplace)
      5046                                           
      5047                                                   result = self
      5048                                                   if lower is not None:
      5049                                                       result = result.clip_lower(lower, axis, inplace=inplace)
      5050                                                   if upper is not None:
      5051                                                       if inplace:
      5052                                                           result = self
      5053                                                       result = result.clip_upper(upper, axis, inplace=inplace)
      5054                                           
      5055                                                   return result
    
    Total time: 0.00575753 s
    File: pandas\core\generic.py
    Function: _clip_with_scalar at line 4920
    
    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
      4920                                               def _clip_with_scalar(self, lower, upper, inplace=False):
      4921         1            2      2.0      0.0          if ((lower is not None and np.any(isna(lower))) or
      4922         1           28     28.0      0.2                  (upper is not None and np.any(isna(upper)))):
      4923                                                       raise ValueError("Cannot use an NA value as a clip threshold")
      4924                                           
      4925         1          120    120.0      0.9          result = self.values
      4926         1         3525   3525.0     25.1          mask = isna(result)
      4927                                           
      4928         1           86     86.0      0.6          with np.errstate(all='ignore'):
      4929         1            2      2.0      0.0              if upper is not None:
      4930         1         9314   9314.0     66.4                  result = np.where(result >= upper, upper, result)
      4931         1           61     61.0      0.4              if lower is not None:
      4932                                                           result = np.where(result <= lower, lower, result)
      4933         1          283    283.0      2.0          if np.any(mask):
      4934                                                       result[mask] = np.nan
      4935                                           
      4936         1           78     78.0      0.6          axes_dict = self._construct_axes_dict()
      4937         1          532    532.0      3.8          result = self._constructor(result, **axes_dict).__finalize__(self)
      4938                                           
      4939         1            2      2.0      0.0          if inplace:
      4940                                                       self._update_inplace(result)
      4941                                                   else:
      4942         1            1      1.0      0.0              return result
    

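    There are still multiple function calls, for example isna and np.where, that take a significant amount of time, but overall this is at least comparable to the np.ndarray.clip time (this is the regime where the timing difference is ~3 on my computer).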
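
    One way to see which parts are fixed overhead and which scale with the data is to compare the same lines of _clip_with_scalar across the two profiles. The snippet below only reuses timer units copied from the 5000-element and 1000000-element runs above; the dictionary labels are shorthand of mine.

```python
# Timer units per line of _clip_with_scalar, copied from the two profiles above.
small = {  # 5000 elements
    "self.values": 22,
    "isna(result)": 571,
    "np.where(...)": 141,
    "_constructor(...)": 558,
}
large = {  # 1000000 elements
    "self.values": 120,
    "isna(result)": 3525,
    "np.where(...)": 9314,
    "_constructor(...)": 532,
}

size_growth = 1_000_000 / 5_000  # the input is 200 times larger

# How much each line's cost grew between the two runs.
growth = {line: large[line] / small[line] for line in small}
for line, factor in sorted(growth.items(), key=lambda kv: kv[1]):
    print(f"{line:<18} grew {factor:6.1f}x (input grew {size_growth:.0f}x)")
```

    In these two runs the constructor call costs about the same at both sizes, which is what a constant overhead looks like, while np.where grows by roughly 66 times and ends up dominating. Each profile is a single run, so the factors are rough.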

    The takeaways should probably be:

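    • Many NumPy functions just delegate to a method of the object passed in, so there can be huge differences when you pass in different objects.
    • Profiling, especially line profiling, can be a great tool for finding where a performance difference comes from.
    • In cases like this, always make sure to test objects of different sizes. You could be comparing constant factors that probably do not matter unless you process lots of small arrays.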
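
    Outside a notebook, the size sweep can be reproduced with the standard timeit module instead of %timeit. This is a minimal sketch that assumes reasonably recent NumPy and pandas (it uses np.random.default_rng and Series.to_numpy, which the versions listed below predate); best_time, clip_series and clip_values are helper names of mine.

```python
import timeit

import numpy as np
import pandas as pd


def best_time(func, arg, repeat=3, number=20):
    """Return the best per-call time in seconds over several repeats."""
    runs = timeit.repeat(lambda: func(arg), repeat=repeat, number=number)
    return min(runs) / number


def clip_series(s):
    # Dispatches to the pandas clip method.
    return np.clip(s, a_min=None, a_max=1)


def clip_values(s):
    # Operates directly on the underlying ndarray.
    return np.clip(s.to_numpy(), a_min=None, a_max=1)


rng = np.random.default_rng(0)
ratios = {}
for size in (10, 1_000, 100_000):
    s = pd.Series(rng.integers(0, 30, size=size))
    # Both paths must agree before their timings are worth comparing.
    assert np.array_equal(np.asarray(clip_series(s)), clip_values(s))
    ratios[size] = best_time(clip_series, s) / best_time(clip_values, s)

for size, ratio in ratios.items():
    print(f"size={size:>7}: Series path / ndarray path = {ratio:.1f}x")
```

    The ratios depend on the machine and on the library versions, so they are not expected to match the figures quoted in this answer.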

    Versions used:

    Python 3.6.3 64-bit on Windows 10
    Numpy 1.13.3
    Pandas 0.21.1
    

