Why are NumPy functions so slow on pandas Series / DataFrames?
Consider a small MWE, taken from another question:

    DateTime             Data
    2017-11-21 18:54:31     1
    2017-11-22 02:26:48     2
    2017-11-22 10:19:44     3
    2017-11-22 15:11:28     6
    2017-11-22 23:21:58     7
    2017-11-28 14:28:28    28
    2017-11-28 14:36:40     0
    2017-11-28 14:59:48     1

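The goal is to clip all values so that 1 is the upper bound. My answer uses np.clip, which works well.

    np.clip(df.Data, a_min=None, a_max=1)
    array([1, 1, 1, 1, 1, 1, 0, 1])

Or,

    np.clip(df.Data.values, a_min=None, a_max=1)
    array([1, 1, 1, 1, 1, 1, 0, 1])

Both return the same answer. My question is about the relative performance of these two methods. Consider: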

    df = pd.concat([df]*1000).reset_index(drop=True)

    %timeit np.clip(df.Data, a_min=None, a_max=1)
    1000 loops, best of 3: 270 µs per loop

    %timeit np.clip(df.Data.values, a_min=None, a_max=1)
    10000 loops, best of 3: 23.4 µs per loop

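Why is there such a huge difference between the two, just from calling values on the latter? In other words...
Why are NumPy functions so slow on pandas objects?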
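
The %timeit figures above come from an IPython session. Here is a minimal sketch of the same comparison with the standard-library timeit module, so it can run as a plain script. The frame is rebuilt from the sample data with the DateTime index left out, and best_time is a hypothetical helper, not part of NumPy or pandas. Absolute numbers will depend on the machine and the library versions.

```python
import timeit

import numpy as np
import pandas as pd

# Rebuild the 8-row sample (DateTime index omitted for brevity), then
# repeat it 1000 times as in the question, giving 8000 rows.
df = pd.DataFrame({"Data": [1, 2, 3, 6, 7, 28, 0, 1]})
df = pd.concat([df] * 1000).reset_index(drop=True)


def best_time(func, repeat=5, number=100):
    """Hypothetical helper: best observed time per call, in seconds."""
    return min(timeit.repeat(func, repeat=repeat, number=number)) / number


t_series = best_time(lambda: np.clip(df.Data, a_min=None, a_max=1))
t_values = best_time(lambda: np.clip(df.Data.values, a_min=None, a_max=1))

print(f"on Series : {t_series * 1e6:8.1f} us per call")
print(f"on .values: {t_values * 1e6:8.1f} us per call")
print(f"ratio     : {t_series / t_values:8.1f}x")
```

Taking the minimum over several repeats mirrors the "best of 3" that %timeit reports in the output above.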
-
Yes, np.clip does seem to be a lot slower on a pandas.Series than on a numpy.ndarray. That is correct, but it is actually (at least asymptotically) not that bad. 8000 elements is still in the regime where constant factors are the major contributors to the runtime. I think this is a very important aspect of the question, so I am visualizing it (borrowing from another answer):

    # Setup (run in IPython/Jupyter: %timeit and %matplotlib are IPython magics)
    import pandas as pd
    import numpy as np

    def on_series(s):
        return np.clip(s, a_min=None, a_max=1)

    def on_values_of_series(s):
        return np.clip(s.values, a_min=None, a_max=1)

    # Timing setup
    timings = {on_series: [], on_values_of_series: []}
    sizes = [2**i for i in range(1, 26, 2)]

    # Timing
    for size in sizes:
        func_input = pd.Series(np.random.randint(0, 30, size=size))
        for func in timings:
            res = %timeit -o func(func_input)
            timings[func].append(res)

    %matplotlib notebook

    import matplotlib.pyplot as plt

    fig, (ax1, ax2) = plt.subplots(1, 2)

    # Left panel: absolute timings
    for func in timings:
        ax1.plot(sizes,
                 [time.best for time in timings[func]],
                 label=str(func.__name__))
    ax1.set_xscale('log')
    ax1.set_yscale('log')
    ax1.set_xlabel('size')
    ax1.set_ylabel('time [seconds]')
    ax1.grid(which='both')
    ax1.legend()

    # Right panel: timings relative to a baseline
    baseline = on_values_of_series  # choose one function as baseline
    for func in timings:
        ax2.plot(sizes,
                 [time.best / ref.best
                  for time, ref in zip(timings[func], timings[baseline])],
                 label=str(func.__name__))
    ax2.set_yscale('log')
    ax2.set_xscale('log')
    ax2.set_xlabel('size')
    ax2.set_ylabel('time relative to {}'.format(baseline.__name__))
    ax2.grid(which='both')
    ax2.legend()

    plt.tight_layout()

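It is a log-log plot because I think that shows the important features more clearly. For example, it shows that np.clip on a numpy.ndarray is faster, but also that its constant factor is much smaller in that case. The difference for large arrays is only a factor of ~3! That is still a big difference, but far smaller than the difference for small arrays. However, that still does not answer the question of where the time difference comes from.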
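
One way to read such a plot is with a simple cost model: time = fixed overhead + per-element cost × size. The sketch below uses made-up constants. They are illustrative assumptions, not measurements, chosen only so that the per-element costs differ by a factor of 3. The modelled ratio starts far above 3 for small inputs and approaches 3 as the size grows.

```python
# Illustrative cost model: t(n) = overhead + per_element * n.
# All four constants are assumptions picked for illustration only.
SERIES_OVERHEAD = 200e-6     # fixed cost per call in seconds (assumed)
NDARRAY_OVERHEAD = 5e-6      # fixed cost per call in seconds (assumed)
SERIES_PER_ELEMENT = 6e-9    # cost per element in seconds (assumed)
NDARRAY_PER_ELEMENT = 2e-9   # cost per element in seconds (assumed)


def modelled_time(overhead, per_element, n):
    """Modelled runtime for an input of n elements."""
    return overhead + per_element * n


def modelled_ratio(n):
    """Modelled Series time divided by modelled ndarray time."""
    series = modelled_time(SERIES_OVERHEAD, SERIES_PER_ELEMENT, n)
    ndarray = modelled_time(NDARRAY_OVERHEAD, NDARRAY_PER_ELEMENT, n)
    return series / ndarray


for n in (2**1, 2**7, 2**13, 2**19, 2**25):
    print(f"n={n:>9}  modelled ratio={modelled_ratio(n):6.2f}")
```

For small n the ratio is governed by the two overheads, and for large n by the two per-element costs, which is the shape the relative-timing panel describes.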
The solution is actually quite simple: np.clip delegates to the clip method of its first argument:

    >>> np.clip??
    Source:
    def clip(a, a_min, a_max, out=None):
        """
        ...
        """
        return _wrapfunc(a, 'clip', a_min, a_max, out=out)

    >>> np.core.fromnumeric._wrapfunc??
    Source:
    def _wrapfunc(obj, method, *args, **kwds):
        try:
            return getattr(obj, method)(*args, **kwds)
        # ...
        except (AttributeError, TypeError):
            return _wrapit(obj, method, *args, **kwds)

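
The dispatch pattern quoted above can be imitated in a few lines. This is a simplified sketch, not NumPy's actual code: clip_like and SlowContainer are hypothetical names, and the fallback branch only approximates what _wrapit does. It shows that the cost of the call depends entirely on which clip method the argument's class provides.

```python
import numpy as np


def clip_like(obj, lo, hi):
    """Simplified stand-in for the delegation pattern (hypothetical)."""
    try:
        # Whatever clip method the object's class defines gets called.
        return getattr(obj, "clip")(lo, hi)
    except (AttributeError, TypeError):
        # No usable method: convert to an ndarray and use its method.
        return np.asarray(obj).clip(lo, hi)


class SlowContainer:
    """Hypothetical wrapper whose clip does extra bookkeeping."""

    def __init__(self, data):
        self.data = np.asarray(data)
        self.calls = 0

    def clip(self, lo, hi):
        # Stands in for validation, masking and re-wrapping work.
        self.calls += 1
        return SlowContainer(self.data.clip(lo, hi))


values = [1, 2, 3, 6, 7, 28, 0, 1]
wrapped = SlowContainer(values)

from_list = clip_like(values, None, 1)             # list has no clip: fallback
from_array = clip_like(np.array(values), None, 1)  # dispatches to ndarray.clip
from_wrapper = clip_like(wrapped, None, 1)         # dispatches to SlowContainer.clip

print(type(from_list).__name__, type(from_array).__name__, type(from_wrapper).__name__)
```

The same top-level call ends up in three different code paths, and only the caller's choice of argument type decides which one runs.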
The getattr line of the _wrapfunc function is the important line here, because np.ndarray.clip and pd.Series.clip are different methods. Yes, completely different methods:

    >>> np.ndarray.clip
    <method 'clip' of 'numpy.ndarray' objects>
    >>> pd.Series.clip
    <function pandas.core.generic.NDFrame.clip>

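
The same distinction can be checked programmatically with the standard-library inspect module. This is a small sketch; the variable names are mine, and the exact source path printed will differ between pandas versions and installations.

```python
import inspect

import numpy as np
import pandas as pd

# ndarray.clip is a method descriptor implemented in C:
# there is no Python source for a Python-level profiler to step through.
ndarray_clip_is_descriptor = inspect.ismethoddescriptor(np.ndarray.clip)
print(type(np.ndarray.clip).__name__, ndarray_clip_is_descriptor)

# Series.clip is an ordinary Python function defined in a .py file.
series_clip_is_function = inspect.isfunction(pd.Series.clip)
print(type(pd.Series.clip).__name__, series_clip_is_function)
print(inspect.getsourcefile(pd.Series.clip))
```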
Unfortunately, np.ndarray.clip is a C function, so it is hard to profile. pd.Series.clip, however, is a regular Python function, so it is easy to profile. Let's use a Series of 5000 integers here:

    s = pd.Series(np.random.randint(0, 100, 5000))

For np.clip on the values I get the following line profile:

    %load_ext line_profiler
    %lprun -f np.clip -f np.core.fromnumeric._wrapfunc np.clip(s.values, a_min=None, a_max=1)

    Timer unit: 4.10256e-07 s

    Total time: 2.25641e-05 s
    File: numpy\core\fromnumeric.py
    Function: clip at line 1673

    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
      1673                                           def clip(a, a_min, a_max, out=None):
      1674                                               """
       ...
      1726                                               """
      1727         1           55     55.0    100.0      return _wrapfunc(a, 'clip', a_min, a_max, out=out)

    Total time: 1.51795e-05 s
    File: numpy\core\fromnumeric.py
    Function: _wrapfunc at line 55

    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
        55                                           def _wrapfunc(obj, method, *args, **kwds):
        56         1            2      2.0      5.4      try:
        57         1           35     35.0     94.6          return getattr(obj, method)(*args, **kwds)
        58
        59                                               # An AttributeError occurs if the object does not have
        60                                               # such a method in its class.
        61
        62                                               # A TypeError occurs if the object does have such a method
        63                                               # in its class, but its signature is not identical to that
        64                                               # of NumPy's. This situation has occurred in the case of
        65                                               # a downstream library like 'pandas'.
        66                                               except (AttributeError, TypeError):
        67                                                   return _wrapit(obj, method, *args, **kwds)

But for np.clip on the Series I get a totally different profiling result:

    %lprun -f np.clip -f np.core.fromnumeric._wrapfunc -f pd.Series.clip -f pd.Series._clip_with_scalar np.clip(s, a_min=None, a_max=1)

    Timer unit: 4.10256e-07 s

    Total time: 0.000823794 s
    File: numpy\core\fromnumeric.py
    Function: clip at line 1673

    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
      1673                                           def clip(a, a_min, a_max, out=None):
      1674                                               """
       ...
      1726                                               """
      1727         1         2008   2008.0    100.0      return _wrapfunc(a, 'clip', a_min, a_max, out=out)

    Total time: 0.00081846 s
    File: numpy\core\fromnumeric.py
    Function: _wrapfunc at line 55

    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
        55                                           def _wrapfunc(obj, method, *args, **kwds):
        56         1            2      2.0      0.1      try:
        57         1         1993   1993.0     99.9          return getattr(obj, method)(*args, **kwds)
        58
        59                                               # An AttributeError occurs if the object does not have
        60                                               # such a method in its class.
        61
        62                                               # A TypeError occurs if the object does have such a method
        63                                               # in its class, but its signature is not identical to that
        64                                               # of NumPy's. This situation has occurred in the case of
        65                                               # a downstream library like 'pandas'.
        66                                               except (AttributeError, TypeError):
        67                                                   return _wrapit(obj, method, *args, **kwds)

    Total time: 0.000804922 s
    File: pandas\core\generic.py
    Function: clip at line 4969

    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
      4969                                               def clip(self, lower=None, upper=None, axis=None, inplace=False,
      4970                                                        *args, **kwargs):
      4971                                                   """
       ...
      5021                                                   """
      5022         1           12     12.0      0.6          if isinstance(self, ABCPanel):
      5023                                                       raise NotImplementedError("clip is not supported yet for panels")
      5024
      5025         1           10     10.0      0.5          inplace = validate_bool_kwarg(inplace, 'inplace')
      5026
      5027         1           69     69.0      3.5          axis = nv.validate_clip_with_axis(axis, args, kwargs)
      5028
      5029                                                   # GH 17276
      5030                                                   # numpy doesn't like NaN as a clip value
      5031                                                   # so ignore
      5032         1          158    158.0      8.1          if np.any(pd.isnull(lower)):
      5033         1            3      3.0      0.2              lower = None
      5034         1           26     26.0      1.3          if np.any(pd.isnull(upper)):
      5035                                                       upper = None
      5036
      5037                                                   # GH 2747 (arguments were reversed)
      5038         1            1      1.0      0.1          if lower is not None and upper is not None:
      5039                                                       if is_scalar(lower) and is_scalar(upper):
      5040                                                           lower, upper = min(lower, upper), max(lower, upper)
      5041
      5042                                                   # fast-path for scalars
      5043         1            1      1.0      0.1          if ((lower is None or (is_scalar(lower) and is_number(lower))) and
      5044         1           28     28.0      1.4                  (upper is None or (is_scalar(upper) and is_number(upper)))):
      5045         1         1654   1654.0     84.3              return self._clip_with_scalar(lower, upper, inplace=inplace)
      5046
      5047                                                   result = self
      5048                                                   if lower is not None:
      5049                                                       result = result.clip_lower(lower, axis, inplace=inplace)
      5050                                                   if upper is not None:
      5051                                                       if inplace:
      5052                                                           result = self
      5053                                                       result = result.clip_upper(upper, axis, inplace=inplace)
      5054
      5055                                                   return result

    Total time: 0.000662153 s
    File: pandas\core\generic.py
    Function: _clip_with_scalar at line 4920

    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
      4920                                               def _clip_with_scalar(self, lower, upper, inplace=False):
      4921         1            2      2.0      0.1          if ((lower is not None and np.any(isna(lower))) or
      4922         1           25     25.0      1.5                  (upper is not None and np.any(isna(upper)))):
      4923                                                       raise ValueError("Cannot use an NA value as a clip threshold")
      4924
      4925         1           22     22.0      1.4          result = self.values
      4926         1          571    571.0     35.4          mask = isna(result)
      4927
      4928         1           95     95.0      5.9          with np.errstate(all='ignore'):
      4929         1            1      1.0      0.1              if upper is not None:
      4930         1          141    141.0      8.7                  result = np.where(result >= upper, upper, result)
      4931         1           33     33.0      2.0              if lower is not None:
      4932                                                           result = np.where(result <= lower, lower, result)
      4933         1           73     73.0      4.5          if np.any(mask):
      4934                                                       result[mask] = np.nan
      4935
      4936         1           90     90.0      5.6          axes_dict = self._construct_axes_dict()
      4937         1          558    558.0     34.6          result = self._constructor(result, **axes_dict).__finalize__(self)
      4938
      4939         1            2      2.0      0.1          if inplace:
      4940                                                       self._update_inplace(result)
      4941                                                   else:
      4942         1            1      1.0      0.1              return result

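I stopped going into the subroutines at that point, because it already highlights where pd.Series.clip does much more work than np.ndarray.clip. Just compare the total time of the np.clip call on the values (55 timer units) to one of the first checks in the pandas.Series.clip method, the if np.any(pd.isnull(lower)) line (158 timer units). At that point the pandas method has not even started clipping, and it has already taken about 3 times longer.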
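
The Time column in these tables is in timer units, so multiplying by the Timer unit value printed at the top of the output gives seconds. Here is a small sketch of that arithmetic, using only figures from the profiler output above (the variable names are mine):

```python
TIMER_UNIT = 4.10256e-07        # seconds per unit, from the "Timer unit" line

clip_on_values_units = 55       # whole np.clip call on s.values
isnull_check_units = 158        # the `if np.any(pd.isnull(lower))` line
clip_on_series_units = 2008     # whole np.clip call on the Series

# Convert the two whole-call figures to seconds.
clip_on_values_seconds = clip_on_values_units * TIMER_UNIT
clip_on_series_seconds = clip_on_series_units * TIMER_UNIT

# Ratios behind the comparison in the text.
check_vs_values = isnull_check_units / clip_on_values_units
series_vs_values = clip_on_series_units / clip_on_values_units

print(f"np.clip on values: {clip_on_values_seconds:.3e} s")
print(f"np.clip on Series: {clip_on_series_seconds:.3e} s")
print(f"isnull check vs. whole call on values: {check_vs_values:.2f}x")
print(f"whole call on Series vs. on values:    {series_vs_values:.1f}x")
```

The converted values agree with the "Total time" lines that the profiler prints for the two np.clip calls.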
However, several of these "overheads" become insignificant when the array is big:

    s = pd.Series(np.random.randint(0, 100, 1000000))

    %lprun -f np.clip -f np.core.fromnumeric._wrapfunc -f pd.Series.clip -f pd.Series._clip_with_scalar np.clip(s, a_min=None, a_max=1)

    Timer unit: 4.10256e-07 s

    Total time: 0.00593476 s
    File: numpy\core\fromnumeric.py
    Function: clip at line 1673

    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
      1673                                           def clip(a, a_min, a_max, out=None):
      1674                                               """
       ...
      1726                                               """
      1727         1        14466  14466.0    100.0      return _wrapfunc(a, 'clip', a_min, a_max, out=out)

    Total time: 0.00592779 s
    File: numpy\core\fromnumeric.py
    Function: _wrapfunc at line 55

    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
        55                                           def _wrapfunc(obj, method, *args, **kwds):
        56         1            1      1.0      0.0      try:
        57         1        14448  14448.0    100.0          return getattr(obj, method)(*args, **kwds)
        58
        59                                               # An AttributeError occurs if the object does not have
        60                                               # such a method in its class.
        61
        62                                               # A TypeError occurs if the object does have such a method
        63                                               # in its class, but its signature is not identical to that
        64                                               # of NumPy's. This situation has occurred in the case of
        65                                               # a downstream library like 'pandas'.
        66                                               except (AttributeError, TypeError):
        67                                                   return _wrapit(obj, method, *args, **kwds)

    Total time: 0.00591302 s
    File: pandas\core\generic.py
    Function: clip at line 4969

    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
      4969                                               def clip(self, lower=None, upper=None, axis=None, inplace=False,
      4970                                                        *args, **kwargs):
      4971                                                   """
       ...
      5021                                                   """
      5022         1           17     17.0      0.1          if isinstance(self, ABCPanel):
      5023                                                       raise NotImplementedError("clip is not supported yet for panels")
      5024
      5025         1           14     14.0      0.1          inplace = validate_bool_kwarg(inplace, 'inplace')
      5026
      5027         1           97     97.0      0.7          axis = nv.validate_clip_with_axis(axis, args, kwargs)
      5028
      5029                                                   # GH 17276
      5030                                                   # numpy doesn't like NaN as a clip value
      5031                                                   # so ignore
      5032         1          125    125.0      0.9          if np.any(pd.isnull(lower)):
      5033         1            2      2.0      0.0              lower = None
      5034         1           30     30.0      0.2          if np.any(pd.isnull(upper)):
      5035                                                       upper = None
      5036
      5037                                                   # GH 2747 (arguments were reversed)
      5038         1            2      2.0      0.0          if lower is not None and upper is not None:
      5039                                                       if is_scalar(lower) and is_scalar(upper):
      5040                                                           lower, upper = min(lower, upper), max(lower, upper)
      5041
      5042                                                   # fast-path for scalars
      5043         1            2      2.0      0.0          if ((lower is None or (is_scalar(lower) and is_number(lower))) and
      5044         1           32     32.0      0.2                  (upper is None or (is_scalar(upper) and is_number(upper)))):
      5045         1        14092  14092.0     97.8              return self._clip_with_scalar(lower, upper, inplace=inplace)
      5046
      5047                                                   result = self
      5048                                                   if lower is not None:
      5049                                                       result = result.clip_lower(lower, axis, inplace=inplace)
      5050                                                   if upper is not None:
      5051                                                       if inplace:
      5052                                                           result = self
      5053                                                       result = result.clip_upper(upper, axis, inplace=inplace)
      5054
      5055                                                   return result

    Total time: 0.00575753 s
    File: pandas\core\generic.py
    Function: _clip_with_scalar at line 4920

    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
      4920                                               def _clip_with_scalar(self, lower, upper, inplace=False):
      4921         1            2      2.0      0.0          if ((lower is not None and np.any(isna(lower))) or
      4922         1           28     28.0      0.2                  (upper is not None and np.any(isna(upper)))):
      4923                                                       raise ValueError("Cannot use an NA value as a clip threshold")
      4924
      4925         1          120    120.0      0.9          result = self.values
      4926         1         3525   3525.0     25.1          mask = isna(result)
      4927
      4928         1           86     86.0      0.6          with np.errstate(all='ignore'):
      4929         1            2      2.0      0.0              if upper is not None:
      4930         1         9314   9314.0     66.4                  result = np.where(result >= upper, upper, result)
      4931         1           61     61.0      0.4              if lower is not None:
      4932                                                           result = np.where(result <= lower, lower, result)
      4933         1          283    283.0      2.0          if np.any(mask):
      4934                                                       result[mask] = np.nan
      4935
      4936         1           78     78.0      0.6          axes_dict = self._construct_axes_dict()
      4937         1          532    532.0      3.8          result = self._constructor(result, **axes_dict).__finalize__(self)
      4938
      4939         1            2      2.0      0.0          if inplace:
      4940                                                       self._update_inplace(result)
      4941                                                   else:
      4942         1            1      1.0      0.0              return result

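There are still multiple function calls, for example isna and np.where, that take a significant amount of time, but overall this is at least comparable to the np.ndarray.clip time (this is in the regime where the timing difference is ~3 on my computer). The takeaway should probably be: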
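- Many NumPy functions just delegate to a method of the object passed in, so there can be huge differences when you pass in different objects.
- Profiling, especially line profiling, can be a great tool for finding where a performance difference comes from.
- Always make sure to test differently sized objects in such cases. You could be comparing constant factors that probably do not matter unless you process lots of small arrays.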
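
To act on the last point without IPython, a size sweep can be sketched with the standard-library timeit module. The two functions are the ones from the plotting code above. best_time, the chosen sizes, and the repeat counts are my own assumptions, kept small so the script finishes quickly. The printed ratios will vary with the machine and the library versions.

```python
import timeit

import numpy as np
import pandas as pd


def on_series(s):
    return np.clip(s, a_min=None, a_max=1)


def on_values_of_series(s):
    return np.clip(s.values, a_min=None, a_max=1)


def best_time(func, arg, repeat=3, number=20):
    """Hypothetical helper: best observed time per call, in seconds."""
    return min(timeit.repeat(lambda: func(arg), repeat=repeat, number=number)) / number


rng = np.random.default_rng(0)
ratios = {}
for size in (2**4, 2**10, 2**16, 2**20):
    s = pd.Series(rng.integers(0, 30, size=size))
    ratios[size] = best_time(on_series, s) / best_time(on_values_of_series, s)
    print(f"size={size:>8}  Series/values time ratio: {ratios[size]:.1f}")
```

Looking at the ratio across several orders of magnitude, rather than at a single size, is what separates a constant overhead from a per-element slowdown.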
Versions used:

    Python 3.6.3 64-bit on Windows 10
    Numpy 1.13.3
    Pandas 0.21.1