Can I perform a dynamic cumsum of rows in pandas?
If I have the following dataframe, derived like so:

    df = pd.DataFrame(np.random.randint(0, 10, size=(10, 1)))

       0
    0  0
    1  2
    2  8
    3  1
    4  0
    5  0
    6  7
    7  0
    8  2
    9  2

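Is there an efficient way to `cumsum` rows with a limit so that, each time the limit is reached, a new `cumsum` starts? After each limit is reached (however many rows that takes), a row is created with the total cumulative sum.

Below I have created an example of a function that does this, but it is very slow, especially when the dataframe becomes very large. I don't like that my function loops, and I am looking for a way to make it faster (I guess a way without a loop).
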
    def foo(df, max_value):
        last_value = 0
        storage = []
        for index, row in df.iterrows():
            this_value = np.nansum([row[0], last_value])
            if this_value >= max_value:
                storage.append((index, this_value))
                this_value = 0
            last_value = this_value
        return storage

If you run my function like so: foo(df, 5), then in the above context it returns:

        0
    2  10
    6   8

-
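The loop cannot be avoided, but it can be compiled with `numba`'s `njit` (note that without `parallel=True`, `prange` behaves like a plain `range`, so the loop still runs sequentially):
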
    from numba import njit, prange

    @njit
    def dynamic_cumsum(seq, index, max_value):
        cumsum = []
        running = 0
        for i in prange(len(seq)):
            if running > max_value:
                cumsum.append([index[i], running])
                running = 0
            running += seq[i]
        cumsum.append([index[-1], running])
        return cumsum

The index is required here, assuming your index is not numeric/monotonically increasing.

    %timeit foo(df, 5)
    1.24 ms ± 41.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

    %timeit dynamic_cumsum(df.iloc(axis=1)[0].values, df.index.values, 5)
    77.2 µs ± 4.01 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

If the index is of `Int64Index` type, you can shorten this to:

    @njit
    def dynamic_cumsum2(seq, max_value):
        cumsum = []
        running = 0
        for i in prange(len(seq)):
            if running > max_value:
                cumsum.append([i, running])
                running = 0
            running += seq[i]
        cumsum.append([i, running])
        return cumsum

    lst = dynamic_cumsum2(df.iloc(axis=1)[0].values, 5)
    pd.DataFrame(lst, columns=['A', 'B']).set_index('A')

        B
    A
    3  10
    7   8
    9   4

    %timeit foo(df, 5)
    1.23 ms ± 30.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

    %timeit dynamic_cumsum2(df.iloc(axis=1)[0].values, 5)
    71.4 µs ± 1.4 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

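Note that the two compiled functions above test `running > max_value` before adding the current element and always append the trailing partial sum, so their output (rows 3, 7 and 9) differs from what `foo` returns (rows 2 and 6). If the goal is to reproduce `foo` exactly, the comparison can be moved after the addition. The following is a minimal pure-Python sketch of that ordering, not the answer's own code: the name `cumsum_with_reset` is hypothetical, it assumes the column contains no NaN values (unlike `foo`, which uses `np.nansum`), and it reports positions rather than index labels.

```python
def cumsum_with_reset(values, max_value):
    """Return (position, running_total) each time the total reaches max_value."""
    out = []
    running = 0
    for i, v in enumerate(values):
        running += v                # add the current element first ...
        if running >= max_value:    # ... then compare, mirroring foo's `>=`
            out.append((i, running))
            running = 0             # start a new cumulative sum
    return out


# The column from the question, written out as a plain list.
data = [0, 2, 8, 1, 0, 0, 7, 0, 2, 2]
print(cumsum_with_reset(data, 5))  # [(2, 10), (6, 8)]
```

The same loop body should be usable under `@njit` when `values` is a NumPy array such as `df.iloc[:, 0].values`, but that is untested here.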
`njit` Functions Performance

    perfplot.show(
        setup=lambda n: pd.DataFrame(np.random.randint(0, 10, size=(n, 1))),
        kernels=[
            lambda df: list(cumsum_limit_nb(df.iloc[:, 0].values, 5)),
            lambda df: dynamic_cumsum2(df.iloc[:, 0].values, 5)
        ],
        labels=['cumsum_limit_nb', 'dynamic_cumsum2'],
        n_range=[2**k for k in range(0, 17)],
        xlabel='N',
        logx=True,
        logy=True,
        equality_check=None  # TODO - update when @jpp adds in the final `yield`
    )

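The log-log plot shows that the generator function is faster for larger inputs.

One possible explanation is that, as N increases, the overhead of appending to a growing list in `dynamic_cumsum2` becomes prominent, while `cumsum_limit_nb` only has to `yield`.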
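
To illustrate the difference between building a list and yielding, here is a minimal pure-Python sketch of a generator variant. It is not the `cumsum_limit_nb` function from the benchmark, whose source is not shown here; the name `cumsum_limit_gen` is hypothetical, and the sketch assumes NaN-free input and the add-then-compare ordering used by `foo`.

```python
def cumsum_limit_gen(values, max_value):
    """Lazily yield (position, running_total) whenever the total reaches max_value."""
    running = 0
    for i, v in enumerate(values):
        running += v
        if running >= max_value:
            yield i, running   # hand the pair to the caller; no list is grown here
            running = 0


data = [0, 2, 8, 1, 0, 0, 7, 0, 2, 2]

gen = cumsum_limit_gen(data, 5)
first = next(gen)   # only the first three elements have been consumed so far
rest = list(gen)    # materialise the remainder when it is actually needed
print(first, rest)  # (2, 10) [(6, 8)]
```

A generator leaves it to the caller to decide whether and when to materialise the results, which is why the benchmark wraps `cumsum_limit_nb` in `list(...)` to make the comparison fair.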