Can I perform a dynamic cumsum of rows in pandas?

Posted on 2021-01-29 18:34:48

If I have the following dataframe, derived like so: df = pd.DataFrame(np.random.randint(0, 10, size=(10, 1)))

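Is there an efficient way to cumsum rows with a limit, so that each time the
limit is reached a new cumsum starts? After each limit is reached (however
many rows that takes), a row is created with the total cumsum.

Below I have created an example of a function that does this, but it is very
slow, especially when the dataframe becomes very large. I don't like that my
function loops, and I am looking for a way to make it faster (I guess a way
without a loop).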

def foo(df, max_value):
    last_value = 0
    storage = []
    for index, row in df.iterrows():
        this_value = np.nansum([row[0], last_value])
        if this_value >= max_value:
            storage.append((index, this_value))
            this_value = 0
        last_value = this_value
    return storage

If you run my function like so: foo(df, 5), in the above context it
returns:

   0
2  10
6  8

1 answer

面试哥 2021-01-29

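The loop cannot be avoided, but it can be sped up by compiling it with numba's
njit. Note that prange behaves like a plain range unless parallel=True is
passed to njit, and because each iteration depends on the running total from
the previous one, the speedup here comes from compilation rather than from
parallel execution: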

from numba import njit, prange

@njit
def dynamic_cumsum(seq, index, max_value):
    cumsum = []
    running = 0
    for i in prange(len(seq)):
        if running > max_value:
            cumsum.append([index[i], running])
            running = 0
        running += seq[i] 
    cumsum.append([index[-1], running])

    return cumsum

The index is required here, assuming your index is not numeric/monotonically
increasing.
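
To make the role of the index argument concrete, here is a plain-Python mirror of dynamic_cumsum (no numba, so it runs anywhere). The name dynamic_cumsum_py, the values, and the labels are all made up for illustration; only the loop body follows the function above.

```python
def dynamic_cumsum_py(seq, index, max_value):
    """Plain-Python mirror of dynamic_cumsum, for illustration only."""
    out = []
    running = 0
    for i in range(len(seq)):
        # The check runs before the current value is added, so the label
        # recorded is that of the row after the one that crossed the limit.
        if running > max_value:
            out.append((index[i], running))
            running = 0
        running += seq[i]
    # Whatever is left over is reported against the last label.
    out.append((index[-1], running))
    return out


# Hypothetical data: a descending integer index, so positions and labels differ.
values = [4, 0, 6, 1, 2, 1, 4, 3, 0, 1]
labels = [100, 90, 80, 70, 60, 50, 40, 30, 20, 10]

rows = dynamic_cumsum_py(values, labels, 5)
print(rows)  # [(70, 10), (30, 8), (10, 4)]
```

With a default 0..n-1 index the labels and the loop positions coincide, which is why the index argument can be dropped in that case.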

%timeit foo(df, 5)
1.24 ms ± 41.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit dynamic_cumsum(df.iloc(axis=1)[0].values, df.index.values, 5)
77.2 µs ± 4.01 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

If the index is of Int64Index type, you can shorten this to:

@njit
def dynamic_cumsum2(seq, max_value):
    cumsum = []
    running = 0
    for i in prange(len(seq)):
        if running > max_value:
            cumsum.append([i, running])
            running = 0
        running += seq[i] 
    cumsum.append([i, running])

    return cumsum

lst = dynamic_cumsum2(df.iloc(axis=1)[0].values, 5)
pd.DataFrame(lst, columns=['A', 'B']).set_index('A')

    B
A    
3  10
7   8
9   4
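
Note that this table is not the same as the output shown in the question: the limit is tested with > before the current value is added, so each row is labelled one position later, and a final row holding the leftover sum is appended. If the question's exact output is wanted, one option is to test with >= right after the addition. The sketch below is plain Python with a hypothetical function name and hypothetical data; it assumes there are no NaN values and has not been run under njit.

```python
def cumsum_reset_inclusive(seq, max_value):
    """Emit (position, total) as soon as the running total reaches max_value."""
    out = []
    running = 0
    for i, value in enumerate(seq):
        running += value
        # Test after adding, with >=, as the question's foo does.
        if running >= max_value:
            out.append((i, running))
            running = 0
    # No trailing row: a leftover total below the limit is dropped, as in foo.
    return out


# Hypothetical data chosen so the result matches the question's sample output.
values = [4, 0, 6, 1, 2, 1, 4, 3, 0, 1]
result = cumsum_reset_inclusive(values, 5)
print(result)  # [(2, 10), (6, 8)]
```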



%timeit foo(df, 5)
1.23 ms ± 30.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit dynamic_cumsum2(df.iloc(axis=1)[0].values, 5)
71.4 µs ± 1.4 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

njit Functions Performance

perfplot.show(
    setup=lambda n: pd.DataFrame(np.random.randint(0, 10, size=(n, 1))),
    kernels=[
        lambda df: list(cumsum_limit_nb(df.iloc[:, 0].values, 5)),
        lambda df: dynamic_cumsum2(df.iloc[:, 0].values, 5)
    ],
    labels=['cumsum_limit_nb', 'dynamic_cumsum2'],
    n_range=[2**k for k in range(0, 17)],
    xlabel='N',
    logx=True,
    logy=True,
    equality_check=None # TODO - update when @jpp adds in the final `yield`
)

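The log-log plot shows that the generator function is faster for larger
inputs. A possible explanation is that, as N increases, the overhead of
appending to a growing list in dynamic_cumsum2 becomes prominent, while
cumsum_limit_nb only has to yield.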
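
cumsum_limit_nb comes from another answer and is not reproduced here. As a rough guess at its shape, the generator below applies the same rule as dynamic_cumsum2 but yields each result instead of appending it to a list. It is plain Python under a hypothetical name, and the final yield for the leftover sum is an assumption added so that the output lines up with dynamic_cumsum2.

```python
def cumsum_limit_sketch(seq, max_value):
    """Generator variant: yield (position, total) instead of building a list."""
    running = 0
    for i, value in enumerate(seq):
        # Same rule as dynamic_cumsum2: check first, then add.
        if running > max_value:
            yield i, running
            running = 0
        running += value
    # Assumed final yield for the leftover total (skipped for empty input).
    if len(seq):
        yield len(seq) - 1, running


# Hypothetical data; the caller decides whether to materialise the results.
values = [4, 0, 6, 1, 2, 1, 4, 3, 0, 1]
pairs = list(cumsum_limit_sketch(values, 5))
print(pairs)  # [(3, 10), (7, 8), (9, 4)]
```

Note that the benchmark above wraps the generator in list(), so the results are still collected into a list by the caller in the end.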
