Python: sliding window mean, ignoring missing data

Posted on 2021-01-29 14:10:53

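I am currently working with an experimental time-series dataset that has missing values. I want to compute a sliding-window mean of this dataset over time while handling the NaN values. The correct way to do this, for me, is to sum the finite elements within each window and divide that sum by their count. This nonlinearity forces me to use a non-convolutional approach, so this part of the processing has become a severe time bottleneck. As a code example of what I am trying to accomplish, I present the following: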

import numpy as np

# Construct sample data
n = 50
n_miss = 20
win_size = 3
data = np.random.random(n)
# Mark random positions as missing (indices may repeat,
# so fewer than n_miss values can end up missing)
data[np.random.randint(0, n - 1, n_miss)] = np.nan

# Compute mean
result = np.zeros(data.size)
for count in range(data.size):
    # Integer division keeps the slice bounds integers
    part_data = data[max(count - (win_size - 1) // 2, 0): min(count + (win_size + 1) // 2, data.size)]
    mask = np.isfinite(part_data)
    if np.sum(mask) != 0:
        result[count] = np.sum(part_data[mask]) / np.sum(mask)
    else:
        result[count] = np.nan
print('Input:\t', data)
print('Output:\t', result)

Output:

Input:  [ 0.47431791  0.17620835  0.78495647  0.79894688  0.58334064  0.38068788
  0.87829696         nan  0.71589171         nan  0.70359557  0.76113969
  0.13694387  0.32126573  0.22730891         nan  0.35057169         nan
  0.89251851  0.56226354  0.040117           nan  0.37249799  0.77625334
         nan         nan         nan         nan  0.63227417  0.92781944
  0.99416471  0.81850753  0.35004997         nan  0.80743783  0.60828597
         nan  0.01410721         nan         nan  0.6976317          nan
  0.03875394  0.60924066  0.22998065         nan  0.34476729  0.38090961
         nan  0.2021964 ]
Output: [ 0.32526313  0.47849424  0.5867039   0.72241466  0.58765847  0.61410849
  0.62949242  0.79709433  0.71589171  0.70974364  0.73236763  0.53389305
  0.40644977  0.22850617  0.27428732  0.2889403   0.35057169  0.6215451
  0.72739103  0.49829968  0.30119027  0.20630749  0.57437567  0.57437567
  0.77625334         nan         nan  0.63227417  0.7800468   0.85141944
  0.91349722  0.7209074   0.58427875  0.5787439   0.7078619   0.7078619
  0.31119659  0.01410721  0.01410721  0.6976317   0.6976317   0.36819282
  0.3239973   0.29265842  0.41961066  0.28737397  0.36283845  0.36283845
  0.29155301  0.2021964 ]

Is it possible to produce this result with NumPy operations, without using a for loop?

1 Answer
  • 面试哥 (answered 2021-01-29)

    Here is a convolution-based approach using np.convolve:

    mask = np.isnan(data)
    K = np.ones(win_size,dtype=int)
    out = np.convolve(np.where(mask,0,data), K)/np.convolve(~mask,K)
    

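    Note that with the default mode of np.convolve (mode='full'), the result has one extra element on each side for win_size = 3.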
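One way to avoid trimming by hand is to ask np.convolve for mode='same', which returns an output the same length as the input when win_size does not exceed data.size. The following is a minimal sketch under that assumption and for odd win_size only; the function name nan_moving_average and the sample values are made up for illustration and are not part of the original answer.

```python
import numpy as np

def nan_moving_average(data, win_size):
    """Centered moving average that ignores NaNs (sketch; assumes odd win_size)."""
    mask = np.isnan(data)
    kernel = np.ones(win_size)
    # Sum of the valid values in each window (NaNs contribute 0).
    sums = np.convolve(np.where(mask, 0.0, data), kernel, mode='same')
    # Number of valid values in each window.
    counts = np.convolve((~mask).astype(float), kernel, mode='same')
    # Windows without any valid value give 0/0 = nan; silence that warning.
    with np.errstate(invalid='ignore', divide='ignore'):
        return sums / counts

data = np.array([1.0, np.nan, 3.0, np.nan, np.nan, np.nan, 7.0])

# The default mode ('full') returns data.size + win_size - 1 elements.
full = np.convolve(np.where(np.isnan(data), 0.0, data), np.ones(3))
print(full.size, data.size)

print(nan_moving_average(data, 3))
```

Wrapping the division in np.errstate only hides the RuntimeWarning; windows that contain no valid value still come out as nan.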

    If you are working with 2D data, we can use SciPy's 2D convolution.

    Approaches:
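As a rough illustration of the 2D case, the same sum-over-count idea can be written with scipy.signal.convolve2d. This is only a sketch: the function name nan_moving_average_2d, the window shape, and the sample grid are assumptions made for the example, and the window dimensions are assumed to be odd.

```python
import numpy as np
from scipy.signal import convolve2d

def nan_moving_average_2d(data, win_shape):
    """Centered 2D window mean that ignores NaNs (sketch; assumes odd window dims)."""
    mask = np.isnan(data)
    kernel = np.ones(win_shape)
    # mode='same' keeps the input shape; the default zero-filled boundary
    # means cells outside the array add nothing to the sum or the count.
    sums = convolve2d(np.where(mask, 0.0, data), kernel, mode='same')
    counts = convolve2d((~mask).astype(float), kernel, mode='same')
    # Windows without any valid value give 0/0 = nan; silence that warning.
    with np.errstate(invalid='ignore', divide='ignore'):
        return sums / counts

grid = np.array([[1.0, np.nan, 3.0],
                 [4.0, 5.0, np.nan],
                 [np.nan, 8.0, 9.0]])
print(nan_moving_average_2d(grid, (3, 3)))
```

Because the boundary is zero-filled and the count is convolved with the same kernel, edge windows are averaged over the valid cells they actually cover.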

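    def original_app(data, win_size):
        # Compute mean with an explicit loop (reference implementation)
        result = np.zeros(data.size)
        for count in range(data.size):
            # Integer division keeps the slice bounds integers
            part_data = data[max(count - (win_size - 1) // 2, 0):
                             min(count + (win_size + 1) // 2, data.size)]
            mask = np.isfinite(part_data)
            if np.sum(mask) != 0:
                result[count] = np.sum(part_data[mask]) / np.sum(mask)
            else:
                result[count] = np.nan
        return result
    
    def numpy_app(data, win_size):
        mask = np.isnan(data)
        K = np.ones(win_size, dtype=int)
        # Sum of non-NaN values per window divided by the count of non-NaN values;
        # windows with no valid values give 0/0 = nan (and a RuntimeWarning)
        out = np.convolve(np.where(mask, 0, data), K) / np.convolve(~mask, K)
        return out[1:-1]  # Slice out the one extra element on each side (valid for win_size = 3)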
    

    Sample run:

    In [118]: #Construct sample data
         ...: n = 50
         ...: n_miss = 20
         ...: win_size = 3
         ...: data= np.random.random(50)
         ...: data[np.random.randint(0,n-1, n_miss)] = np.nan
         ...:
    
    In [119]: original_app(data, win_size = 3)
    Out[119]: 
    array([ 0.88356487,  0.86829731,  0.85249541,  0.83776219,         nan,
                   nan,  0.61054015,  0.63111926,  0.63111926,  0.65169837,
            0.1857301 ,  0.58335324,  0.42088104,  0.5384565 ,  0.31027752,
            0.40768907,  0.3478563 ,  0.34089655,  0.55462903,  0.71784816,
            0.93195716,         nan,  0.41635575,  0.52211653,  0.65053379,
            0.76762282,  0.72888574,  0.35250449,  0.35250449,  0.14500637,
            0.06997668,  0.22582318,  0.18621848,  0.36320784,  0.19926647,
            0.24506199,  0.09983572,  0.47595439,  0.79792941,  0.5982114 ,
            0.42389375,  0.28944089,  0.36246113,  0.48088139,  0.71105449,
            0.60234163,  0.40012839,  0.45100475,  0.41768466,  0.41768466])
    
    In [120]: numpy_app(data, win_size = 3)
    __main__:36: RuntimeWarning: invalid value encountered in divide
    Out[120]: 
    array([ 0.88356487,  0.86829731,  0.85249541,  0.83776219,         nan,
                   nan,  0.61054015,  0.63111926,  0.63111926,  0.65169837,
            0.1857301 ,  0.58335324,  0.42088104,  0.5384565 ,  0.31027752,
            0.40768907,  0.3478563 ,  0.34089655,  0.55462903,  0.71784816,
            0.93195716,         nan,  0.41635575,  0.52211653,  0.65053379,
            0.76762282,  0.72888574,  0.35250449,  0.35250449,  0.14500637,
            0.06997668,  0.22582318,  0.18621848,  0.36320784,  0.19926647,
            0.24506199,  0.09983572,  0.47595439,  0.79792941,  0.5982114 ,
            0.42389375,  0.28944089,  0.36246113,  0.48088139,  0.71105449,
            0.60234163,  0.40012839,  0.45100475,  0.41768466,  0.41768466])
    

    Runtime test:

    In [122]: #Construct sample data
         ...: n = 50000
         ...: n_miss = 20000
         ...: win_size = 3
         ...: data= np.random.random(n)
         ...: data[np.random.randint(0,n-1, n_miss)] = np.nan
         ...:
    
    In [123]: %timeit original_app(data, win_size = 3)
    1 loops, best of 3: 1.51 s per loop
    
    In [124]: %timeit numpy_app(data, win_size = 3)
    1000 loops, best of 3: 1.09 ms per loop
    
    In [125]: import pandas as pd
    
    # @jdehesa's pandas solution
    In [126]: %timeit pd.Series(data).rolling(window=3, min_periods=1).mean()
    100 loops, best of 3: 3.34 ms per loop
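One caveat about the pandas timing: rolling() uses a trailing window by default (center=False), so its output is not aligned with the centered window of the two functions above. Passing center=True should give the centered version. The sketch below uses a small made-up array to show the difference; it is an illustration, not part of the original benchmark.

```python
import numpy as np
import pandas as pd

data = np.array([1.0, np.nan, 3.0, np.nan, np.nan, np.nan, 7.0])
series = pd.Series(data)

# Trailing window (pandas default): position i averages data[i-2 .. i].
trailing = series.rolling(window=3, min_periods=1).mean().to_numpy()

# Centered window: position i averages data[i-1 .. i+1],
# matching the loop-based and convolution-based versions.
centered = series.rolling(window=3, min_periods=1, center=True).mean().to_numpy()

print(trailing)
print(centered)
```

With min_periods=1, a window needs only one non-NaN value to produce a result, and windows that hold nothing but NaN stay NaN.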
    

