How to create a 2D array using numpy random.choice for each row?

Posted on 2021-01-29 18:43:14

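I am trying to create a 2D array of numpy random choices (six columns and many rows) in which the values are unique within each row and lie between 1 and 50. They do not need to be unique across the whole array.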

np.sort(np.random.choice(np.arange(1,50),size=(100,6),replace=False))

But this raises an error:

ValueError: Cannot take a larger sample than population when 'replace=False'

Is it possible to do this with a one-liner?
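For context, a minimal sketch of why the call above fails: with `replace=False`, `np.random.choice` treats the whole `size=(100, 6)` request as one sample of 600 values drawn without replacement, and the population only holds 49 values. The variable names below are illustrative only.

```python
import numpy as np

population = np.arange(1, 50)   # 49 values, 1..49; np.arange(1, 51) would include 50
requested = 100 * 6             # size=(100, 6) means 600 draws from one shared pool

error_message = None
try:
    np.random.choice(population, size=(100, 6), replace=False)
except ValueError as exc:
    # 600 unique values cannot come out of a pool of 49
    error_message = str(exc)
```

Uniqueness is enforced over the entire output, not per row, which is why the answers sample each row separately.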

Edit

Okay, I got the answers.

These are the results from the Jupyter %time cell magic:

#@James' solution
np.stack([np.random.choice(np.arange(1,50),size=6,replace=False) for i in range(1_000_000)])
Wall time: 25.1 s



#@Divakar's solution
np.random.rand(1_000_000, 50).argpartition(6,axis=1)[:,:6]+1
Wall time: 1.36 s



#@CoryKramer's solution
np.array([np.random.choice(np.arange(1, 50), size=6, replace=False) for _ in range(1_000_000)])
Wall time: 25.5 s
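As a rough explanation of why the `argpartition` one-liner is so much faster than the two loop-based versions: it draws one random key per candidate value in each row and keeps the column indices of the six smallest keys, which gives six distinct values per row without a Python-level loop. The sketch below spells this out. The function name `sample_rows`, its parameters, and the use of `np.random.default_rng` are assumptions made for illustration and are not part of the original solution.

```python
import numpy as np

def sample_rows(n_rows, n_pick=6, high=50, seed=None):
    """Return an (n_rows, n_pick) array; each row has distinct values in 1..high."""
    rng = np.random.default_rng(seed)
    # one independent random key per (row, candidate value)
    keys = rng.random((n_rows, high))
    # after partitioning at position n_pick, the first n_pick columns hold the
    # indices of the n_pick smallest keys of each row (in unspecified order)
    idx = np.argpartition(keys, n_pick, axis=1)[:, :n_pick]
    # shift from 0-based indices to 1..high and sort each row, as in the question
    return np.sort(idx + 1, axis=1)

rows = sample_rows(1000, seed=0)
```

Because the keys are independent and identically distributed, every six-element subset is equally likely in each row.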

I changed the dtypes of np.empty and np.random.randint in @Paul Panzer's solution because it did not work on my PC.

3.6.0 |Anaconda custom (64-bit)| (default, Dec 23 2016, 11:57:41) [MSC v.1900 64 bit (AMD64)]

The fastest is

def pp(n):
    draw = np.empty((n, 6), dtype=np.int64)
    # generating random numbers is expensive, so draw a large one and
    # make six out of one
    draw[:, 0] = np.random.randint(0, 50*49*48*47*46*45, (n,),dtype=np.uint64)
    draw[:, 1:] = np.arange(50, 45, -1)
    draw = np.floor_divide.accumulate(draw, axis=-1)
    draw[:, :-1] -= draw[:, 1:] * np.arange(50, 45, -1)
    # map the shorter ranges (:49, :48, :47) to the non-occupied
    # positions; this amounts to incrementing for each number on the
    # left that is not larger. the nasty bit: if due to incrementing
    # new numbers on the left are "overtaken" then for them we also
    # need to increment.
    for i in range(1, 6):
        coll = np.sum(draw[:, :i] <= draw[:, i, None], axis=-1)
        collidx = np.flatnonzero(coll)
        if collidx.size == 0:
            continue
        coll = coll[collidx]
        tot = coll
        while True:
            draw[collidx, i] += coll
            coll = np.sum(draw[collidx, :i] <= draw[collidx, i, None],  axis=-1)
            relidx = np.flatnonzero(coll > tot)
            if relidx.size == 0:
                break
            coll, tot = coll[relidx]-tot[relidx], coll[relidx]
            collidx = collidx[relidx]

    return draw + 1

#@Paul Panzer's solution
pp(1_000_000)
Wall time: 557 ms

Thank you all.

1 Answer
  • 面试哥
    面试哥 2021-01-29

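    Here is a constructive approach: draw the first number (50 choices), then the second (49 choices), and so on. For large sets it is quite competitive (pp in the table):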

    # n = 10
    # pp                    0.18564210 ms
    # Divakar               0.01960790 ms
    # James                 0.20074140 ms
    # CK                    0.17823420 ms
    # n = 1000
    # pp                    0.80046050 ms
    # Divakar               1.31817130 ms
    # James                18.93511460 ms
    # CK                   20.83670820 ms
    # n = 1000000
    # pp                  655.32905590 ms
    # Divakar            1352.44713990 ms
    # James             18471.08987370 ms
    # CK                18369.79808050 ms
    # pp     checking plausibility...
    #     var (exp obs) 208.333333333 208.363840259
    #     mean (exp obs) 25.5 25.5064865
    # Divakar     checking plausibility...
    #     var (exp obs) 208.333333333 208.21113972
    #     mean (exp obs) 25.5 25.499471
    # James     checking plausibility...
    #     var (exp obs) 208.333333333 208.313436938
    #     mean (exp obs) 25.5 25.4979035
    # CK     checking plausibility...
    #     var (exp obs) 208.333333333 208.169585249
    #     mean (exp obs) 25.5 25.49
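A small side note on the "exp" column: the plausibility check compares against a continuous uniform distribution on [0.5, 50.5], whose variance is 50²/12 ≈ 208.33. For the integers 1..50 the exact variance is (50² − 1)/12 = 208.25, so the observed values above are consistent with either figure. A minimal check, with illustrative variable names:

```python
import numpy as np

values = np.arange(1, 51)     # the integers 1..50
exact_mean = values.mean()    # mean of the discrete uniform distribution on 1..50
exact_var = values.var()      # population variance, equal to (50**2 - 1) / 12
```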
    

    Code including the benchmarks. The algorithm is a bit complicated because mapping to the free spots is awkward:

    import numpy as np
    import types
    from timeit import timeit
    
    def f_pp(n):
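        draw = np.empty((n, 6), dtype=np.int64)
        # generating random numbers is expensive, so draw a large one and
        # make six out of one; 50*49*48*47*46*45 does not fit into 32 bits,
        # so request 64-bit integers explicitly (the default int is 32-bit
        # on some platforms)
        draw[:, 0] = np.random.randint(0, 50*49*48*47*46*45, (n,), dtype=np.int64)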
        draw[:, 1:] = np.arange(50, 45, -1)
        draw = np.floor_divide.accumulate(draw, axis=-1)
        draw[:, :-1] -= draw[:, 1:] * np.arange(50, 45, -1)
        # map the shorter ranges (:49, :48, :47) to the non-occupied
        # positions; this amounts to incrementing for each number on the
        # left that is not larger. the nasty bit: if due to incrementing
        # new numbers on the left are "overtaken" then for them we also
        # need to increment.
        for i in range(1, 6):
            coll = np.sum(draw[:, :i] <= draw[:, i, None], axis=-1)
            collidx = np.flatnonzero(coll)
            if collidx.size == 0:
                continue
            coll = coll[collidx]
            tot = coll
            while True:
                draw[collidx, i] += coll
                coll = np.sum(draw[collidx, :i] <= draw[collidx, i, None], axis=-1)
                relidx = np.flatnonzero(coll > tot)
                if relidx.size == 0:
                    break
                coll, tot = coll[relidx]-tot[relidx], coll[relidx]
                collidx = collidx[relidx]
    
        return draw + 1
    
    def check_result(draw, name):
        print(name[2:], '    checking plausibility...')
        import scipy.stats
        assert all(len(set(row)) == 6 for row in draw)
        assert len(set(draw.ravel())) == 50
        print('    var (exp obs)', scipy.stats.uniform(0.5, 50).var(), draw.var())
        print('    mean (exp obs)', scipy.stats.uniform(0.5, 50).mean(), draw.mean())
    
    def f_Divakar(n):
        return np.random.rand(n, 50).argpartition(6,axis=1)[:,:6]+1
    
    def f_James(n):
        return np.stack([np.random.choice(np.arange(1,51),size=6,replace=False) for i in range(n)])
    
    def f_CK(n):
        return np.array([np.random.choice(np.arange(1, 51), size=6, replace=False) for _ in range(n)])
    
    for n in (10, 1_000, 1_000_000):
        print(f'n = {n}')
        for name, func in list(globals().items()):
            if not name.startswith('f_') or not isinstance(func, types.FunctionType):
                continue
            try:
                print("{:16s}{:16.8f} ms".format(name[2:], timeit(
                    'f(n)', globals={'f':func, 'n':n}, number=10)*100))
            except Exception:
                print("{:16s} apparently failed".format(name[2:]))
        if n >= 10000:
            for name, func in list(globals().items()):
                if name.startswith('f_') and isinstance(func, types.FunctionType):
                    check_result(func(n), name)
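To make the two steps inside `f_pp` easier to follow, here is a scalar, unvectorized sketch of the same idea for a single row. It is not the answer's code: the helper names `split_mixed_radix`, `map_to_free_slots`, and `draw_row` are hypothetical, and it uses Python's `random` module instead of NumPy. Step one splits one large random integer into six "digits" with ranges 50, 49, ..., 45. Step two reads each digit as a rank among the values that are still free.

```python
import random

RADICES = (50, 49, 48, 47, 46, 45)

def split_mixed_radix(x, radices=RADICES):
    """Split x in [0, 50*49*48*47*46*45) into six digits; digit k lies in range(radices[k])."""
    digits = []
    for r in radices:
        x, d = divmod(x, r)
        digits.append(d)
    return digits

def map_to_free_slots(digits):
    """Interpret each digit as a 0-based rank among the values not chosen so far."""
    chosen = []
    for value in digits:
        # walk the taken values in ascending order and step past each one that
        # is not larger than the current candidate
        for taken in sorted(chosen):
            if taken <= value:
                value += 1
        chosen.append(value)
    return chosen

def draw_row(rng=random):
    """Draw six distinct values in 1..50 from a single random integer."""
    total = 1
    for r in RADICES:
        total *= r
    x = rng.randrange(total)
    return [v + 1 for v in map_to_free_slots(split_mixed_radix(x))]

row = draw_row(random.Random(0))
```

The vectorized version in `f_pp` does the same bookkeeping for all rows at once, which is where the extra complexity in its `while` loop comes from.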
    

