Python: simple merging of lists based on intersections

Posted on 2021-01-29 16:48:29

Consider some lists of integers:

#--------------------------------------
0 [0,1,3]
1 [1,0,3,4,5,10,...]
2 [2,8]
3 [3,1,0,...]
...
n []
#--------------------------------------

The problem is to merge all lists having at least one common element. So the result for only the part shown above would be:

#--------------------------------------
0 [0,1,3,4,5,10,...]
2 [2,8]
#--------------------------------------

What is the most efficient way to do this on large data (where the elements are just numbers)?
Is a tree structure something worth considering? I do the job now by converting the lists to sets and iterating over intersections, but it is slow! Furthermore, I have a feeling that my approach is very elementary! In addition, the implementation lacks something (unknown), because some lists sometimes remain unmerged! Having said that, if you propose a self-implementation, please be generous and provide a simple sample code [apparently Python is my favourite :)] or pseudo-code.
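
One natural tree-based structure for this task is a disjoint-set (union-find) forest over the numbers themselves: every number appearing in any list becomes a node, all numbers within one list are unioned together, and the merged groups are read off the roots. A minimal sketch of that idea (merge_union_find is an illustrative name, not code from this post):

#--------------------------------------
def merge_union_find(lsts):
    parent = {}

    def find(x):
        # walk to the root of x, compressing the path as we go
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for lst in lsts:
        for x in lst:
            parent.setdefault(x, x)
        # chaining consecutive elements is enough to connect the whole list
        for a, b in zip(lst, lst[1:]):
            union(a, b)

    groups = {}
    for x in parent:
        groups.setdefault(find(x), set()).add(x)
    return [list(s) for s in groups.values()]
#--------------------------------------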
Update 1: Here is the code I was using:

#--------------------------------------
lsts = [[0,1,3],
        [1,0,3,4,5,10,11],
        [2,8],
        [3,1,0,16]]
#--------------------------------------

And the function is ( BUGGY!! ):

#--------------------------------------
def merge(lsts):
    sts = [set(l) for l in lsts]
    i = 0
    while i < len(sts):
        j = i+1
        while j < len(sts):
            if len(sts[i].intersection(sts[j])) > 0:
                sts[i] = sts[i].union(sts[j])
                sts.pop(j)
                # NOTE: sets between i+1 and j were compared against the
                # *old* sts[i] and are never re-checked -- see the note below
            else: j += 1                        #---corrected
        i += 1
    lst = [list(s) for s in sts]
    return lst
#--------------------------------------

The result is:

#--------------------------------------
>>> merge(lsts)
[[0, 1, 3, 4, 5, 10, 11, 16], [8, 2]]
#--------------------------------------
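
The likely reason some lists remain unmerged in the function above: when sts[j] is absorbed into sts[i], the enlarged sts[i] can now intersect a set that the inner loop has already passed, and that pair is never compared again. For example, [[1,2],[3,4],[2,3]] comes out as [[1, 2, 3], [3, 4]]. A minimal corrected sketch (merge_fixed is an illustrative name) simply restarts the inner scan after every merge:

#--------------------------------------
def merge_fixed(lsts):
    sts = [set(l) for l in lsts]
    i = 0
    while i < len(sts):
        j = i + 1
        while j < len(sts):
            if sts[i] & sts[j]:
                sts[i] |= sts[j]
                sts.pop(j)
                j = i + 1  # re-scan: the enlarged set may hit earlier survivors
            else:
                j += 1
        i += 1
    return [list(s) for s in sts]
#--------------------------------------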

Update 2: In my experience, the code given below by Niklas Baumstark turned out to be a bit faster for simple cases. The method given by "Hooked" has not been tested yet, since it is a completely different approach (by the way, it looks interesting). The testing procedure for all of these could be really hard or impossible to be sure of the results. The real data set I am going to use is so large and complex that it is impossible to trace any error just by repeating the runs. That is, I need to be 100% satisfied with the reliability of the method before pushing it into its place within a large code as a module. Simply, for now, Niklas's method is faster, and the answer for simple sets is correct, of course.
However, how can I be sure that it works well for a really big data set? Since I will not be able to trace the errors visually!
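
One way to gain confidence without tracing errors visually is to check the defining properties of the result mechanically on the real data (a sketch; check_merge is an illustrative name): the output sets must be non-empty and pairwise disjoint, they must contain exactly the input elements, and every input list must end up inside a single output set.

#--------------------------------------
def check_merge(lsts, merged):
    sets = [set(m) for m in merged]
    # 1. output sets must be non-empty and pairwise disjoint
    seen = set()
    for s in sets:
        assert s and not (s & seen), "overlapping or empty output set"
        seen |= s
    # 2. no element may be lost or invented
    all_elems = set()
    for lst in lsts:
        all_elems.update(lst)
    assert seen == all_elems, "element mismatch"
    # 3. each non-empty input list must lie inside one output set
    for lst in lsts:
        assert not lst or any(set(lst) <= s for s in sets), "input list was split"
#--------------------------------------

Note that these checks catch under-merging (the bug above) but not over-merging; comparing against an independent implementation covers that case.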

Update 3: Note that the reliability of the method is much more important than its speed for this problem. I will hopefully be able to translate the Python code to Fortran in the end, for maximum performance.

Update 4:
There are many interesting points in this post, together with generously given answers and constructive comments. I would recommend reading all of them thoroughly. Please accept my appreciation for the development of the question, the amazing answers, and the constructive comments and discussion.

1 Answer
  • 面试哥 2021-01-29

    My attempt:

    def merge(lsts):
        sets = [set(lst) for lst in lsts if lst]
        merged = True
        while merged:  # keep sweeping until a full pass performs no merge
            merged = False
            results = []
            while sets:
                common, rest = sets[0], sets[1:]
                sets = []
                for x in rest:
                    if x.isdisjoint(common):
                        sets.append(x)
                    else:
                        merged = True
                        common |= x  # absorb every set that touches `common`
                results.append(common)
            sets = results
        return sets
    
    lst = [[65, 17, 5, 30, 79, 56, 48, 62],
           [6, 97, 32, 93, 55, 14, 70, 32],
           [75, 37, 83, 34, 9, 19, 14, 64],
           [43, 71],
           [],
           [89, 49, 1, 30, 28, 3, 63],
           [35, 21, 68, 94, 57, 94, 9, 3],
           [16],
           [29, 9, 97, 43],
           [17, 63, 24]]
    print(merge(lst))
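
    Applied to the lists from the question, this yields the expected partition (a quick sanity check; the output is sorted for readability):

    lsts = [[0, 1, 3], [1, 0, 3, 4, 5, 10, 11], [2, 8], [3, 1, 0, 16]]
    print([sorted(s) for s in merge(lsts)])
    # [[0, 1, 3, 4, 5, 10, 11, 16], [2, 8]]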
    

    Benchmark:

    import random
    
    # adapt parameters to your own usage scenario
    class_count = 50
    class_size = 1000
    list_count_per_class = 100
    large_list_sizes = list(range(100, 1000))
    small_list_sizes = list(range(0, 100))
    large_list_probability = 0.5
    
    if False:  # change to True to generate the test data file (takes a while)
        with open("/tmp/test.txt", "w") as f:
            lists = []
            classes = [
                range(class_size * i, class_size * (i + 1)) for i in range(class_count)
            ]
            for c in classes:
                # distribute each class across list_count_per_class lists
                for i in range(list_count_per_class):
                    lst = []
                    if random.random() < large_list_probability:
                        size = random.choice(large_list_sizes)
                    else:
                        size = random.choice(small_list_sizes)
                    nums = set(c)
                    for j in range(size):
                        x = random.choice(list(nums))
                        lst.append(x)
                        nums.remove(x)
                    random.shuffle(lst)
                    lists.append(lst)
            random.shuffle(lists)
            for lst in lists:
                f.write(" ".join(str(x) for x in lst) + "\n")
    
    setup = """
    # Niklas'
    def merge_niklas(lsts):
        sets = [set(lst) for lst in lsts if lst]
        merged = 1
        while merged:
            merged = 0
            results = []
            while sets:
                common, rest = sets[0], sets[1:]
                sets = []
                for x in rest:
                    if x.isdisjoint(common):
                        sets.append(x)
                    else:
                        merged = 1
                        common |= x
                results.append(common)
            sets = results
        return sets
    
    # Rik's
    def merge_rik(data):
        sets = (set(e) for e in data if e)
        results = [next(sets)]
        for e_set in sets:
            to_update = []
            for i, res in enumerate(results):
                if not e_set.isdisjoint(res):
                    to_update.insert(0, i)  # collect overlaps, highest index first

            if not to_update:
                results.append(e_set)
            else:
                # fold all overlapping results into the lowest-indexed one;
                # deleting from the highest index down keeps indices valid
                last = results[to_update.pop(-1)]
                for i in to_update:
                    last |= results[i]
                    del results[i]
                last |= e_set
        return results
    
    # katrielalex's
    def pairs(lst):
        i = iter(lst)
        first = prev = item = next(i)  # Python 3: next(i) instead of i.next()
        for item in i:
            yield prev, item
            prev = item
        yield item, first
    
    import networkx
    
    def merge_katrielalex(lsts):
        g = networkx.Graph()
        for lst in lsts:
            if not lst:
                continue  # pairs() cannot handle an empty list
            for edge in pairs(lst):
                g.add_edge(*edge)
        # newer networkx versions return a generator; materialize it
        return list(networkx.connected_components(g))
    
    # agf's (optimized)
    from collections import deque
    
    def merge_agf_optimized(lists):
        sets = deque(set(lst) for lst in lists if lst)
        results = []
        disjoint = 0
        current = sets.pop()
        while True:
            merged = False
            newsets = deque()
            # skip the tail of sets already known to be disjoint from current
            for _ in range(disjoint, len(sets)):
                this = sets.pop()
                if not current.isdisjoint(this):
                    current.update(this)
                    merged = True
                    disjoint = 0
                else:
                    newsets.append(this)
                    disjoint += 1
            if sets:
                newsets.extendleft(sets)
            if not merged:
                # current cannot grow any further; emit it and start the next
                results.append(current)
                try:
                    current = newsets.pop()
                except IndexError:
                    break
                disjoint = 0
            sets = newsets
        return results
    
    # agf's (simple)
    def merge_agf_simple(lists):
        newsets, sets = [set(lst) for lst in lists if lst], []
        while len(sets) != len(newsets):  # repeat until a pass changes nothing
            sets, newsets = newsets, []
            for aset in sets:
                for eachset in newsets:
                    if not aset.isdisjoint(eachset):
                        eachset.update(aset)
                        break
                else:
                    newsets.append(aset)
        return newsets
    
    # alexis'
    def merge_alexis(data):
        bins = list(range(len(data)))  # initialize each bin[n] == n (a list: it is mutated below)
        nums = dict()
    
        data = [set(m) for m in data]  # Convert to sets
        for r, row in enumerate(data):
            for num in row:
                if num not in nums:
                    # New number: tag it with a pointer to this row's bin
                    nums[num] = r
                    continue
                else:
                    dest = locatebin(bins, nums[num])
                    if dest == r:
                        continue  # already in the same bin
    
                    if dest > r:
                        dest, r = r, dest  # always merge into the smallest bin
    
                    data[dest].update(data[r])
                    data[r] = None
                    # Update our indices to reflect the move
                    bins[r] = dest
                    r = dest
    
        # Filter out the empty bins
        have = [m for m in data if m]
        return have
    
    def locatebin(bins, n):
        # follow bin pointers to the root bin (no path compression)
        while bins[n] != n:
            n = bins[n]
        return n
    
    lsts = []
    size = 0
    num = 0
    max = 0
    for line in open("/tmp/test.txt", "r"):
        lst = [int(x) for x in line.split()]
        size += len(lst)
        if len(lst) > max:
            max = len(lst)
        num += 1
        lsts.append(lst)
    """
    
    setup += """
    print("%i lists, {class_count} equally distributed classes, average size %i, max size %i" % (num, size / num, max))
    """.format(class_count=class_count)
    
    import timeit
    print("niklas")
    print(timeit.timeit("merge_niklas(lsts)", setup=setup, number=3))
    print("rik")
    print(timeit.timeit("merge_rik(lsts)", setup=setup, number=3))
    print("katrielalex")
    print(timeit.timeit("merge_katrielalex(lsts)", setup=setup, number=3))
    print("agf (1)")
    print(timeit.timeit("merge_agf_optimized(lsts)", setup=setup, number=3))
    print("agf (2)")
    print(timeit.timeit("merge_agf_simple(lsts)", setup=setup, number=3))
    print("alexis")
    print(timeit.timeit("merge_alexis(lsts)", setup=setup, number=3))
    

    These timings obviously depend on the specific parameters of the benchmark, like the number of classes, the number of lists, the list sizes, etc. Adapt those parameters to your usage scenario to get more helpful results.

    Below are some example outputs on my machine for different parameter sets. They show that all of the algorithms have their strengths and weaknesses, depending on the kind of input they get:

    =====================
    # many disjoint classes, large lists
    class_count = 50
    class_size = 1000
    list_count_per_class = 100
    large_list_sizes = list(range(100, 1000))
    small_list_sizes = list(range(0, 100))
    large_list_probability = 0.5
    =====================
    
    niklas
    5000 lists, 50 equally distributed classes, average size 298, max size 999
    4.80084705353
    rik
    5000 lists, 50 equally distributed classes, average size 298, max size 999
    9.49251699448
    katrielalex
    5000 lists, 50 equally distributed classes, average size 298, max size 999
    21.5317108631
    agf (1)
    5000 lists, 50 equally distributed classes, average size 298, max size 999
    8.61671280861
    agf (2)
    5000 lists, 50 equally distributed classes, average size 298, max size 999
    5.18117713928
    => alexis
    => 5000 lists, 50 equally distributed classes, average size 298, max size 999
    => 3.73504281044
    
    ===================
    # fewer classes, large lists
    class_count = 15
    class_size = 1000
    list_count_per_class = 300
    large_list_sizes = list(range(100, 1000))
    small_list_sizes = list(range(0, 100))
    large_list_probability = 0.5
    ===================
    
    niklas
    4500 lists, 15 equally distributed classes, average size 296, max size 999
    1.79993700981
    rik
    4500 lists, 15 equally distributed classes, average size 296, max size 999
    2.58237695694
    katrielalex
    4500 lists, 15 equally distributed classes, average size 296, max size 999
    19.5465381145
    agf (1)
    4500 lists, 15 equally distributed classes, average size 296, max size 999
    2.75445604324
    => agf (2)
    => 4500 lists, 15 equally distributed classes, average size 296, max size 999
    => 1.77850699425
    alexis
    4500 lists, 15 equally distributed classes, average size 296, max size 999
    3.23530197144
    
    ===================
    # fewer classes, smaller lists
    class_count = 15
    class_size = 1000
    list_count_per_class = 300
    large_list_sizes = list(range(100, 1000))
    small_list_sizes = list(range(0, 100))
    large_list_probability = 0.1
    ===================
    
    niklas
    4500 lists, 15 equally distributed classes, average size 95, max size 997
    0.773697137833
    rik
    4500 lists, 15 equally distributed classes, average size 95, max size 997
    1.0523750782
    katrielalex
    4500 lists, 15 equally distributed classes, average size 95, max size 997
    6.04466891289
    agf (1)
    4500 lists, 15 equally distributed classes, average size 95, max size 997
    1.20285701752
    => agf (2)
    => 4500 lists, 15 equally distributed classes, average size 95, max size 997
    => 0.714507102966
    alexis
    4500 lists, 15 equally distributed classes, average size 95, max size 997
    1.1286110878
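
    Beyond speed, the implementations can also be cross-checked against each other on the same generated data by normalising their outputs (a sketch; canonical is an illustrative helper, and the merge_* functions are assumed to be defined at module level as in the setup string above):

    def canonical(partition):
        # normalise a result so that different set/list orderings compare equal
        return sorted(tuple(sorted(s)) for s in partition)

    reference = canonical(merge_niklas(lsts))
    for f in (merge_rik, merge_katrielalex, merge_agf_optimized,
              merge_agf_simple, merge_alexis):
        assert canonical(f(lsts)) == reference, f.__name__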
    

