Overlapping regular expression matches

Posted 2021-01-29 15:02:46

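I'm trying to write a regular expression that returns the string between AUG and one of the stop codons (UAG, UGA, or UAA) from the following RNA string:
AGCCAUGUAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAGUAGCAUCUCAG
All matches should be found, including overlapping ones.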

I tried several regular expressions and ended up with something like this:

matches = re.findall('(?=AUG)(\w+)(?=UAG|UGA|UAA)',"AGCCAUGUAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAGUAGCAUCUCAG")

Can you tell me what is wrong with my regex pattern?

1 answer
    面试哥 2021-01-29

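    Doing this with a single regular expression is actually quite difficult, because most use cases do not want overlapping matches. You can, however, do it with some simple iteration: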

    import re

    regex = re.compile(r'(?=AUG)(\w+)(?=UAG|UGA|UAA)')
    RNA = 'AGCCAUGUAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAGUAGCAUCUCAG'
    matches = []
    pos = 0
    # The assignment expression (:=) requires Python 3.8 or later.
    while (match := regex.search(RNA, pos)):
        matches.append(match)
        pos = match.start() + 1  # Resume just past this AUG so the next (overlapping) AUG is found.

    for m in matches:
        print(m.group(0))
    

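    This has some problems, though. What do you want returned for AUGUAGUGAUAA? Are there two strings to return, or one? As written, your regex cannot even stop at UAG: because \w+ is greedy, it always matches through UAGUGA and is cut off only at UAA. To counter this you could add the ? operator to make the quantifier lazy, but that approach will then be unable to capture the longer substrings.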
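
    To make the greedy/lazy difference concrete, here is a minimal sketch that runs the question's pattern and a lazy variant on the short sequence mentioned above. The variable names seq, greedy, and lazy are only for illustration.

```python
import re

seq = "AUGUAGUGAUAA"

# Greedy: \w+ grabs as much as it can, so the match runs up to the LAST stop codon.
greedy = re.search(r"(?=AUG)(\w+)(?=UAG|UGA|UAA)", seq)

# Lazy: \w+? grabs as little as it can, so the match ends before the FIRST stop codon.
lazy = re.search(r"(?=AUG)(\w+?)(?=UAG|UGA|UAA)", seq)

print(greedy.group(1))  # AUGUAGUGA
print(lazy.group(1))    # AUG
```

    Neither variant returns the match that ends before UGA, which is why a single pass with either quantifier misses some candidates.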

    Perhaps iterating over the string twice is the answer, but what if your RNA sequence contains AUGAUGUAGUGAUAA? What is the correct behavior there?
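
    One common way to get overlapping matches from a single regex is to wrap the whole pattern in a lookahead with a capture group. The sketch below applies that idea to the sequence above. It is an illustration rather than a complete solution, and the names STOP, lazy, and greedy are assumptions of this example.

```python
import re

seq = "AUGAUGUAGUGAUAA"
STOP = r"(?:UAG|UGA|UAA)"

# A lookahead consumes nothing, so finditer retries at every position and
# overlapping matches are reported; the capture group holds the matched text.
lazy = [m.group(1) for m in re.finditer(rf"(?=(AUG\w+?){STOP})", seq)]
greedy = [m.group(1) for m in re.finditer(rf"(?=(AUG\w+){STOP})", seq)]

print(lazy)    # ['AUGAUG', 'AUGUAG']
print(greedy)  # ['AUGAUGUAGUGA', 'AUGUAGUGA']
```

    Each variant yields at most one match per start position, so candidates such as AUGAUGUAG (the first AUG up to UGA) appear in neither list. Which of these strings count as correct is exactly the open question.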

    I would probably prefer a regex-free approach that walks over the string and its substrings:

    RNA = 'AGCCAUGUAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAGUAGCAUCUCAG'
    candidates = []
    start = 0

    # Collect the text that follows every AUG (:= requires Python 3.8 or later).
    while (start := RNA.find('AUG', start)) > -1:
        candidates.append(RNA[start+3:])
        start += 1

    matches = []

    for candidate in candidates:
        for terminator in ['UAG', 'UGA', 'UAA']:
            end = 1  # Start at 1 so every match contains at least one character.
            while (end := candidate.find(terminator, end)) > -1:
                matches.append(candidate[:end])
                end += 1

    for match in matches:
        print(match)
    

    This way you are guaranteed to get every match, no matter what.

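    If you need to track the position of each match, you can change the candidates data structure to hold tuples that keep the start position: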

    RNA = 'AGCCAUGUAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAGUAGCAUCUCAG'
    candidates = []
    start = 0

    # Keep each candidate together with the index where it starts in RNA.
    while (start := RNA.find('AUG', start)) > -1:
        candidates.append((RNA[start+3:], start+3))
        start += 1

    matches = []

    for text, offset in candidates:
        for terminator in ['UAG', 'UGA', 'UAA']:
            end = 1  # Start at 1 so every match contains at least one character.
            while (end := text.find(terminator, end)) > -1:
                matches.append((offset, offset + end, text[:end]))
                end += 1

    for match in matches:
        print("%d - %d: %s" % match)
    

    This prints:

    7 - 49: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAU
    7 - 85: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAG
    7 - 31: UAGCUAACUCAGGUUACAUGGGGA
    7 - 72: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCC
    7 - 76: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAA
    7 - 11: UAGC
    7 - 66: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAA
    27 - 49: GGGAUGACCCCGCGACUUGGAU
    27 - 85: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAG
    27 - 31: GGGA
    27 - 72: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCC
    27 - 76: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAA
    27 - 66: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAA
    33 - 49: ACCCCGCGACUUGGAU
    33 - 85: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAG
    33 - 72: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCC
    33 - 76: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAA
    33 - 66: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAA
    78 - 85: AUCCGAG
    

    Hell, with three more lines you can even sort the matches by where they sit in the RNA sequence:

    from operator import itemgetter
    matches.sort(key=itemgetter(1))
    matches.sort(key=itemgetter(0))
    

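    Put that before the final print loop and, with the positions zero-padded (for example with a "%03d - %03d: %s" format string), you get: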

    007 - 011: UAGC
    007 - 031: UAGCUAACUCAGGUUACAUGGGGA
    007 - 049: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAU
    007 - 066: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAA
    007 - 072: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCC
    007 - 076: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAA
    007 - 085: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAG
    027 - 031: GGGA
    027 - 049: GGGAUGACCCCGCGACUUGGAU
    027 - 066: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAA
    027 - 072: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCC
    027 - 076: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAA
    027 - 085: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAG
    033 - 049: ACCCCGCGACUUGGAU
    033 - 066: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAA
    033 - 072: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCC
    033 - 076: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAA
    033 - 085: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAG
    078 - 085: AUCCGAG
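
    For comparison, the same sorted list can be sketched more compactly by locating all start and stop codons with zero-width lookaheads and pairing them up. This is an alternative illustration, not the code that produced the output above. The names starts and stops are assumptions of this sketch.

```python
import re

RNA = ("AGCCAUGUAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUU"
       "GGAAUAAGCCUGAAUGAUCCGAGUAGCAUCUCAG")

# Zero-width lookaheads let finditer report every codon, even overlapping ones.
starts = [m.start() + 3 for m in re.finditer(r"(?=AUG)", RNA)]  # index just after each AUG
stops = [m.start() for m in re.finditer(r"(?=UAG|UGA|UAA)", RNA)]  # index of each stop codon

# Pair every start with every later stop. Both lists are already in ascending
# order, so the result comes out sorted by start position, then by end position.
matches = [(s, e, RNA[s:e]) for s in starts for e in stops if e > s]

for s, e, text in matches:
    print("%03d - %03d: %s" % (s, e, text))
```

    The condition e > s mirrors the end = 1 starting point in the loops above: a stop codon that begins immediately after AUG is not treated as a terminator for that start.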
    

