Clone element with BeautifulSoup

Posted on 2021-01-29 14:55:21

I have to copy a part of one document to another, but I don’t want to modify
the document I copy from.

If I use .extract(), it removes the element from the tree. If I just append
the selected element, as in document2.append(document1.tag), it is still
removed from document1.

As I use real files I can just not save document1 after modification, but is
there any way to do this without corrupting a document?
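
The behaviour described above can be reproduced with a minimal sketch. The markup, the 'someid' id, and the choice of 'html.parser' are made up for illustration; only the move-on-append behaviour comes from the question.

```python
from bs4 import BeautifulSoup

# Two tiny illustrative documents (not from the original question).
document1 = BeautifulSoup('<body><div id="someid">text</div></body>', 'html.parser')
document2 = BeautifulSoup('<body></body>', 'html.parser')

div = document1.find('div', id='someid')
# append() re-parents the tag: it is detached from document1 first.
document2.body.append(div)

moved_out = document1.find('div', id='someid') is None  # gone from the source
moved_in = document2.body.div is div                     # same object, new parent
```

Both flags end up true, which is why a separate copy is needed to leave document1 untouched.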

1 answer
  • 面试哥, 2021-01-29

    There is no native clone function in BeautifulSoup in versions before 4.4
    (released July 2015); you’d have to create a deep copy yourself, which is
    tricky as each element maintains links to the rest of the tree.

    To clone an element and all its descendants, you’d have to copy each
    element’s attributes and rebuild the parent-child relationships; this has to
    happen recursively. This is best done by not copying the relationship
    attributes and instead re-seating each recursively cloned element:

    from bs4 import Tag, NavigableString
    
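    def clone(el):
        # Strings (and their subclasses) are copied by re-instantiating
        # the same type with the same text.
        if isinstance(el, NavigableString):
            return type(el)(el)
    
        # Build a new, parentless tag; tree links are not copied.
        copy = Tag(None, el.builder, el.name, el.namespace, el.nsprefix)
        # work around bug where there is no builder set
        # https://bugs.launchpad.net/beautifulsoup/+bug/1307471
        copy.attrs = dict(el.attrs)
        for attr in ('can_be_empty_element', 'hidden'):
            setattr(copy, attr, getattr(el, attr))
        # append() re-seats each cloned child under the new tag.
        for child in el.contents:
            copy.append(clone(child))
        return copy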
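
The re-seating idea can also be seen in isolation with a toy tree that does not depend on BeautifulSoup. This is only a sketch: the Node class and its append() method are hypothetical stand-ins for illustration, not part of bs4.

```python
class Node:
    """Hypothetical stand-in for a tree element; not part of bs4."""

    def __init__(self, name, attrs=None):
        self.name = name
        self.attrs = dict(attrs or {})  # own copy of the attributes
        self.parent = None
        self.children = []

    def append(self, child):
        # Moving semantics: detach from any previous parent first.
        if child.parent is not None:
            child.parent.children.remove(child)
        child.parent = self
        self.children.append(child)


def clone(node):
    # Copy only the node's own data; the parent link is left unset.
    new_node = Node(node.name, node.attrs)
    for child in node.children:
        # append() re-seats each cloned child under the new node.
        new_node.append(clone(child))
    return new_node


root = Node('div', {'id': 'someid'})
root.append(Node('p'))
duplicate = clone(root)
```

Because the relationship fields are never copied, the duplicate starts out detached and the original tree keeps all of its own links.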
    

    This approach is somewhat sensitive to the BeautifulSoup version: I tested
    it with 4.3, and future versions may add attributes that also need to be
    copied.

    You could also monkeypatch this functionality into BeautifulSoup:

    from bs4 import Tag, NavigableString
    
    
    def tag_clone(self):
        copy = type(self)(None, self.builder, self.name, self.namespace, 
                          self.nsprefix)
        # work around bug where there is no builder set
        # https://bugs.launchpad.net/beautifulsoup/+bug/1307471
        copy.attrs = dict(self.attrs)
        for attr in ('can_be_empty_element', 'hidden'):
            setattr(copy, attr, getattr(self, attr))
        for child in self.contents:
            copy.append(child.clone())
        return copy
    
    
    Tag.clone = tag_clone
    NavigableString.clone = lambda self: type(self)(self)
    

    letting you call .clone() on elements directly:

    document2.body.append(document1.find('div', id='someid').clone())
    

    My feature request to the BeautifulSoup project was accepted and tweaked to
    use the copy.copy() function; now that BeautifulSoup 4.4 is released you can
    use that version (or newer) and do:

    import copy
    
    document2.body.append(copy.copy(document1.find('div', id='someid')))
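
As a self-contained check of the copy.copy() approach, the sketch below builds two small documents and confirms that the source is left intact. It assumes BeautifulSoup 4.4 or newer; the markup, the 'someid' id, and the 'html.parser' choice are illustrative, not taken from the original answer.

```python
import copy

from bs4 import BeautifulSoup

# Two tiny illustrative documents (made up for this sketch).
document1 = BeautifulSoup('<body><div id="someid"><p>hello</p></div></body>',
                          'html.parser')
document2 = BeautifulSoup('<body></body>', 'html.parser')

original = document1.find('div', id='someid')
duplicate = copy.copy(original)   # detached copy of the tag and its children

document2.body.append(duplicate)  # only the copy is re-parented

source_after = str(document1)     # source document, unchanged
target_after = str(document2)     # target document, now containing the copy
```

Here the original div stays inside document1, while document2 receives a separate object with the same markup.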
    

