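How can I replace an element with text in lxml?
With lxml's implementation of the ElementTree API, it is easy to remove a given element from an XML document entirely, but I can't see an easy way to consistently replace an element with some text. For example, given the following input: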
data = '''<everything>
<m>Some text before <r/></m>
<m><r/> and some text after.</m>
<m><r/></m>
<m>Text before <r/> and after</m>
<m><b/> Text after a sibling <r/> Text before a sibling<b/></m>
</everything>
'''
… you can easily remove every <r> element with:
from lxml import etree
f = etree.fromstring(data)
for r in f.xpath('//r'):
    # Note: lxml's remove() also drops the element's tail text.
    r.getparent().remove(r)
print(etree.tostring(f, pretty_print=True).decode())
However, how would you go about replacing each element with text, to get this output:
<everything>
<m>Some text before DELETED</m>
<m>DELETED and some text after.</m>
<m>DELETED</m>
<m>Text before DELETED and after</m>
<m><b/>Text after a sibling DELETED Text before a sibling<b/></m>
</everything>
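It seems to me that because the ElementTree API handles text through the .text and .tail attributes of each element rather than as nodes in the tree, you have to deal with a lot of different cases depending on whether the element has sibling elements, whether the existing element had a .tail attribute, and so on. Have I missed some easy way of doing this?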
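To make the problem concrete, here is a small sketch of where lxml keeps the text around an <r/> element. The sample string and the variable names are only illustrative and are not part of the input above.

```python
from lxml import etree

# One <m> element with text before, between and after its two children.
m = etree.fromstring('<m>Text before <r/> and after<b/> trailing</m>')
r, b = m  # iterating an element yields its child elements

print(repr(m.text))  # 'Text before '  (text before the first child lives on the parent)
print(repr(r.tail))  # ' and after'    (text after <r/> lives on <r/> itself)
print(repr(b.tail))  # ' trailing'
print(repr(r.text))  # None            (<r/> has no content of its own)
```

So the text that a replacement has to be merged into sits either on the parent's .text or on the previous sibling's .tail, depending on where the element is.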
-
I think unutbu's XSLT solution is probably the correct way to achieve your goal.
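For readers who have not seen that approach, the following is a rough sketch of how an XSLT-based replacement can look with lxml's etree.XSLT. It is not necessarily the stylesheet from unutbu's answer: the identity-transform stylesheet, the one-line sample document and all variable names are my own assumptions.

```python
from lxml import etree

# Assumed stylesheet: an identity transform that copies everything,
# plus one template that emits the text DELETED in place of each <r>.
stylesheet = etree.XML('''\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="r">DELETED</xsl:template>
</xsl:stylesheet>''')

transform = etree.XSLT(stylesheet)
doc = etree.fromstring('<m>Text before <r/> and after</m>')
result = transform(doc)  # returns a new tree; doc itself is not modified

output = etree.tostring(result.getroot(), encoding='unicode')
print(output)
```

The transform builds a new tree instead of editing the parsed one in place, so the .text/.tail bookkeeping is left to the XSLT processor.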
However, here is a somewhat hacky way to do it: modify the tails of the <r/> tags and then use etree.strip_elements.
from lxml import etree

data = '''<everything>
<m>Some text before <r/></m>
<m><r/> and some text after.</m>
<m><r/></m>
<m>Text before <r/> and after</m>
<m><b/> Text after a sibling <r/> Text before a sibling<b/></m>
</everything>
'''

f = etree.fromstring(data)
for r in f.xpath('//r'):
    # Prepend the replacement text to the tail (the tail may be None).
    r.tail = 'DELETED' + (r.tail or '')

# Strip the <r> elements but keep their (modified) tails in the tree.
etree.strip_elements(f, 'r', with_tail=False)

print(etree.tostring(f, pretty_print=True).decode())
This gives you:
<everything>
<m>Some text before DELETED</m>
<m>DELETED and some text after.</m>
<m>DELETED</m>
<m>Text before DELETED and after</m>
<m><b/> Text after a sibling DELETED Text before a sibling<b/></m>
</everything>
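If you would rather handle the .text/.tail cases explicitly than rely on strip_elements, something like the following sketch should also work. The helper name replace_with_text is hypothetical (it is not an lxml function), and the sketch assumes the element being replaced is not the root.

```python
from lxml import etree

def replace_with_text(elem, text):
    """Replace elem with text, keeping whatever followed it (hypothetical helper)."""
    parent = elem.getparent()
    if parent is None:
        raise ValueError('cannot replace the root element')
    # The replacement text plus the original tail must survive the removal.
    new_text = text + (elem.tail or '')
    prev = elem.getprevious()
    if prev is not None:
        # There is a preceding sibling: append to its tail.
        prev.tail = (prev.tail or '') + new_text
    else:
        # elem is the first child: append to the parent's text.
        parent.text = (parent.text or '') + new_text
    # remove() drops elem together with its tail, which was copied above.
    parent.remove(elem)

root = etree.fromstring(
    '<m><b/> Text after a sibling <r/> Text before a sibling<b/></m>')
for r in root.xpath('//r'):
    replace_with_text(r, 'DELETED')

serialized = etree.tostring(root, encoding='unicode')
print(serialized)
# <m><b/> Text after a sibling DELETED Text before a sibling<b/></m>
```

There are only two cases to cover (a previous sibling exists, or it does not), because the text in front of an element is always stored in one of those two places.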