Removing HTML tags when scraping Wikipedia with Python's urllib2 and BeautifulSoup

Posted on 2021-01-29 16:41:38

I am trying to scrape Wikipedia to get some data for text mining, using Python's urllib2 and BeautifulSoup. My question: is there an easy way to strip the unwanted tags (for example `a` links or `span` elements) from the text I read?

For this scenario:

import urllib2
from BeautifulSoup import BeautifulSoup

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
infile = opener.open("http://en.wikipedia.org/w/index.php?title=data_mining&printable=yes")
pool = BeautifulSoup(infile.read())
res = pool.findAll('div', attrs={'class': 'mw-content-ltr'})  # jump straight to the article body
paragraphs = res[0].findAll("p")  # get all paragraphs

The paragraphs I get come with lots of reference markers, for example:

paragraphs[0] =

<p><b>Data mining</b> (the analysis step of the <b>knowledge discovery in databases</b> process,<sup id="cite_ref-Fayyad_0-0" class="reference"><a href="#cite_note-Fayyad-0"><span>[</span>1<span>]</span></a></sup> or KDD), a relatively young and interdisciplinary field of <a href="/wiki/Computer_science" title="Computer science">computer science</a><sup id="cite_ref-acm_1-0" class="reference"><a href="#cite_note-acm-1"><span>[</span>2<span>]</span></a></sup><sup id="cite_ref-brittanica_2-0" class="reference"><a href="#cite_note-brittanica-2"><span>[</span>3<span>]</span></a></sup> is the process of discovering new patterns from large <a href="/wiki/Data_set" title="Data set">data sets</a> involving methods at the intersection of <a href="/wiki/Artificial_intelligence" title="Artificial intelligence">artificial intelligence</a>, <a href="/wiki/Machine_learning" title="Machine learning">machine learning</a>, <a href="/wiki/Statistics" title="Statistics">statistics</a> and <a href="/wiki/Database_system" title="Database system">database systems</a>.<sup id="cite_ref-acm_1-1" class="reference"><a href="#cite_note-acm-1"><span>[</span>2<span>]</span></a></sup> The goal of data mining is to extract knowledge from a data set in a human-understandable structure<sup id="cite_ref-acm_1-2" class="reference"><a href="#cite_note-acm-1"><span>[</span>2<span>]</span></a></sup> and involves database and <a href="/wiki/Data_management" title="Data management">data management</a>, <a href="/wiki/Data_Pre-processing" title="Data Pre-processing">data preprocessing</a>, <a href="/wiki/Statistical_model" title="Statistical model">model</a> and <a href="/wiki/Statistical_inference" title="Statistical inference">inference</a> considerations, interestingness metrics, <a href="/wiki/Computational_complexity_theory" title="Computational complexity theory">complexity</a> considerations, post-processing of found structure, <a href="/wiki/Data_visualization" title="Data 
visualization">visualization</a> and <a href="/wiki/Online_algorithm" title="Online algorithm">online updating</a>.<sup id="cite_ref-acm_1-3" class="reference"><a href="#cite_note-acm-1"><span>[</span>2<span>]</span></a></sup></p>

Any ideas how to remove these and get the plain text?
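For comparison, here is a minimal offline sketch of this clean-up using the modern `bs4` package instead of the old BeautifulSoup 3 import; the sample HTML below is a trimmed, hypothetical stand-in for the markup shown above. The citation superscripts are removed with `decompose()` and the rest is flattened with `get_text()`:

```python
from bs4 import BeautifulSoup  # modern bs4, not the old BeautifulSoup 3 package

# Trimmed, hypothetical sample of the kind of markup the question shows.
html = ('<p><b>Data mining</b> is a field of '
        '<a href="/wiki/Computer_science">computer science</a>'
        '<sup id="cite_ref-acm_1-0" class="reference">'
        '<a href="#cite_note-acm-1"><span>[</span>2<span>]</span></a></sup>.</p>')

soup = BeautifulSoup(html, 'html.parser')

# Drop the [1], [2], ... citation superscripts entirely,
# then flatten everything that remains to plain text.
for sup in soup.find_all('sup', class_='reference'):
    sup.decompose()

text = soup.get_text()
print(text)  # Data mining is a field of computer science.
```

`decompose()` deletes the tag and its contents from the tree, so `get_text()` never sees the `[2]` marker; the surrounding links are kept but reduced to their text.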

1 Answer

  • 面试哥 (answered 2021-01-29)

    Here is how you can do it with lxml (and the lovely requests):

    import requests
    import lxml.html as lh
    from BeautifulSoup import UnicodeDammit  # BeautifulSoup 3's encoding detector

    URL = "http://en.wikipedia.org/w/index.php?title=data_mining&printable=yes"
    HEADERS = {'User-agent': 'Mozilla/5.0'}

    def lhget(*args, **kwargs):
        """Fetch a URL, guess its encoding, and parse it into an lxml tree."""
        r = requests.get(*args, **kwargs)
        html = UnicodeDammit(r.content).unicode
        tree = lh.fromstring(html)
        return tree

    def remove(el):
        """Detach an element from the tree."""
        el.getparent().remove(el)

    tree = lhget(URL, headers=HEADERS)

    # First paragraph of the article body.
    el = tree.xpath("//div[@class='mw-content-ltr']/p")[0]

    # Strip the [1], [2], ... citation superscripts.
    for ref in el.xpath("//sup[@class='reference']"):
        remove(ref)

    print lh.tostring(el, pretty_print=True)

    print el.text_content()
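The same idea can be tried offline on an inline snippet; the sample HTML below is a hypothetical stand-in for the fetched page. One subtlety worth knowing: lxml's `drop_tree()` keeps an element's tail text, whereas the `getparent().remove()` helper above discards it, which can swallow punctuation that immediately follows a citation marker:

```python
import lxml.html as lh

# Hypothetical inline sample standing in for the fetched Wikipedia page.
html = ('<div class="mw-content-ltr"><p><b>Data mining</b> is young'
        '<sup class="reference"><a href="#c1"><span>[</span>1<span>]</span></a></sup>'
        '.</p></div>')

tree = lh.fromstring(html)
p = tree.xpath("//div[@class='mw-content-ltr']/p")[0]

# drop_tree() removes the element and its children but re-attaches its
# tail text, so the trailing "." after the citation survives.
for ref in p.xpath(".//sup[@class='reference']"):
    ref.drop_tree()

print(p.text_content())  # Data mining is young.
```

Note the `.//sup` path, which restricts the search to the selected paragraph; a bare `//sup` would search from the document root.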
    

