Python - beautifulsoup - how to deal with missing closing tags

Posted on 2021-01-29 16:38:01

I would like to scrape a table from HTML using BeautifulSoup. A snippet
of the HTML is shown below. When I use table.findAll('tr') I get the entire
table and not just the individual rows (probably because the closing tags are
missing from the HTML?).

  <TABLE COLS=9 BORDER=0 CELLSPACING=3 CELLPADDING=0>
  <TR><TD><B>Artikelbezeichnung</B>
  <TD><B>Anbieter</B>
  <TD><B>Menge</B>
  <TD><B>Taxe-EK</B>
  <TD><B>Taxe-VK</B>
  <TD><B>Empf.-VK</B>
  <TD><B>FB</B>
  <TD><B>PZN</B>
  <TD><B>Nachfolge</B>

  <TR><TD>ACTIQ 200 Mikrogramm Lutschtabl.m.integr.Appl.
  <TD>Orifarm
  <TD ID=R>     30 St
  <TD ID=R>  266,67
  <TD ID=R>  336,98
  <TD>&nbsp;
  <TD>&nbsp;
  <TD>12516714
  <TD>&nbsp;

  </TABLE>

Here is my python code to show what I am struggling with:

     from bs4 import BeautifulSoup

     # data holds the HTML shown above
     soup = BeautifulSoup(data, "html.parser")
     table = soup.find_all("table")[0]
     rows = table.find_all("tr")
     for tr in rows:
         print(tr.text)
1 answer
  • 面试哥, 2021-01-29

    As stated in the Beautiful Soup documentation, html5lib parses the document
    the way a web browser does (as does lxml in this case). It will try to fix
    your document tree by adding or closing tags where needed.
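
To make "adding or closing tags where needed" concrete, here is a minimal sketch that uses only the standard library's html.parser. It is not how lxml or html5lib work internally; it only illustrates the rule a lenient parser applies to this table: a new <TR> or <TD> implicitly closes the one before it. The class name ImplicitCloseTableParser and the shortened sample table are made up for this illustration.

```python
from html.parser import HTMLParser


class ImplicitCloseTableParser(HTMLParser):
    """Collect table rows, treating a new <tr>/<td> as closing the previous one."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently open, or None
        self._cell = None  # text fragments of the cell currently open, or None

    def _close_cell(self):
        if self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None

    def _close_row(self):
        self._close_cell()
        if self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TR> arrives as "tr".
        if tag == "tr":
            self._close_row()      # a new row ends the previous one
            self._row = []
        elif tag in ("td", "th"):
            self._close_cell()     # a new cell ends the previous one
            if self._row is None:
                self._row = []
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag in ("tr", "table"):
            self._close_row()

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def close(self):
        super().close()
        self._close_row()          # flush anything still open at end of input


# Shortened version of the table from the question (assumed sample).
sample = """
<TABLE COLS=9 BORDER=0>
<TR><TD><B>Artikelbezeichnung</B>
<TD><B>Anbieter</B>
<TD><B>PZN</B>

<TR><TD>ACTIQ 200 Mikrogramm Lutschtabl.m.integr.Appl.
<TD>Orifarm
<TD>12516714
</TABLE>
"""

parser = ImplicitCloseTableParser()
parser.feed(sample)
parser.close()
for row in parser.rows:
    print(row)
```

This is only useful if installing lxml or html5lib is not an option; for real pages those parsers handle far more cases (nested tables, stray tags) than this sketch does.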

    For your example I used lxml as the parser:

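    from bs4 import BeautifulSoup

    # lxml closes the open <TR>/<TD> tags, so each row is its own element
    soup = BeautifulSoup(data, "lxml")
    table = soup.find_all("table")[0]
    rows = table.find_all("tr")
    for tr in rows:
        print(tr.get_text(strip=True))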
    

    Note that lxml added <html> and <body> tags because they weren't present in
    the source (it tries to create a well-formed document, as previously stated).
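
Once the rows come back separately, a likely next step is to split each row into its cells. The following is a rough sketch, assuming the bs4 and lxml packages are installed; the helper name table_rows and the variable names are made up here, and data is the table from the question.

```python
from bs4 import BeautifulSoup

data = """
<TABLE COLS=9 BORDER=0 CELLSPACING=3 CELLPADDING=0>
<TR><TD><B>Artikelbezeichnung</B>
<TD><B>Anbieter</B>
<TD><B>Menge</B>
<TD><B>Taxe-EK</B>
<TD><B>Taxe-VK</B>
<TD><B>Empf.-VK</B>
<TD><B>FB</B>
<TD><B>PZN</B>
<TD><B>Nachfolge</B>

<TR><TD>ACTIQ 200 Mikrogramm Lutschtabl.m.integr.Appl.
<TD>Orifarm
<TD ID=R>     30 St
<TD ID=R>  266,67
<TD ID=R>  336,98
<TD>&nbsp;
<TD>&nbsp;
<TD>12516714
<TD>&nbsp;

</TABLE>
"""


def table_rows(html, parser="lxml"):
    """Return the first table as a list of rows, each a list of cell strings."""
    soup = BeautifulSoup(html, parser)
    table = soup.find("table")
    return [
        # strip=True also drops the non-breaking space that &nbsp; decodes to
        [cell.get_text(strip=True) for cell in tr.find_all(["td", "th"])]
        for tr in table.find_all("tr")
    ]


rows = table_rows(data)
header, body = rows[0], rows[1:]

# Pair each data row with the header names for easier lookup.
records = [dict(zip(header, row)) for row in body]
for record in records:
    print(record["Artikelbezeichnung"], record["PZN"])
```

Passing parser="html5lib" instead should give the same rows if that package is installed; with "html.parser" the unclosed tags stay nested and the cell lists will not line up.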


