How to parse a table with rowspan and colspan

Posted 2021-01-29 17:15:55

First, I have read "Parsing a table with rowspan and colspan". I even answered that question. Please read on before marking this as a duplicate.

<table border="1">
  <tr>
    <th>A</th>
    <th>B</th>
  </tr>
  <tr>
    <td rowspan="2">C</td>
    <td rowspan="1">D</td>
  </tr>
  <tr>
    <td>E</td>
    <td>F</td>
  </tr>
  <tr>
    <td>G</td>
    <td>H</td>
  </tr>
</table>

It renders as

+---+---+---+
| A | B |   |
+---+---+   |
|   | D |   |
+ C +---+---+
|   | E | F |
+---+---+---+
| G | H |   |
+---+---+---+



<table border="1">
  <tr>
    <th>A</th>
    <th>B</th>
  </tr>
  <tr>
    <td rowspan="2">C</td>
    <td rowspan="2">D</td>
  </tr>
  <tr>
    <td>E</td>
    <td>F</td>
  </tr>
  <tr>
    <td>G</td>
    <td>H</td>
  </tr>
</table>

However, this one renders like this.

+---+---+-------+
| A | B |       |
+---+---+-------+
|   |   |       |
| C | D +---+---+
|   |   | E | F |
+---+---+---+---+
| G | H |       |
+---+---+---+---+

The code from my earlier answer can only parse tables in which the first row defines all the columns.

from bs4 import BeautifulSoup


def table_to_2d(table_tag):
    rows = table_tag("tr")
    cols = rows[0](["td", "th"])
    table = [[None] * len(cols) for _ in range(len(rows))]
    for row_i, row in enumerate(rows):
        for col_i, col in enumerate(row(["td", "th"])):
            insert(table, row_i, col_i, col)
    return table


def insert(table, row, col, element):
    if row >= len(table) or col >= len(table[row]):
        return
    if table[row][col] is None:
        value = element.get_text()
        table[row][col] = value
        if element.has_attr("colspan"):
            span = int(element["colspan"])
            for i in range(1, span):
                table[row][col+i] = value
        if element.has_attr("rowspan"):
            span = int(element["rowspan"])
            for i in range(1, span):
                table[row+i][col] = value
    else:
        insert(table, row, col + 1, element)

soup = BeautifulSoup('''
    <table>
        <tr><th>1</th><th>2</th><th>5</th></tr>
        <tr><td rowspan="2">3</td><td colspan="2">4</td></tr>
        <tr><td>6</td><td>7</td></tr>
    </table>''', 'html.parser')
print(table_to_2d(soup.table))

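My question is how to parse a table into a 2D array that represents exactly how the browser renders it. An explanation of how browsers render tables would also be welcome.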

1 answer
  • 面试哥, 2021-01-29

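    No, you can't just count the td and th cells. You have to scan the whole table to get the number of columns in each row, adding to that count any rowspans still active from preceding rows.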

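    In a different scenario, parsing a table with rowspans, I tracked the rowspan count per column number to make sure that data from different cells ended up in the right column. A similar technique can be used here.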

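    First, count the columns, keeping only the highest number. Keep a list of the rowspan values of 2 or greater and subtract 1 from each for every row you process. That way you know how many "extra" columns each row has. Use the highest column count to build the output matrix.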
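
A minimal sketch of that first scan, for illustration only. It assumes each row has already been reduced to a list of `(text, rowspan, colspan)` tuples with positive spans; the name `count_columns` and the tuple layout are hypothetical and not part of the code further down. Unlike that code, it carries the full width of a cell that spans both rows and columns, and it does not special-case the last cell of a row.

```python
def count_columns(rows):
    """Return the grid width for rows given as lists of (text, rowspan, colspan)."""
    pending = []  # (rows_left, width) for cells that still span into later rows
    widest = 0
    for cells in rows:
        own = sum(colspan for _text, _rowspan, colspan in cells)
        carried = sum(width for _left, width in pending)
        widest = max(widest, own + carried)
        # cells from this row with a rowspan of 2 or more stay active below
        pending += [(rowspan, colspan) for _text, rowspan, colspan in cells]
        pending = [(left - 1, width) for left, width in pending if left > 1]
    return widest


# The question's first table: C spans two rows, so E and F shift right.
table1 = [
    [("A", 1, 1), ("B", 1, 1)],
    [("C", 2, 1), ("D", 1, 1)],
    [("E", 1, 1), ("F", 1, 1)],
    [("G", 1, 1), ("H", 1, 1)],
]
print(count_columns(table1))  # 3
```

For the question's second table, where both C and D span two rows, the same function reports 4 columns.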

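    Next, loop over the rows and cells again, this time tracking rowspans in a dictionary that maps column number to active count. Again, carry anything with a value of 2 or more over to the next row. Then shift the column numbers to account for any active rowspans: the first td in a row is actually the second if a rowspan is active on column 0, and so on.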
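
The column shifting can be tried on its own. The sketch below assumes the same hypothetical `(text, rowspan, colspan)` tuples with positive spans and no overlapping cells; `starting_columns` is an illustrative name, not a function from the answer's code.

```python
def starting_columns(rows):
    """Return, per row, a list of (text, column) pairs after shifting for rowspans."""
    active = {}  # column number -> rows (including the current one) still covered
    result = []
    for cells in rows:
        col = 0
        placed = []
        for text, rowspan, colspan in cells:
            # skip columns occupied by a rowspan from an earlier row
            while active.get(col, 0):
                col += 1
            placed.append((text, col))
            # every column this cell covers stays occupied for `rowspan` rows
            for covered in range(col, col + colspan):
                active[covered] = rowspan
            col += colspan
        result.append(placed)
        # only spans of 2 or more carry over to the next row
        active = {c: left - 1 for c, left in active.items() if left > 1}
    return result


table1 = [
    [("A", 1, 1), ("B", 1, 1)],
    [("C", 2, 1), ("D", 1, 1)],
    [("E", 1, 1), ("F", 1, 1)],
    [("G", 1, 1), ("H", 1, 1)],
]
print(starting_columns(table1)[2])  # [('E', 1), ('F', 2)]
```

In the third row, E is the first cell in the markup but lands in column 1 because C still occupies column 0.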

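    Your code copies the value of spanned columns and rows into the output repeatedly; I achieve the same by looping over the colspan and rowspan numbers of a given cell (each defaulting to 1) to copy the value multiple times. I'm ignoring overlapping cells; the HTML table specifications state that overlapping cells are an error and that it is up to the user agent to resolve conflicts. In the code below, colspan trumps rowspan cells.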
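
As a small illustration of copying a value across a span, here is a sketch that works on a plain list-of-lists grid. `fill_span` is a hypothetical helper; it skips slots that fall outside the grid with an explicit bounds check, and when two cells overlap, whichever is written last simply wins (an assumption of this sketch, not a rule taken from the specification).

```python
from itertools import product


def fill_span(grid, row, col, value, rowspan=1, colspan=1):
    """Copy value into every slot the cell covers, ignoring slots outside the grid."""
    for drow, dcol in product(range(rowspan), range(colspan)):
        r, c = row + drow, col + dcol
        if r < len(grid) and c < len(grid[r]):
            grid[r][c] = value


grid = [[None] * 3 for _ in range(2)]
fill_span(grid, 0, 0, "3", rowspan=3)  # one row more than the grid has
fill_span(grid, 0, 1, "4", colspan=2)
print(grid)  # [['3', '4', '4'], ['3', None, None]]
```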

    from itertools import product
    
    def table_to_2d(table_tag):
        rowspans = []  # track pending rowspans
        rows = table_tag.find_all('tr')
    
        # first scan, see how many columns we need
        colcount = 0
        for r, row in enumerate(rows):
            cells = row.find_all(['td', 'th'], recursive=False)
            # count columns (including spanned).
            # add active rowspans from preceding rows
            # we *ignore* the colspan value on the last cell, to prevent
            # creating 'phantom' columns with no actual cells, only extended
            # colspans. This is achieved by hardcoding the last cell width as 1. 
            # a colspan of 0 means “fill until the end” but can really only apply
            # to the last cell; ignore it elsewhere. 
            colcount = max(
                colcount,
                sum(int(c.get('colspan', 1)) or 1 for c in cells[:-1]) + len(cells[-1:]) + len(rowspans))
            # update rowspan bookkeeping; 0 is a span to the bottom. 
            rowspans += [int(c.get('rowspan', 1)) or len(rows) - r for c in cells]
            rowspans = [s - 1 for s in rowspans if s > 1]
    
        # it doesn't matter if there are still rowspan numbers 'active'; no extra
        # rows to show in the table means the larger than 1 rowspan numbers in the
        # last table row are ignored.
    
        # build an empty matrix for all possible cells
        table = [[None] * colcount for row in rows]
    
        # fill matrix from row data
        rowspans = {}  # track pending rowspans, column number mapping to count
        for row, row_elem in enumerate(rows):
            span_offset = 0  # how many columns are skipped due to row and colspans 
            for col, cell in enumerate(row_elem.find_all(['td', 'th'], recursive=False)):
                # adjust for preceding row and colspans
                col += span_offset
                while rowspans.get(col, 0):
                    span_offset += 1
                    col += 1
    
                # fill table data
                rowspan = rowspans[col] = int(cell.get('rowspan', 1)) or len(rows) - row
                colspan = int(cell.get('colspan', 1)) or colcount - col
                # next column is offset by the colspan
                span_offset += colspan - 1
                value = cell.get_text()
                for drow, dcol in product(range(rowspan), range(colspan)):
                    try:
                        table[row + drow][col + dcol] = value
                        rowspans[col + dcol] = rowspan
                    except IndexError:
                        # rowspan or colspan outside the confines of the table
                        pass
    
            # update rowspan bookkeeping
            rowspans = {c: s - 1 for c, s in rowspans.items() if s > 1}
    
        return table
    

    This parses your sample table correctly:

    >>> from pprint import pprint
    >>> pprint(table_to_2d(soup.table), width=30)
    [['1', '2', '5'],
     ['3', '4', '4'],
     ['3', '6', '7']]
    

    It also handles your other examples. The first table:

    >>> table1 = BeautifulSoup('''
    ... <table border="1">
    ...   <tr>
    ...     <th>A</th>
    ...     <th>B</th>
    ...   </tr>
    ...   <tr>
    ...     <td rowspan="2">C</td>
    ...     <td rowspan="1">D</td>
    ...   </tr>
    ...   <tr>
    ...     <td>E</td>
    ...     <td>F</td>
    ...   </tr>
    ...   <tr>
    ...     <td>G</td>
    ...     <td>H</td>
    ...   </tr>
    ... </table>''', 'html.parser')
    >>> pprint(table_to_2d(table1.table), width=30)
    [['A', 'B', None],
     ['C', 'D', None],
     ['C', 'E', 'F'],
     ['G', 'H', None]]
    

    And the second:

    >>> table2 = BeautifulSoup('''
    ... <table border="1">
    ...   <tr>
    ...     <th>A</th>
    ...     <th>B</th>
    ...   </tr>
    ...   <tr>
    ...     <td rowspan="2">C</td>
    ...     <td rowspan="2">D</td>
    ...   </tr>
    ...   <tr>
    ...     <td>E</td>
    ...     <td>F</td>
    ...   </tr>
    ...   <tr>
    ...     <td>G</td>
    ...     <td>H</td>
    ...   </tr>
    ... </table>
    ... ''', 'html.parser')
    >>> pprint(table_to_2d(table2.table), width=30)
    [['A', 'B', None, None],
     ['C', 'D', None, None],
     ['C', 'D', 'E', 'F'],
     ['G', 'H', None, None]]
    

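    Last but not least, the code correctly handles spans that extend beyond the actual table, as well as "0" spans (which extend to the end), as in the following example: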

    <table border="1">
      <tr>
        <td rowspan="3">A</td>
        <td rowspan="0">B</td>
        <td>C</td>
        <td colspan="2">D</td>
      </tr>
      <tr>
        <td colspan="0">E</td>
      </tr>
    </table>
    

    There are two rows of 4 cells, even though the rowspan and colspan values might lead you to expect 3 rows and 5 columns:

    +---+---+---+---+
    |   |   | C | D |
    | A | B +---+---+
    |   |   |   E   |
    +---+---+-------+
    

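    Such overspanning is handled just as a browser would handle it: the excess is ignored, and 0 spans extend to the remaining rows or columns: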

    >>> span_demo = BeautifulSoup('''
    ... <table border="1">
    ...   <tr>
    ...     <td rowspan="3">A</td>
    ...     <td rowspan="0">B</td>
    ...     <td>C</td>
    ...     <td colspan="2">D</td>
    ...   </tr>
    ...   <tr>
    ...     <td colspan="0">E</td>
    ...   </tr>
    ... </table>''', 'html.parser')
    >>> pprint(table_to_2d(span_demo.table), width=30)
    [['A', 'B', 'C', 'D'],
     ['A', 'B', 'E', 'E']]
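
The span handling in this last example comes down to two rules: a missing attribute means 1, and a value of 0, or one that overshoots, is cut down to whatever rows or columns remain. A small sketch of those rules, assuming the raw attribute arrives as a string of digits or `None`; `effective_span` is a hypothetical helper and not a BeautifulSoup API. The code above reaches the same result differently, resolving 0 with `int(...) or ...` and dropping overshoot by catching `IndexError`.

```python
def effective_span(raw, remaining):
    """Resolve a rowspan/colspan attribute against the rows or columns left."""
    span = int(raw) if raw is not None else 1  # a missing attribute defaults to 1
    if span == 0:
        span = remaining  # "0" means: extend to the end
    return min(span, remaining)  # anything beyond the table is ignored


# Cell A: rowspan="3" in the first of two rows.
print(effective_span("3", 2))  # 2
# Cell E: colspan="0" starting at column 2 of a 4-column grid.
print(effective_span("0", 2))  # 2
```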
    

