
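Navigating the HTML DOM in Python

Posted on 2021-01-29 16:14:55

I want to write a Python script (using 3.4.3) that fetches an HTML page from a URL and can walk the DOM to find specific elements.

This is what I have so far: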

#!/usr/bin/env python
import urllib.request

def getSite(url):
    return urllib.request.urlopen(url)

if __name__ == '__main__':
    content = getSite('http://www.google.com').read()
    print(content)

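When I print content, it does print the entire HTML page, which is close to what I want... but ideally I would like to navigate the DOM rather than treat the page as one giant string.

I am still quite new to Python, but I have experience with a number of other languages (mainly Java, C#, C++, C, PHP, JS). I have done something similar in Java before, but wanted to try it in Python.

Any help is appreciated. Cheers!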


1 answer

面试哥 2021-01-29


There are many different modules you can use, for example lxml or BeautifulSoup.

Here is an lxml example:

import urllib.request

import lxml.html

# Download the page as raw bytes and parse it into an element tree
mysite = urllib.request.urlopen('http://www.google.com').read()
lxml_mysite = lxml.html.fromstring(mysite)

description = lxml_mysite.xpath("//meta[@name='description']")[0] # meta tag description
text = description.get('content') # content attribute of the tag

>>> print(text)
Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for.
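
Since the question is about walking the DOM rather than reading a single attribute, here is a minimal sketch of tree navigation with lxml. It parses a small inline page instead of a live URL so the result is reproducible; `SAMPLE_HTML` and `summarize_with_lxml` are hypothetical names made up for this illustration.

```python
import lxml.html

# Hypothetical sample page (an assumption for this sketch, not fetched from any site).
SAMPLE_HTML = """
<html>
  <head>
    <title>Sample page</title>
    <meta name="description" content="A small page for DOM traversal.">
  </head>
  <body>
    <div id="main">
      <p class="intro">Hello</p>
      <a href="/first">First link</a>
      <a href="/second">Second link</a>
    </div>
  </body>
</html>
"""


def summarize_with_lxml(html_text):
    """Parse html_text and collect a few values by moving around the tree."""
    root = lxml.html.fromstring(html_text)
    main = root.get_element_by_id("main")  # look up an element by its id
    return {
        "title": root.findtext(".//title"),                # text of <title>
        "child_tags": [child.tag for child in main],       # direct children
        "links": [a.get("href") for a in main.iter("a")],  # descendant <a> tags
        "parent_of_main": main.getparent().tag,            # step up one level
    }


lxml_summary = summarize_with_lxml(SAMPLE_HTML)
print(lxml_summary)
```

Iterating over an element yields its direct children, `iter()` walks all descendants, and `getparent()` moves back up, which covers most of the navigation one would do on a DOM node.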

And a BeautifulSoup example:

import urllib.request

from bs4 import BeautifulSoup

mysite = urllib.request.urlopen('http://www.google.com').read()
# Name the parser explicitly; omitting it makes bs4 guess and emit a warning
soup_mysite = BeautifulSoup(mysite, 'html.parser')

description = soup_mysite.find("meta", {"name": "description"}) # meta tag description
text = description['content'] # text of content attribute

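>>> print(text)
Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for.

Note that under Python 2, BeautifulSoup returned a unicode string here (its repr shows as u"..."), while lxml returned a plain byte string for ASCII-only text. Depending on your needs, that difference could be useful or harmful. Under Python 3, which the question uses, both libraries return an ordinary str.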
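
For navigating the tree with BeautifulSoup rather than reading one attribute, a minimal sketch might look like the following. It works on a small inline page so that it runs without network access; `SAMPLE_HTML` and `summarize_with_bs4` are hypothetical names used only for this illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page (an assumption for this sketch, not fetched from any site).
SAMPLE_HTML = """
<html>
  <head>
    <title>Sample page</title>
    <meta name="description" content="A small page for DOM traversal.">
  </head>
  <body>
    <div id="main">
      <p class="intro">Hello</p>
      <a href="/first">First link</a>
      <a href="/second">Second link</a>
    </div>
  </body>
</html>
"""


def summarize_with_bs4(html_text):
    """Parse html_text and collect a few values by moving around the tree."""
    soup = BeautifulSoup(html_text, "html.parser")
    main = soup.find(id="main")  # look up an element by its id
    return {
        "title": soup.title.string,  # text of <title>
        # direct child tags only (whitespace text nodes are skipped)
        "child_tags": [tag.name for tag in main.find_all(True, recursive=False)],
        "links": [a["href"] for a in main.find_all("a")],  # descendant <a> tags
        "parent_of_main": main.parent.name,                # step up one level
    }


bs4_summary = summarize_with_bs4(SAMPLE_HTML)
print(bs4_summary)
```

To run this against a real page, the bytes returned by `urllib.request.urlopen(url).read()` can be passed in place of `SAMPLE_HTML`.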
