在Python中浏览HTML DOM
我正在寻找写一个Python脚本(使用3.4.3),该脚本从URL抓取HTML页面,并且可以遍历DOM来查找特定元素。
我目前有这个:
#!/usr/bin/env python
import urllib.request
def getSite(url):
return urllib.request.urlopen(url)
if __name__ == '__main__':
content = getSite('http://www.google.com').read()
print(content)
当我打印内容时,它确实会打印出整个html页面,这与我想要的内容很接近……尽管我理想上希望能够浏览DOM而不是将其视为一个巨大的字符串。
我对Python还是很陌生,但是有多种其他语言(主要是Java,C#,C
++,C,PHP,JS)的经验。我以前用Java做过类似的事情,但想在Python中尝试一下。
任何帮助表示赞赏。干杯!
-
您可以使用许多不同的模块。例如,lxml或BeautifulSoup。
这是一个
lxml
例子:import lxml.html mysite = urllib.request.urlopen('http://www.google.com').read() lxml_mysite = lxml.html.fromstring(mysite) description = lxml_mysite.xpath("//meta[@name='description']")[0] # meta tag description text = description.get('content') # content attribute of the tag >>> print(text) "Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for."
还有一个
BeautifulSoup
例子:from bs4 import BeautifulSoup mysite = urllib.request.urlopen('http://www.google.com').read() soup_mysite = BeautifulSoup(mysite) description = soup_mysite.find("meta", {"name": "description"}) # meta tag description text = description['content'] # text of content attribute >>> print(text) u"Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for."
注意如何
BeautifulSoup
返回unicode字符串,而lxml
不会。根据需要,这可能有用/有害。