Python

Why does the scraped HTML differ from the page source?

Posted on 2021-01-29 15:04:25

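I am scraping a list of restaurants from a website (with permission), but I have run into a problem. The HTML that my Python script downloads differs from the HTML in the page source: fewer than half of the restaurants shown on the site appear in the HTML that Python receives. Here is my code: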

import requests
from bs4 import BeautifulSoup
from tempfile import TemporaryFile
import xlwt

url = 'https://www.example.com'

r = requests.get(url)
data = BeautifulSoup(r.text, 'html.parser')  # Name the parser explicitly
soup = data.find_all('span', {'class': 'restaurant_name'})
print(soup)

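I know this is inconvenient, but my company does not allow me to share the HTML, so I cannot show it. I would just like to know whether you know, in general, why the HTML downloaded by Python would differ from the HTML in the page source, and what I can do about it.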

Thanks in advance!


1 answer

面试哥 2021-01-29


You can use Selenium for this. It renders the web page at run time the way a browser does. Selenium can be used with Firefox, Chrome, or PhantomJS.

Selenium

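Because many sites today are built with modern JavaScript frameworks, much of their content is added by scripts after the initial HTML arrives, so we use Selenium to render the page fully. It is commonly used to build crawlers and scrapers that collect data from different pages of a website, and it is also used for web automation.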
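
Before reaching for a browser, it can help to confirm that JavaScript is really the cause by counting the target elements in the HTML that requests receives and comparing that with what the browser shows. The following is only a rough sketch using the standard library. The two sample HTML strings and the names SpanCounter and count_spans are hypothetical; restaurant_name is the class from the question.

```python
from html.parser import HTMLParser


class SpanCounter(HTMLParser):
    """Counts <span> tags that carry a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        # A class attribute may hold several space-separated class names
        classes = (dict(attrs).get("class") or "").split()
        if self.css_class in classes:
            self.count += 1


def count_spans(html, css_class="restaurant_name"):
    """Return how many matching <span> tags appear in the given HTML string."""
    parser = SpanCounter(css_class)
    parser.feed(html)
    return parser.count


# Hypothetical stand-ins: what the server sends vs. what the browser holds
# after its scripts have run.
static_html = (
    '<div id="list"><span class="restaurant_name">A</span></div>'
    '<script src="more.js"></script>'
)
rendered_html = (
    '<div id="list"><span class="restaurant_name">A</span>'
    '<span class="restaurant_name">B</span>'
    '<span class="restaurant_name">C</span></div>'
)

print(count_spans(static_html), count_spans(rendered_html))
```

In practice the first string would be r.text from the question's code, and the second would be the markup copied from the browser's element inspector. If the second count is higher, the missing entries are most likely being added by scripts.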

For more on Selenium, see http://selenium-python.readthedocs.io/. I have also written a blog post on Selenium for beginners, which is worth a look: http://blog.hassanmehmood.com/creating-your-first-crawler-in-python/

Example

from selenium import webdriver
from selenium.webdriver.firefox.options import Options

profile_link = 'http://hassanmehmood.com'


class TitleScrapper(object):

    def __init__(self):
        # Set Firefox preferences through Options (Selenium 4 style)
        options = Options()
        options.set_preference("browser.startup.homepage_override.mstone", "ignore")  # Avoid startup screen
        options.set_preference("startup.homepage_welcome_url.additional", "about:blank")

        self.driver = webdriver.Firefox(options=options)
        self.driver.set_window_size(1120, 550)

    def scrape_profile(self):
        # Load the page in a real browser so its JavaScript runs
        self.driver.get(profile_link)
        print(self.driver.title)
        self.driver.quit()  # End the session and close the browser

    def scrape(self):
        self.scrape_profile()


if __name__ == '__main__':
    scraper = TitleScrapper()
    scraper.scrape()
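
Applied to the question, the same idea might look like the sketch below: load the page in Firefox, wait until at least one span with the restaurant_name class exists, then read the element text. This is a sketch under assumptions, not a tested solution for the asker's site. The function names scrape_restaurant_names and clean_names and the 10-second timeout are made up for illustration, and it assumes Selenium 4 with Firefox and geckodriver available.

```python
def clean_names(raw_texts):
    """Strip surrounding whitespace and drop empty strings, keeping order."""
    names = []
    for text in raw_texts:
        name = text.strip()
        if name:
            names.append(name)
    return names


def scrape_restaurant_names(url, css_class="restaurant_name", timeout=10):
    """Return the text of every element with css_class after the page renders."""
    # Imported here so clean_names() stays usable without Selenium installed
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        # Block until at least one matching element is in the DOM;
        # raises TimeoutException if none appears within `timeout` seconds.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CLASS_NAME, css_class))
        )
        elements = driver.find_elements(By.CLASS_NAME, css_class)
        # Hidden elements report empty text, so filter those out
        return clean_names(element.text for element in elements)
    finally:
        driver.quit()  # Always close the browser, even on errors
```

Calling scrape_restaurant_names('https://www.example.com') would open a browser window and return a list of names. If the site loads more restaurants only as the user scrolls, waiting for the first element will not be enough, and the page would also need to be scrolled before the elements are collected.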
