Python

Splash Lua script to perform multiple clicks and visits

Published 2021-01-29 15:20:58

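I'm trying to scrape Google Scholar search results and get the BibTeX entry for every result that matches the search. Right now I have a Scrapy spider with Splash. I have a Lua script that clicks the "Cite" link and loads the modal window before getting the href of the BibTeX-formatted citation. But since there are multiple search results, and therefore multiple "Cite" links, I need to click all of them and load each individual BibTeX page.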

Here is what I have:

import scrapy
from scrapy_splash import SplashRequest


class CiteSpider(scrapy.Spider):
    name = "cite"
    allowed_domains = ["scholar.google.com", "scholar.google.ae"]
    start_urls = [
        'https://scholar.google.ae/scholar?q="thermodynamics"&hl=en'
    ]

    script = """
        function main(splash)
          local url = splash.args.url
          assert(splash:go(url))
          assert(splash:wait(0.5))
          splash:runjs('document.querySelectorAll("a.gs_nph[aria-controls=gs_cit]")[0].click()')
          splash:wait(3)
          local href = splash:evaljs('document.querySelectorAll(".gs_citi")[0].href')
          assert(splash:go(href))
          return {
            html = splash:html(),
            png = splash:png(),
            href=href,
          }
        end
        """

    def parse(self, response):
        yield SplashRequest(self.start_urls[0], self.parse_bib,
                            endpoint="execute",
                            args={"lua_source": self.script})

    def parse_bib(self, response):
        filename = response.url.split("/")[-2] + '.html'
        # The extracted BibTeX is a str, so the file is opened in text mode
        with open(filename, 'w', encoding='utf-8') as f:
            f.write(response.css("body > pre::text").extract()[0])

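I think I should pass the index of the "Cite" link into the Lua script when making the querySelectorAll call, but I can't seem to find a way to pass another variable into the function. Also, I assume I would have to do some dirty JavaScript history.back() to return to the original results page after getting the BibTeX, but I feel there is a more elegant way to handle this.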

1 Answer

面试哥 2021-01-29

OK, so I came up with a working solution. First, the Lua script needs to vary per link, so we turn it into a function:

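def script(n):
    # n is the 0-based index of the "Cite" link to click.
    # str.format() fills the two {} slots in order: the index first, then the
    # Lua return table, which is passed as an argument because its own braces
    # would otherwise be read as a placeholder.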
    _script = """
        function main(splash)
          local url = splash.args.url
          local href = ""
          assert(splash:go(url))
          assert(splash:wait(0.5))
          splash:runjs('document.querySelectorAll("a.gs_nph[aria-controls=gs_cit]")[{}].click()')
          splash:wait(3)
          href = splash:evaljs('document.querySelectorAll("a.gs_citi")[0].href')
          assert(splash:go(href))
          return {}
        end
        """.format(n, "{html=splash:html(),png=splash:png(), href=href,}")
    return _script
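
The second `{}` in the template above is there only because `str.format` would otherwise try to interpret the braces of the Lua return table. A minimal sketch of one alternative, assuming the same selectors as in the answer: double the braces that belong to Lua so that the index is the only placeholder. The names `LUA_TEMPLATE` and `build_script` are hypothetical and not part of the original answer.

```python
# Hypothetical variant of script(n): the Lua logic is unchanged, but Lua's own
# braces are escaped as {{ }} so that str.format only substitutes {index}.
LUA_TEMPLATE = """
function main(splash)
  local url = splash.args.url
  assert(splash:go(url))
  assert(splash:wait(0.5))
  splash:runjs('document.querySelectorAll("a.gs_nph[aria-controls=gs_cit]")[{index}].click()')
  splash:wait(3)
  local href = splash:evaljs('document.querySelectorAll("a.gs_citi")[0].href')
  assert(splash:go(href))
  return {{html = splash:html(), png = splash:png(), href = href}}
end
"""


def build_script(index):
    """Return the Lua source that clicks the "Cite" link at `index` (0-based)."""
    # int() rejects anything that is not a plain number before it reaches the JS
    return LUA_TEMPLATE.format(index=int(index))
```

Calling `build_script(3)` yields a script whose click targets `[3]`, and the `return` line comes out with single braces as Lua expects.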

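Then I had to modify the parse function so that it clicks every "Cite" link on the page. The way to do this is to iterate over all the matching "Cite" links on the page and click each one individually. I made the Lua script load the page again (which is dirty, but I can't think of another way) and click the "Cite" link at the requested index. It also has to make duplicate requests to the same URL, which is why dont_filter=True is there: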

    def parse(self, response):
        n = len(response.css("a.gs_nph[aria-controls=gs_cit]").extract())
        for i in range(n):
            yield SplashRequest(response.url, self.parse_bib,
                                endpoint="execute",
                                args={"lua_source": script(i)},
                                dont_filter=True)
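
The question asked how to pass another variable into the Lua function. Splash exposes the request arguments to the script as `splash.args`, so a sketch of a second approach is to keep the Lua source static and send the index as an extra key in `args`. The key name `index` and the helpers `splash_args_for` and `all_splash_args` are assumptions made for illustration, and the script has not been run against Google Scholar.

```python
# Static Lua source: the link index is read from splash.args at run time,
# so the script text is identical for every request.
LUA_CLICK_BY_ARG = """
function main(splash)
  assert(splash:go(splash.args.url))
  assert(splash:wait(0.5))
  -- 'index' is the extra key added to the request args on the Python side
  local click_js = string.format(
    'document.querySelectorAll("a.gs_nph[aria-controls=gs_cit]")[%d].click()',
    splash.args.index)
  splash:runjs(click_js)
  splash:wait(3)
  local href = splash:evaljs('document.querySelectorAll("a.gs_citi")[0].href')
  assert(splash:go(href))
  return {html = splash:html(), png = splash:png(), href = href}
end
"""


def splash_args_for(index):
    """Build the args dict for the request that targets link `index` (0-based)."""
    return {"lua_source": LUA_CLICK_BY_ARG, "index": index}


def all_splash_args(link_count):
    """Return one args dict per "Cite" link found on the results page."""
    return [splash_args_for(i) for i in range(link_count)]
```

In the spider, each dict would take the place of `args={"lua_source": script(i)}` in the `SplashRequest` call above, with the other parameters left as they are.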

Hope this helps.
