
Scrapy Splash screenshots?

Posted on 2021-01-29 17:17:59

I am scraping a website and want to take a screenshot of every page along the way. So far I have managed to piece together the following code:

import json
import base64
import scrapy
from scrapy_splash import SplashRequest


class ExtractSpider(scrapy.Spider):
    name = 'extract'

    def start_requests(self):
        url = 'https://stackoverflow.com/'
        splash_args = {
            'html': 1,
            'png': 1
        }
        yield SplashRequest(url, self.parse_result, endpoint='render.json', args=splash_args)

    def parse_result(self, response):
        png_bytes = base64.b64decode(response.data['png'])

        imgdata = base64.b64decode(png_bytes)
        filename = 'some_image.png'
        with open(filename, 'wb') as f:
            f.write(imgdata)

It reaches the site (e.g. stackoverflow) without problems and png_bytes does contain data, but the file that gets written is a corrupted image that will not load.

Is there a way to fix this, or a more efficient approach? I have read that a Splash Lua script can do this, but I have not been able to find out how. Thanks.


1 Answer

面试哥 2021-01-29


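You are decoding from base64 twice. response.data['png'] holds the base64 string, so the first b64decode already yields the raw PNG bytes; the second call then mangles them: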

       png_bytes = base64.b64decode(response.data['png'])
       imgdata = base64.b64decode(png_bytes)
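
To see why the second decode breaks the file, here is a minimal sketch that needs neither Scrapy nor Splash. The payload is a hypothetical stand-in (a real PNG signature followed by placeholder bytes), and `looks_like_png` is a helper introduced only for this illustration. When not validating, `base64.b64decode` discards bytes outside the base64 alphabet, so feeding it raw binary either raises an error or returns unrelated bytes.

```python
import base64
import binascii

# Every valid PNG file starts with this 8-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def looks_like_png(data: bytes) -> bool:
    """Cheap sanity check: does the data start with the PNG signature?"""
    return data.startswith(PNG_SIGNATURE)


# Hypothetical stand-in for a screenshot: signature plus placeholder bytes.
raw = PNG_SIGNATURE + b"hypothetical image data"

# Stand-in for response.data['png']: the same bytes, base64-encoded once.
encoded = base64.b64encode(raw).decode("ascii")

# One decode recovers the original bytes.
once = base64.b64decode(encoded)

# A second decode treats raw binary as base64 text. Depending on the bytes,
# it raises binascii.Error or silently returns garbage.
try:
    twice = base64.b64decode(once)
except binascii.Error:
    twice = None

print("decoded once is a PNG:", looks_like_png(once))
print("decoded twice is a PNG:", twice is not None and looks_like_png(twice))
```

A signature check like this before writing the file is a cheap way to catch the mistake early.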

Decode only once:

    def parse_result(self, response):
        imgdata = base64.b64decode(response.data['png'])
        filename = 'some_image.png'
        with open(filename, 'wb') as f:
            f.write(imgdata)
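
As for the Lua script mentioned in the question, the following is a rough sketch rather than a tested setup. It assumes Splash's `execute` endpoint and relies on the fact that a `splash:png()` value returned inside a Lua table arrives base64-encoded in the JSON response, so it still needs exactly one decode. The names `LUA_SCREENSHOT_SCRIPT`, `build_execute_args` and `save_screenshot` are made up for this example, and the 0.5 second wait is an arbitrary choice.

```python
import base64

# Lua script for Splash's `execute` endpoint: load the page, wait briefly,
# then return the HTML and a PNG screenshot in one table.
LUA_SCREENSHOT_SCRIPT = """
function main(splash, args)
  assert(splash:go(args.url))
  assert(splash:wait(args.wait))
  return {
    html = splash:html(),
    png = splash:png(),
  }
end
"""


def build_execute_args(wait_seconds=0.5):
    """Build the args dict to pass to SplashRequest (hypothetical helper)."""
    return {"lua_source": LUA_SCREENSHOT_SCRIPT, "wait": wait_seconds}


def save_screenshot(data, filename):
    """Decode data['png'] once and write it to filename.

    `data` is expected to look like response.data from the callback.
    Returns the number of bytes written.
    """
    imgdata = base64.b64decode(data["png"])
    with open(filename, "wb") as f:
        f.write(imgdata)
    return len(imgdata)


# Inside the spider (requires scrapy-splash), this would be used roughly as:
#
#     yield SplashRequest(url, self.parse_result,
#                         endpoint='execute', args=build_execute_args())
#
# and in parse_result:
#
#     save_screenshot(response.data, 'some_image.png')
```

For a plain full-page screenshot the `render.json` request from the question is already enough. A script mainly pays off when the page needs extra steps such as waiting, scrolling or clicking before the capture.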
