Python

How do I download Scrapy images into a dynamic folder?

Posted on 2021-01-29 15:05:31

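I can download images into the "full" folder with Scrapy, but I need the name of the target folder to be dynamic, so that it changes every time Scrapy runs, for example full/session_id.

Is there any way to do this?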


1 answer

面试哥 2021-01-29


I haven't used ImagesPipeline myself yet, but going by the documentation I would override item_completed(results, item, info).

The original definition is:

def item_completed(self, results, item, info):
    if self.IMAGES_RESULT_FIELD in item.fields:
        item[self.IMAGES_RESULT_FIELD] = [x for ok, x in results if ok]
    return item

This should give you the result set for the downloaded images, including their paths (it looks like a single item can have many images).
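To make the shape of that result set concrete, here is a small sketch that runs without Scrapy. The sample data is made up: I am assuming each entry is a (success, info) tuple and that a successful info is a dict with a 'path' key, which is what the list comprehension in the original definition relies on. A plain Exception stands in for whatever failure object Scrapy really passes.

```python
# Hypothetical sample of what item_completed receives as `results`:
# one (success, info) tuple per image. URLs, paths and checksums are
# invented for illustration only.
results = [
    (True, {"url": "http://example.com/a.jpg", "path": "full/aaa.jpg", "checksum": "c1"}),
    (False, Exception("download failed")),  # stand-in for a failed download
    (True, {"url": "http://example.com/b.jpg", "path": "full/bbb.jpg", "checksum": "c2"}),
]

# Same filtering as in item_completed: keep only the successful entries.
downloaded = [x for ok, x in results if ok]

# Each surviving entry carries the path the image was stored under.
paths = [x["path"] for x in downloaded]
print(paths)
```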

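If you now change this method in a subclass so that it moves all the files before setting the path, it should work the way you want. You could store the target folder on the item, for example in item['session_path']. You would have to set it on every item before returning/yielding the item from your spider.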

The subclass with the overridden method could look like this:

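import os
import os.path
# older Scrapy releases used scrapy.contrib.pipeline.images instead
from scrapy.pipelines.images import ImagesPipeline, ImageException

class SessionImagesPipeline(ImagesPipeline):
    def item_completed(self, results, item, info):
        moved_results = []
        # iterate over the results of all successfully downloaded images
        for result in [x for ok, x in results if ok]:
            # result['path'] is relative to IMAGES_STORE, so resolve it first
            # (assumes the default filesystem store, which exposes basedir)
            path = os.path.join(self.store.basedir, result['path'])
            # here we create the session-path where the files should be in the end
            # you'll have to change this path creation depending on your needs
            target_path = os.path.join(item['session_path'], os.path.basename(path))

            # try to move the file and raise exception if not possible
            # (os.rename returns None and signals failure by raising OSError)
            try:
                os.makedirs(item['session_path'], exist_ok=True)
                os.rename(path, target_path)
            except OSError:
                raise ImageException("Could not move image to target folder")

            # remember the result with the new path
            result['path'] = target_path
            moved_results.append(result)

        # here we'll write out the results with the new paths,
        # if there is a result field on the item (just like the original code does)
        if self.IMAGES_RESULT_FIELD in item.fields:
            item[self.IMAGES_RESULT_FIELD] = moved_results

        return item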
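The move step by itself can be tried outside Scrapy. Below is a minimal sketch using only the standard library. The function name and its parameters are my own: store_dir stands for the IMAGES_STORE directory and relative_path for the value of result['path']. It uses shutil.move rather than os.rename, which is a swap on my part so that the move also works when the session folder is on another filesystem.

```python
import os
import shutil


def move_to_session_folder(store_dir, relative_path, session_path):
    """Move one downloaded file into session_path and return the new path.

    All names here are illustrative: store_dir plays the role of
    IMAGES_STORE and relative_path the role of result['path'].
    """
    source = os.path.join(store_dir, relative_path)
    # Create the session folder on first use.
    os.makedirs(session_path, exist_ok=True)
    target = os.path.join(session_path, os.path.basename(relative_path))
    # shutil.move raises an OSError subclass if the source is missing.
    shutil.move(source, target)
    return target
```

Something like move_to_session_folder(store_dir, result['path'], item['session_path']) could then take the place of the rename inside the loop, with the returned value written back to result['path'].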

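Even nicer would be to set the desired session path not on the item but in the configuration for your Scrapy run. For that, I think you would have to find out how to set the configuration while the application is running, and you would have to override the constructor.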
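As a rough sketch of that idea, the two helpers below build one session ID per run and derive a storage path of the form full/session_id/name from it. Both function names and the timestamp scheme are hypothetical, and hashing the URL with SHA-1 is only my assumption about how a unique file name might be chosen. How the ID reaches the pipeline (for example through a custom setting passed on the command line) is left open, and I have not verified which pipeline method is the right place to return such a path in your Scrapy version.

```python
import hashlib
from datetime import datetime, timezone


def make_session_id(now=None):
    """Build one folder name per run from the start time (hypothetical scheme)."""
    now = now or datetime.now(timezone.utc)
    return now.strftime("%Y%m%dT%H%M%S")


def session_image_path(session_id, url):
    """Return 'full/<session_id>/<sha1 of url>.jpg' for an image URL."""
    # Hashing the URL gives a stable, filesystem-safe file name.
    image_guid = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return f"full/{session_id}/{image_guid}.jpg"
```

The session ID would be computed once when the crawl starts and reused for every image, so all files from one run end up in the same folder.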
