Multithreaded spider Python package with proxy support?
Other than urllib, does anyone know of an efficient package for fast, multithreaded downloading of URLs that can operate through an HTTP proxy? I know of things like Twisted, Scrapy, libcurl, and so on, but I don't know enough about them to decide between them, or even whether they can use a proxy. Thanks!
-
This is simple to do in Python.

The urlopen() function works transparently with proxies that do not require authentication. In a Unix or Windows environment, set the http_proxy, ftp_proxy, or gopher_proxy environment variables to a URL identifying the proxy server before starting the Python interpreter.
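If you would rather not rely on environment variables, Python 2's urllib.urlopen() also accepts an explicit proxies mapping; a minimal sketch, with a placeholder proxy address:

from urllib import urlopen

# 127.0.0.1:3128 is only a placeholder -- substitute your proxy's address.
proxies = {'http': 'http://127.0.0.1:3128'}
content = urlopen('http://www.python.org/', proxies=proxies).read()

Passing an empty dictionary disables proxying entirely, while the default of None falls back to the environment variables.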
Here is a simple crawler that stays within a single host and fetches pages with five worker threads; with http_proxy set as described above, every urlopen() call below goes through the proxy:

# -*- coding: utf-8 -*-
import sys
from urllib import urlopen
from BeautifulSoup import BeautifulSoup
from Queue import Queue, Empty
from threading import Thread

visited = set()
queue = Queue()

def get_parser(host, root, charset):

    def parse():
        try:
            while True:
                # get_nowait() raises Empty as soon as the queue is
                # momentarily drained, which is what stops the workers.
                url = queue.get_nowait()
                try:
                    content = urlopen(url).read().decode(charset)
                except UnicodeDecodeError:
                    continue
                for link in BeautifulSoup(content).findAll('a'):
                    try:
                        href = link['href']
                    except KeyError:
                        continue
                    # Turn host-relative links into absolute URLs.
                    if not href.startswith('http://'):
                        href = 'http://%s%s' % (host, href)
                    # Skip links that leave the host/root being crawled.
                    if not href.startswith('http://%s%s' % (host, root)):
                        continue
                    if href not in visited:
                        visited.add(href)
                        queue.put(href)
                        print href
        except Empty:
            pass

    return parse

if __name__ == '__main__':
    host, root, charset = sys.argv[1:]
    parser = get_parser(host, root, charset)
    queue.put('http://%s%s' % (host, root))
    workers = []
    for i in range(5):
        worker = Thread(target=parser)
        worker.start()
        workers.append(worker)
    for worker in workers:
        worker.join()
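To try it out (a usage sketch; spider.py is a hypothetical filename, and the three positional arguments are the host, the root path to crawl under, and the page charset):

export http_proxy="http://127.0.0.1:3128"  # optional: route all fetches through a proxy
python spider.py www.python.org / utf-8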