Multithreaded spider Python package with proxy support?
Other than urllib, does anyone know of an efficient package for fast, multithreaded downloading of URLs that can operate through an HTTP proxy? I know of things like Twisted, Scrapy, libcurl, and so on, but I don't know enough about them to decide between them, or even whether they can use a proxy. Thanks!
-
This is simple to do in Python.

The urlopen() function works transparently with proxies that do not require authentication. In a Unix or Windows environment, set the http_proxy, ftp_proxy, or gopher_proxy environment variables to a URL identifying the proxy server before starting the Python interpreter.
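If you would rather not rely on environment variables, Python 2's urllib.urlopen() also accepts an explicit proxies mapping; a minimal sketch, with a placeholder proxy address:

from urllib import urlopen

# 127.0.0.1:3128 is only a placeholder -- substitute your proxy's address.
proxies = {'http': 'http://127.0.0.1:3128'}
content = urlopen('http://www.python.org/', proxies=proxies).read()

Passing an empty dictionary disables proxying entirely, while the default of None falls back to the environment variables.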
Here is a simple crawler that stays within a single host and fetches pages with five worker threads; with http_proxy set as described above, every urlopen() call below goes through the proxy:

# -*- coding: utf-8 -*-
import sys
from urllib import urlopen
from BeautifulSoup import BeautifulSoup
from Queue import Queue, Empty
from threading import Thread

visited = set()
queue = Queue()

def get_parser(host, root, charset):

    def parse():
        try:
            while True:
                # get_nowait() raises Empty as soon as the queue is
                # momentarily drained, which is what stops the workers.
                url = queue.get_nowait()
                try:
                    content = urlopen(url).read().decode(charset)
                except UnicodeDecodeError:
                    continue
                for link in BeautifulSoup(content).findAll('a'):
                    try:
                        href = link['href']
                    except KeyError:
                        continue
                    # Turn host-relative links into absolute URLs.
                    if not href.startswith('http://'):
                        href = 'http://%s%s' % (host, href)
                    # Skip links that leave the host/root being crawled.
                    if not href.startswith('http://%s%s' % (host, root)):
                        continue
                    if href not in visited:
                        visited.add(href)
                        queue.put(href)
                        print href
        except Empty:
            pass

    return parse

if __name__ == '__main__':
    host, root, charset = sys.argv[1:]
    parser = get_parser(host, root, charset)
    queue.put('http://%s%s' % (host, root))
    workers = []
    for i in range(5):
        worker = Thread(target=parser)
        worker.start()
        workers.append(worker)
    for worker in workers:
        worker.join()
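To try it out (a usage sketch; spider.py is a hypothetical filename, and the three positional arguments are the host, the root path to crawl under, and the page charset):

export http_proxy="http://127.0.0.1:3128"  # optional: route all fetches through a proxy
python spider.py www.python.org / utf-8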