crawler.py source code (Python)


Project: FreeFoodCalendar, Author: Yuliang-Zou
import urllib2
from BeautifulSoup import BeautifulSoup, SoupStrainer

def crawler(urls, max_urls):
    # Breadth-first crawl: pop from the front of `urls`, stop once
    # `max_urls` pages have been fetched or the queue runs empty.
    crawled = set()     # pages successfully fetched and parsed
    queued = set(urls)  # every URL ever enqueued, to avoid duplicates
    pairs = []          # returned for the caller's API; not filled here
    while urls and len(crawled) < max_urls:
        page = urls.pop(0)
        if is_html(page) and page not in crawled:
            try:
                print(page)
                # Parse only <a> tags to keep the soup small and fast.
                links = BeautifulSoup(urllib2.urlopen(page, timeout=5).read(),
                                      parseOnlyThese=SoupStrainer('a'))
                for link in links:
                    # `domain` is a module-level constant in this project.
                    path = domain + link['href']
                    url = 'http://' + path
                    # Check membership on the same string that gets queued,
                    # so already-seen links are not enqueued twice.
                    if verify(path) and url not in queued:
                        urls.append(url)
                        queued.add(url)
                crawled.add(page)
            except Exception:
                # Skip pages that fail to download or parse, and <a> tags
                # that carry no href attribute.
                continue
    return crawled, pairs
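
The function leans on three names defined elsewhere in the project: is_html, verify, and the module-level domain. A minimal sketch of how it might be driven, with stand-in implementations (assumptions for illustration, not the project's actual helpers):

# Hypothetical stand-ins; the real is_html/verify/domain live elsewhere
# in the FreeFoodCalendar project.
domain = 'www.example.edu'  # assumed crawl domain

def is_html(url):
    # Assumption: anything without a known binary/static extension is HTML.
    return not url.lower().endswith(('.pdf', '.jpg', '.png', '.css', '.js'))

def verify(path):
    # Assumption: only follow links that stay on the crawl domain.
    # Receives domain + href, matching what crawler() passes in.
    return path.startswith(domain)

seeds = ['http://' + domain + '/events']
crawled, pairs = crawler(seeds, max_urls=50)
print('%d pages crawled' % len(crawled))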