commoncrawl_crawler.py source code

python


Project: news-please  Author: fhamborg

import subprocess


def __get_remote_index():
    """
    Gets the index of news crawl files from commoncrawl.org and returns a list of file names.
    :return: list of S3 object keys, one per line of the listing under crawl-data/CC-NEWS/
    """
    # cleanup: remove a stale temp file from a previous run (same name as used below)
    subprocess.getoutput("rm -f .tmpaws.txt")
    # get the remote info
    cmd = "aws s3 ls --recursive s3://commoncrawl/crawl-data/CC-NEWS/ --no-sign-request > .tmpaws.txt && " \
          "awk '{ print $4 }' .tmpaws.txt && " \
          "rm .tmpaws.txt"
    __logger.info('executing: %s', cmd)
    stdout_data = subprocess.getoutput(cmd)

    lines = stdout_data.splitlines()
    return lines
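
The function above relies on a shell pipeline, a temporary file, and awk to pull the fourth column out of the `aws s3 ls` listing. As a rough sketch only, the same idea can be expressed without the temp file by capturing the listing and splitting it in Python. The names `parse_s3_ls_output` and `get_remote_index` are hypothetical and are not part of news-please; the sketch assumes the listing has the usual four columns (date, time, size, key) and that the aws CLI is installed.

```python
import subprocess


def parse_s3_ls_output(listing):
    """Return the 4th whitespace-separated field of each line.

    Roughly mirrors awk '{ print $4 }', except that lines with fewer
    than four fields are skipped instead of producing an empty entry.
    """
    names = []
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) >= 4:
            names.append(fields[3])
    return names


def get_remote_index():
    """Hypothetical variant of __get_remote_index without a temp file."""
    cmd = ["aws", "s3", "ls", "--recursive",
           "s3://commoncrawl/crawl-data/CC-NEWS/", "--no-sign-request"]
    # check=True raises CalledProcessError if the aws CLI exits non-zero
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_s3_ls_output(result.stdout)
```

Keeping the parsing in its own function means it can be exercised on a saved listing without network access. Passing the command as a list also avoids going through a shell.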
