wikipedia.py source file

python

Project: samnorsk Author: gisleyt
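import io
import json
import logging
import os
from bz2 import BZ2File
from gzip import GzipFile


def articles(wiki_json_fn, limit=None):
    """Yield page articles from a dump of alternating action/source JSON lines."""
    count = 0

    # Pick the opener from the file extension; every branch reads bytes.
    _, ext = os.path.splitext(wiki_json_fn)

    if ext == '.gz':
        f = GzipFile(wiki_json_fn, mode='r')
    elif ext == '.bz2':
        f = BZ2File(wiki_json_fn, mode='r')
    else:
        f = io.open(wiki_json_fn, mode='rb')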

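    # The with block closes the file even when the generator returns early.
    with f:
        while True:
            # Each record spans two lines: an action line, then a source line.
            line = f.readline()
            if line == b'':
                break
            action = json.loads(line.decode('utf-8'))

            line = f.readline()
            if line == b'':
                break
            source = json.loads(line.decode('utf-8'))

            # is_page is defined elsewhere in wikipedia.py.
            if is_page(action, source):
                yield {'id': action['index']['_id'], 'title': source['title'], 'text': source['text']}
                count += 1

                # Log only when the count changes, not on every skipped record.
                if count % 10000 == 0:
                    logging.info("read %d articles", count)

                # Stop once `limit` articles have been yielded.
                if limit and count >= limit:
                    return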
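
The function above expects a file in which every record occupies two consecutive lines: an action line shaped like `{"index": {"_id": ...}}`, followed by a source line carrying `title` and `text`. The sketch below builds a tiny gzipped file in that layout and reads it back pair by pair. It is an illustration under assumptions, not the project's code: `is_page` here is a hypothetical stand-in (the real one is not shown in this snippet), and `read_pairs`, `write_sample`, `demo`, and the sample records are made up for the example.

```python
import gzip
import json
import os
import tempfile


def is_page(action, source):
    # Hypothetical stand-in: the real is_page is not shown in the snippet.
    return 'index' in action and 'title' in source and 'text' in source


def read_pairs(path):
    # Yield (action, source) dicts from consecutive line pairs of a gzipped file.
    with gzip.open(path, mode='rb') as f:
        while True:
            action_line = f.readline()
            source_line = f.readline()
            if action_line == b'' or source_line == b'':
                break
            yield (json.loads(action_line.decode('utf-8')),
                   json.loads(source_line.decode('utf-8')))


def write_sample(path):
    # Made-up records; the second one lacks 'text', so the stub rejects it.
    records = [
        ({'index': {'_id': '1'}}, {'title': 'Oslo', 'text': 'Oslo is a city.'}),
        ({'index': {'_id': '2'}}, {'title': 'Redirect only'}),
        ({'index': {'_id': '3'}}, {'title': 'Bergen', 'text': 'Bergen is a city.'}),
    ]
    with gzip.open(path, mode='wb') as f:
        for action, source in records:
            f.write((json.dumps(action) + '\n').encode('utf-8'))
            f.write((json.dumps(source) + '\n').encode('utf-8'))


def demo():
    # Write the sample to a temporary directory and collect the page records.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, 'sample.json.gz')
        write_sample(path)
        return [
            {'id': action['index']['_id'], 'title': source['title'], 'text': source['text']}
            for action, source in read_pairs(path)
            if is_page(action, source)
        ]


if __name__ == '__main__':
    for article in demo():
        print(article['id'], article['title'])
```

Reading both lines before parsing either keeps the pairing explicit: a file that ends after an action line with no matching source line is treated as finished rather than raising an error.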