check.py source file

python
Project: wikilinks  Author: trovdimi
import re  # required by re.compile below

def get_title2id(self, dump_date):
        # Build a {page title: page ID} mapping from the enwiki `page` SQL dump (Python 3).
        print('get_title2id...')
        title2id = {}
        # Matches the start of a namespace-0 row: (page_id,0,'page_title','
        regex = re.compile(r"\((\d+),0,'(.+?)','")
        # Uncompressed dump. For the compressed file, use
        # gzip.open(fname + '.gz', 'rt', encoding='utf-8') in place of open().
        fname = '/home/ddimitrov/data/enwiki20150304_plus_clickstream/enwiki-' + dump_date + '-page.sql'
        with open(fname, encoding='utf-8') as f:
            # Iterate over the file lazily rather than loading every line with readlines().
            for line in f:
                # Only INSERT statements carry row data.
                if not line.startswith('INSERT'):
                    continue
                for pid, title in regex.findall(line):
                    title2id[DataHandler.unescape_mysql(title)] = int(pid)

        return title2id
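
To see what the regular expression in get_title2id actually captures, here is a minimal, self-contained sketch that runs the same pattern over a made-up INSERT line. The sample rows are shortened for illustration and do not reproduce the full column layout of the real dump. The `unescape_mysql` helper is a hypothetical stand-in for `DataHandler.unescape_mysql`, whose source is not shown here, and `parse_title2id` is a name invented for this example.

```python
import re

# Same pattern as in get_title2id: (page_id,0,'page_title','
ROW_RE = re.compile(r"\((\d+),0,'(.+?)','")


def unescape_mysql(s):
    # Hypothetical, simplified stand-in for DataHandler.unescape_mysql:
    # drops the backslash in front of any escaped character (e.g. \' becomes ').
    return re.sub(r"\\(.)", lambda m: m.group(1), s)


def parse_title2id(lines):
    # Takes any iterable of text lines, so it works with an open file as well as a list.
    title2id = {}
    for line in lines:
        if not line.startswith('INSERT'):
            continue  # skip comments, DDL statements, and blank lines
        for pid, title in ROW_RE.findall(line):
            title2id[unescape_mysql(title)] = int(pid)
    return title2id


# Assumed sample data: three namespace-0 rows and one namespace-1 row.
sample = [
    "-- MySQL dump\n",
    "INSERT INTO `page` VALUES (10,0,'AccessibleComputing','',0,1),"
    "(12,0,'Anarchism','',0,0),(13,1,'Some_talk','',0,0),"
    "(14,0,'Ain\\'t_It','',0,0);\n",
]

result = parse_title2id(sample)
print(result)
```

The row with namespace 1 is skipped because the pattern hard-codes `,0,` after the page ID, and the escaped apostrophe in the last title is restored by the unescape step. Since the function only needs an iterable of text lines, the same loop should also work on a file object returned by `gzip.open(path, 'rt', encoding='utf-8')` when reading the compressed dump.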