converter.py source code

python

Project: KDDCUP2016  Author: hugochan
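def totxt(self, paperid):
        '''
        Converts HTML to plain text by extracting the text of all <p> elements from the HTML.
        '''
        # `html` is presumably lxml.html and `config` the project's settings module;
        # both are expected to be imported elsewhere in converter.py.
        infile  = config.HTML_PATH % paperid
        outfile = config.TXT_PATH % paperid

        h = html.parse(infile)
        pars = h.xpath("//p")
        text = ''.join([par.text_content() for par in pars])
        # Rejoin words that were hyphenated across a line break.
        text = text.replace("-\n", "")

        # Python 3: let open() handle the encoding; writing encoded bytes
        # to a text-mode file would raise TypeError.
        with open(outfile, 'w', encoding="UTF-8") as f:
            f.write(text)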
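The same idea (keep only paragraph text, then undo end-of-line hyphenation) can be tried without the project's `config` module or any third-party parser. The sketch below is not the author's code: it swaps in the standard library's `html.parser` for the XPath query, and the names `ParagraphText` and `html_to_text` are hypothetical. It assumes `<p>` elements are not nested, which matches the HTML content model.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collects text that appears inside <p> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_p = False   # True while the parser is inside a <p> element
        self.chunks = []    # collected text pieces, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False

    def handle_data(self, data):
        # Text inside inline children of <p> (e.g. <b>, <i>) is kept as well.
        if self.in_p:
            self.chunks.append(data)


def html_to_text(markup):
    """Return the concatenated <p> text with 'word-\\nbreak' hyphenation rejoined."""
    parser = ParagraphText()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    text = "".join(parser.chunks)
    return text.replace("-\n", "")


if __name__ == "__main__":
    sample = "<h1>Title</h1><p>con-\nverter</p><p> works</p>"
    print(html_to_text(sample))
```

As in `totxt`, paragraphs are joined with no separator, so any spacing between them has to come from the paragraph text itself.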