reuters.py source code

python

Project: MachineLearningProject  Author: ymynem
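import nltk
from nltk.corpus import reuters, stopwords

# Requires the NLTK "reuters" and "stopwords" corpora to be installed,
# e.g. via nltk.download("reuters") and nltk.download("stopwords").


def create_corpus(fileids, max_length=None):
    """
    Creates a corpus from the given Reuters fileids.
    Lowercases the text, keeps only alphabetic tokens (which drops
    punctuation and digits), and removes English stopwords.
    If max_length is set, keeps at most that many words per document.
    Returns a list of strings, one space-joined string per document.
    """
    sw = set(stopwords.words("english"))
    # Match runs of ASCII letters only; everything else acts as a separator.
    tokenizer = nltk.tokenize.RegexpTokenizer(r"[A-Za-z]+")
    corpus = []
    for doc in fileids:
        words = (w.lower() for w in tokenizer.tokenize(reuters.raw(doc)))
        words = [w for w in words if w not in sw]
        if max_length:
            # Truncate after stopword removal, so the limit counts kept words.
            words = words[:max_length]
        corpus.append(" ".join(words))
    return corpus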
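
The per-document steps in create_corpus (tokenize, lowercase, drop stopwords, truncate) can be tried without downloading any NLTK data. The sketch below is only an illustration under assumptions: `clean_text`, `DEMO_STOPWORDS`, and the sample sentence are hypothetical names and data that are not part of the original project, and `re.findall` stands in for `RegexpTokenizer`, which should give the same tokens for this pattern.

```python
import re

# Hypothetical, tiny stopword set for illustration only. The original code
# uses NLTK's full English list from stopwords.words("english").
DEMO_STOPWORDS = {"the", "a", "of", "in", "and", "to", "is"}


def clean_text(raw, stopword_set=DEMO_STOPWORDS, max_length=None):
    """Lowercase, keep alphabetic runs, drop stopwords, optionally truncate."""
    # Stand-in for RegexpTokenizer(r"[A-Za-z]+").tokenize(raw).
    words = (w.lower() for w in re.findall(r"[A-Za-z]+", raw))
    words = [w for w in words if w not in stopword_set]
    if max_length:
        # The limit applies to the words left after stopword removal.
        words = words[:max_length]
    return " ".join(words)


sample = "The price of U.S. crude oil rose 3.5% in March."
print(clean_text(sample))                # price u s crude oil rose march
print(clean_text(sample, max_length=3))  # price u s
```

Note that the letters-only pattern splits "U.S." into the separate tokens "u" and "s" and discards "3.5%" entirely. Results with the full NLTK stopword list may differ from this demo set.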