def create_corpus(fileids, max_length=None):
    """Build a list of cleaned document strings from Reuters fileids.

    Each document is lowercased, tokenized into purely alphabetic words
    (the tokenizer pattern drops punctuation and digits), filtered of
    English stopwords, optionally truncated, and re-joined into a single
    space-separated string.

    Parameters
    ----------
    fileids : iterable of str
        Reuters corpus file identifiers accepted by ``reuters.raw``.
    max_length : int or None, optional
        If given, keep at most this many words per document.

    Returns
    -------
    list of str
        One cleaned string per input fileid, in input order.
    """
    stop_words = set(stopwords.words("english"))
    # r"[A-Za-z]+" keeps alphabetic runs only, so punctuation/numbers vanish.
    tokenizer = nltk.tokenize.RegexpTokenizer(r"[A-Za-z]+")
    corpus = []
    for doc in fileids:
        words = [
            w
            for w in (t.lower() for t in tokenizer.tokenize(reuters.raw(doc)))
            if w not in stop_words
        ]
        # Explicit None check (not truthiness) so max_length=0 is honored
        # as "truncate to empty" rather than being silently ignored.
        if max_length is not None:
            words = words[:max_length]
        corpus.append(" ".join(words))
    return corpus