regex.py source code

python

Project: linkedin_recommend  Author: duggalr2
import re

import nltk
from nltk.stem.snowball import SnowballStemmer

# The original snippet references a module-level `stemmer`; a SnowballStemmer
# is assumed here so the function runs stand-alone.
stemmer = SnowballStemmer("english")


def tokenize_and_stem(text):
    """
    First tokenize by sentence, then by word, so that punctuation is caught as its own token.
    """
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # Keep only tokens containing letters (drop numeric tokens and raw punctuation),
    # and blank out a few domain-specific stop words.
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            if token in ('intern', 'student', 'and'):
                token = ''
            filtered_tokens.append(token)
    # Stem whatever survived filtering; the blanked-out tokens are skipped here.
    stems = [stemmer.stem(t) for t in filtered_tokens if len(t) > 0]
    return stems
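
For context, a minimal usage sketch follows. It is not part of the original project: the sample string and expected output are illustrative, and it assumes the NLTK punkt tokenizer data has been downloaded and that `stemmer` is the SnowballStemmer defined above.

# Hypothetical usage example (not from the original source).
# Run nltk.download('punkt') once beforehand if the tokenizer data is missing.
sample = "Software engineering intern and student, class of 2018."
print(tokenize_and_stem(sample))
# Roughly expected output: ['softwar', 'engin', 'class', 'of']
# ('intern', 'and', 'student' are blanked out; ',', '2018', '.' contain no letters.)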