utils.py source code

python

Project: LinguisticAnalysis  Author: DucAnhPhi
def preprocess(tweet):
    """Clean a raw tweet and return its tokenized text."""
    # str.lower() already returns a new string, so no explicit copy is needed
    preprocessed = tweet.lower()

    # remove some emoticons the TweetTokenizer does not know
    preprocessed = remove_emoticons(preprocessed)

    # split contractions like "he's" -> "he s",
    # by using imported contractions dictionary
    preprocessed = split_contractions(preprocessed)

    # split compounds like "next-level" -> "next level"
    preprocessed = split_compounds(preprocessed)

    # remove links
    preprocessed = remove_links(preprocessed)

    # remove all special characters and return tokenized text
    preprocessed = remove_special_characters(preprocessed)

    # drop sentences that are left empty after cleaning
    preprocessed = remove_empty_sentences(preprocessed)

    return preprocessed
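
The helpers called by preprocess are defined elsewhere in utils.py and are not shown on this page. The sketch below fills them in with simplified, hypothetical stand-ins so the order of the steps can be tried out end to end. Everything in it is an assumption: the tiny CONTRACTIONS dictionary, every regular expression, the list-of-token-lists return shape, and the name preprocess_sketch are illustrative only and are not the project's actual implementation.

```python
import re

# Hypothetical stand-in for the imported contractions dictionary (assumed content).
CONTRACTIONS = {"he's": "he s", "she's": "she s", "don't": "do not"}


def remove_emoticons(text):
    # Assumed behavior: drop a few standalone ASCII emoticons such as ":)" or ";-p".
    return re.sub(r"(?<!\S)[:;]-?[()dp](?!\S)", " ", text)


def split_contractions(text):
    # Replace each whitespace-separated word that appears in the dictionary.
    return " ".join(CONTRACTIONS.get(word, word) for word in text.split())


def split_compounds(text):
    # Turn a hyphen between two word characters into a space.
    return re.sub(r"(?<=\w)-(?=\w)", " ", text)


def remove_links(text):
    # Drop anything that looks like an http(s) URL.
    return re.sub(r"https?://\S+", " ", text)


def remove_special_characters(text):
    # Assumed return shape: one list of alphanumeric tokens per sentence.
    sentences = re.split(r"[.!?]+", text)
    return [re.findall(r"[a-z0-9]+", sentence) for sentence in sentences]


def remove_empty_sentences(sentences):
    # Keep only sentences that still contain at least one token.
    return [sentence for sentence in sentences if sentence]


def preprocess_sketch(tweet):
    # Same step order as preprocess above, using the stand-ins.
    text = tweet.lower()
    text = remove_emoticons(text)
    text = split_contractions(text)
    text = split_compounds(text)
    text = remove_links(text)
    sentences = remove_special_characters(text)
    return remove_empty_sentences(sentences)


result = preprocess_sketch("He's got a next-level idea :) See https://example.com !!! Wow.")
print(result)
```

With these stand-ins, the lowercasing step has to come first because the contraction lookup and the token pattern only match lowercase text. The real helpers may tokenize differently, for example with NLTK's TweetTokenizer, which the original comment mentions.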