How to use word_tokenize on a DataFrame

Posted on 2021-01-29 15:09:21

I recently started using the nltk module for text analysis, and I'm stuck. I want to use word_tokenize on a DataFrame so that I can get all the words used in a particular row of the DataFrame.

data example:
       text
1.   This is a very good site. I will recommend it to others.
2.   Can you please give me a call at 9983938428. have issues with the listings.
3.   good work! keep it up
4.   not a very helpful site in finding home decor.

expected output:

1.   'This','is','a','very','good','site','.','I','will','recommend','it','to','others','.'
2.   'Can','you','please','give','me','a','call','at','9983938428','.','have','issues','with','the','listings'
3.   'good','work','!','keep','it','up'
4.   'not','a','very','helpful','site','in','finding','home','decor'

Basically, I want to split out all the words and find the length of each text in the DataFrame.

I know word_tokenize can be applied to a string, but how do I apply it to an entire DataFrame?

Please help!

Thanks in advance…

1 Answer
  • 面试哥 2021-01-29

    You can use the apply method of the DataFrame API:

    import pandas as pd
    import nltk
    # nltk.download('punkt')  # run once if the Punkt tokenizer data is not installed yet

    df = pd.DataFrame({'sentences': ['This is a very good site. I will recommend it to others.', 'Can you please give me a call at 9983938428. have issues with the listings.', 'good work! keep it up']})
    # tokenize the 'sentences' column row by row
    df['tokenized_sents'] = df.apply(lambda row: nltk.word_tokenize(row['sentences']), axis=1)
    

    Output:

    >>> df
                                               sentences  \
    0  This is a very good site. I will recommend it ...   
    1  Can you please give me a call at 9983938428. h...   
    2                              good work! keep it up
    
                                         tokenized_sents  
    0  [This, is, a, very, good, site, ., I, will, re...  
    1  [Can, you, please, give, me, a, call, at, 9983...  
    2                      [good, work, !, keep, it, up]
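
    Equivalently, since only the sentences column is used, you could call apply on that single column instead of on the whole DataFrame; a minimal sketch of this variant, which gives the same result as above:

    # Series.apply passes each cell value directly, so no axis=1 is needed
    df['tokenized_sents'] = df['sentences'].apply(nltk.word_tokenize)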
    

    To find the length of each text, use apply with a lambda function again:

    df['sents_length'] = df.apply(lambda row: len(row['tokenized_sents']), axis=1)
    
    >>> df
                                               sentences  \
    0  This is a very good site. I will recommend it ...   
    1  Can you please give me a call at 9983938428. h...   
    2                              good work! keep it up
    
                                         tokenized_sents  sents_length  
    0  [This, is, a, very, good, site, ., I, will, re...            14  
    1  [Can, you, please, give, me, a, call, at, 9983...            15  
    2                      [good, work, !, keep, it, up]             6
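
    If you want to skip the second row-wise apply, the length can also be read straight off the tokenized column; a small sketch, assuming tokenized_sents already holds the token lists as above:

    # len() of each token list gives the word count per text
    df['sents_length'] = df['tokenized_sents'].apply(len)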
    

