Efficiently calculating word frequency in a string

Posted on 2021-01-29 17:33:10

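I am parsing a long string of text in Python and counting how many times each word occurs. I have a function that works, but I'm looking for advice on whether it can be made more efficient (in terms of speed), and whether there is a Python library function that already does this so that I'm not reinventing the wheel.

Can you suggest a more efficient way to find the most common words in a long string (usually over 1000 words)?

Also, what is the best way to sort the dictionary into a list where the first element is the most common word, the second element is the second most common word, and so on?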

test = """abc def-ghi jkl abc
abc"""

def calculate_word_frequency(s):
    # Post: return a list of words ordered from the most
    # frequent to the least frequent

    words = s.split()
    freq  = {}
    for word in words:
        # dict.has_key() was removed in Python 3; use the "in" operator
        if word in freq:
            freq[word] += 1
        else:
            freq[word] = 1
    return sort(freq)

def sort(d):
    # Post: sort dictionary d into list of words ordered
    # from highest freq to lowest freq
    # eg: For {"the": 3, "a": 9, "abc": 2} should be
    # sorted into the following list ["a","the","abc"]

    # Dictionaries have no sort() method; sorted() iterates over the keys,
    # and key=d.get orders them by their counts, highest first
    return sorted(d, key=d.get, reverse=True)

print(calculate_word_frequency(test))
1 Answer
  • 面试哥
    面试哥 2021-01-29

    Use collections.Counter:

    >>> from collections import Counter
    >>> test = 'abc def abc def zzz zzz'
    >>> Counter(test.split()).most_common()
    [('abc', 2), ('def', 2), ('zzz', 2)]
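
The question also asks for a plain list of words ordered from most to least frequent. A minimal sketch of that, built on Counter, might look like the following. The function name words_by_frequency and the tokenizing pattern are assumptions made for illustration and are not part of the original answer; note that the pattern splits "def-ghi" into two words, whereas str.split() would keep it as one.

```python
import re
from collections import Counter


def words_by_frequency(text):
    """Return the distinct words in text, most frequent first."""
    # Assumption: a "word" is a run of word characters (letters, digits,
    # underscore) or apostrophes, compared case-insensitively.
    # Swap in text.split() to keep the whitespace-only splitting of the question.
    words = re.findall(r"[\w']+", text.lower())
    counts = Counter(words)
    # most_common() yields (word, count) pairs in descending order of count;
    # keep only the word from each pair.
    return [word for word, _ in counts.most_common()]


test = """abc def-ghi jkl abc
abc"""
print(words_by_frequency(test))
```

If only the top few words are needed, most_common accepts a count, for example counts.most_common(10), which avoids sorting every entry.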
    

