Generating n-grams (unigrams, bigrams, etc.) and their frequencies from a large set of .txt files
I need to write a program in NLTK that breaks a corpus (a large collection of txt files) into unigrams, bigrams, trigrams, fourgrams and fivegrams. I have already written the code that feeds the files into the program.
The input is 300 .txt files written in English, and I want the output in the form of n-grams, specifically their frequency counts.
I know that NLTK has bigram and trigram modules: http://www.nltk.org/_modules/nltk/model/ngram.html
but I am not advanced enough to feed them into my program.
Input: txt files rather than single sentences
Sample output:
Bigram: [('Hi', 'How'), ('How', 'are'), ('are', 'you'), ('you', '?'), ('?', 'i'), ('i', 'am'), ('am', 'fine'), ('fine', 'and'), ('and', 'you')]
Trigram: [('Hi', 'How', 'are'), ('How', 'are', 'you'), ('are', 'you', '?'), ('you', '?', 'i'), ('?', 'i', 'am'), ('i', 'am', 'fine'), ('am', 'fine', 'and'), ('fine', 'and', 'you')]
My code so far:
    from nltk.corpus import PlaintextCorpusReader

    corpus = 'C:/Users/jack3/My folder'
    files = PlaintextCorpusReader(corpus, '.*')
    ngrams = 2

    def generate(file, ngrams):
        for gram in range(0, ngrams):
            print((file[0:-4] + "_" + str(ngrams) + "_grams.txt").replace("/", "_"))

    for file in files.fileids():
        generate(file, ngrams)
Any help on what the next step should be?
-
Just use nltk.ngrams.

    import nltk
    from nltk import word_tokenize
    from nltk.util import ngrams
    from collections import Counter

    text = ("I need to write a program in NLTK that breaks a corpus (a large "
            "collection of txt files) into unigrams, bigrams, trigrams, fourgrams "
            "and fivegrams. I need to write a program in NLTK that breaks a corpus")

    token = nltk.word_tokenize(text)
    bigrams = ngrams(token, 2)
    trigrams = ngrams(token, 3)
    fourgrams = ngrams(token, 4)
    fivegrams = ngrams(token, 5)

    print(Counter(bigrams))

This prints:

    Counter({('program', 'in'): 2, ('NLTK', 'that'): 2, ('that', 'breaks'): 2, ('write', 'a'): 2, ('breaks', 'a'): 2, ('to', 'write'): 2, ('I', 'need'): 2, ('a', 'corpus'): 2, ('need', 'to'): 2, ('a', 'program'): 2, ('in', 'NLTK'): 2, ('and', 'fivegrams'): 1, ('corpus', '('): 1, ('txt', 'files'): 1, ('unigrams', ','): 1, (',', 'trigrams'): 1, ('into', 'unigrams'): 1, ('trigrams', ','): 1, (',', 'bigrams'): 1, ('large', 'collection'): 1, ('bigrams', ','): 1, ('of', 'txt'): 1, (')', 'into'): 1, ('fourgrams', 'and'): 1, ('fivegrams', '.'): 1, ('(', 'a'): 1, (',', 'fourgrams'): 1, ('a', 'large'): 1, ('.', 'I'): 1, ('collection', 'of'): 1, ('files', ')'): 1})
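Since you want all five orders, here is a minimal sketch that tokenizes once and builds one Counter per n (the helper name all_ngram_counts is my own, not part of NLTK):

    from collections import Counter
    from nltk import word_tokenize
    from nltk.util import ngrams

    def all_ngram_counts(text, max_n=5):
        # Map each order n (1..max_n) to a Counter of its n-grams.
        tokens = word_tokenize(text)
        return {n: Counter(ngrams(tokens, n)) for n in range(1, max_n + 1)}

    counts = all_ngram_counts("Hi How are you? i am fine and you")
    print(counts[2].most_common(3))  # three most frequent bigrams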
Update (using plain Python):
    import os
    from collections import Counter

    import nltk
    from nltk.util import ngrams

    corpus = []
    path = '.'
    # next(os.walk(path))[2] is the list of files directly under path
    # (os.walk(path).next() was the Python 2 spelling)
    for i in next(os.walk(path))[2]:
        if i.endswith('.txt'):
            with open(os.path.join(path, i)) as f:
                corpus.append(f.read())

    frequencies = Counter()
    for text in corpus:
        token = nltk.word_tokenize(text)
        bigrams = ngrams(token, 2)
        frequencies += Counter(bigrams)
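And to connect this back to your PlaintextCorpusReader setup, a sketch that writes one frequency file per input file and per n-gram order; the output naming simply mirrors the pattern from your own code, and the tab-separated format is an assumption you can change:

    from collections import Counter
    from nltk.corpus import PlaintextCorpusReader
    from nltk.util import ngrams

    corpus = 'C:/Users/jack3/My folder'   # your corpus directory
    files = PlaintextCorpusReader(corpus, r'.*\.txt')

    for fileid in files.fileids():
        tokens = files.words(fileid)      # tokenized words of one file
        for n in range(1, 6):             # unigrams through fivegrams
            counts = Counter(ngrams(tokens, n))
            # output name follows the pattern from the question's code
            out_name = (fileid[0:-4] + "_" + str(n) + "_grams.txt").replace("/", "_")
            with open(out_name, 'w', encoding='utf-8') as out:
                for gram, freq in counts.most_common():
                    out.write(' '.join(gram) + '\t' + str(freq) + '\n')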