data.py 文件源码

python

阅读 47 收藏 0 点赞 0 评论 0

项目：augmented_seq2seq 作者: suriyadeepan 项目源码文件源码

def index_(tokenized_sentences, vocab_size):
    # get frequency distribution
    freq_dist = nltk.FreqDist(itertools.chain(*tokenized_sentences))
    # get vocabulary of 'vocab_size' most used words
    vocab = freq_dist.most_common(vocab_size)
    # index2word
    index2word = ['_'] + [UNK] + [ x[0] for x in vocab ]
    # word2index
    word2index = dict([(w,i) for i,w in enumerate(index2word)] )
    return index2word, word2index, freq_dist

评论列表正在加载评论...

文章目录

提
问题

写
面经

写
文章

微信
公众号

扫码关注公众号