data.py source code

python

Project: augmented_seq2seq, Author: suriyadeepan
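import itertools
import nltk

UNK = 'unk'  # assumed value; UNK is defined elsewhere in the original data.py

def index_(tokenized_sentences, vocab_size):
    # frequency distribution over every token in every sentence
    freq_dist = nltk.FreqDist(itertools.chain.from_iterable(tokenized_sentences))
    # the 'vocab_size' most frequent (word, count) pairs
    vocab = freq_dist.most_common(vocab_size)
    # index2word: index 0 is reserved for '_', index 1 for UNK, then the vocabulary words
    index2word = ['_'] + [UNK] + [word for word, _ in vocab]
    # word2index: inverse mapping from each word to its position in index2word
    word2index = {w: i for i, w in enumerate(index2word)}
    return index2word, word2index, freq_dist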
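
The snippet above only builds the lookup tables, so a small usage sketch may help show what they look like and how they could be applied. This is an illustration under assumptions, not code from the project: `collections.Counter` stands in for `nltk.FreqDist` so the example runs without NLTK, `UNK = 'unk'` is an assumed value, and `build_index` and `encode` are hypothetical names introduced here.

```python
import itertools
from collections import Counter

UNK = 'unk'  # assumed placeholder token, not confirmed by the snippet


def build_index(tokenized_sentences, vocab_size):
    # Same steps as index_, with Counter standing in for nltk.FreqDist (assumption).
    freq_dist = Counter(itertools.chain.from_iterable(tokenized_sentences))
    vocab = freq_dist.most_common(vocab_size)
    index2word = ['_'] + [UNK] + [word for word, _ in vocab]
    word2index = {w: i for i, w in enumerate(index2word)}
    return index2word, word2index, freq_dist


def encode(tokens, word2index):
    # Hypothetical helper: tokens outside the vocabulary map to the UNK index.
    unk_id = word2index[UNK]
    return [word2index.get(token, unk_id) for token in tokens]


sentences = [['the', 'cat', 'sat'], ['the', 'cat', 'ran'], ['the', 'dog']]
index2word, word2index, freq_dist = build_index(sentences, vocab_size=2)
encoded = encode(['the', 'cat', 'flew'], word2index)

print(index2word)  # ['_', 'unk', 'the', 'cat']
print(encoded)     # [2, 3, 1]
```

Note that the resulting list has `vocab_size + 2` entries, because the two reserved tokens are prepended to the `vocab_size` most frequent words.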