utils_clm.py source file

python

Project: KGP-ASR  Author: KGPML
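import collections
try:
    import cPickle  # Python 2
except ImportError:
    import pickle as cPickle  # Python 3 fallback

import numpy as np


def preprocess(self, input_file, vocab_file, tensor_file):
    # Read the whole corpus into a single string.
    with open(input_file, "r") as f:
        data = f.read()
    # data = data.lower()
    # data = re.sub("[^a-z, ']+", " ", data)  # replace unknown symbols with space

    # Count characters and order them from most to least frequent.
    counter = collections.Counter(data)
    count_pairs = sorted(counter.items(), key=lambda x: -x[1])
    self.chars, _ = zip(*count_pairs)
    self.vocab_size = len(self.chars)
    # Map each character to its frequency rank (0 = most frequent).
    self.vocab = dict(zip(self.chars, range(len(self.chars))))
    print(self.vocab)

    # Persist the character tuple so the mapping can be rebuilt later.
    with open(vocab_file, 'wb') as f:
        cPickle.dump(self.chars, f)
    # print(map(self.vocab.get, data))
    # Encode the corpus as an array of character ids and save it.
    self.tensor = np.array(list(map(self.vocab.get, data)))
    np.save(tensor_file, self.tensor)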
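
The character-to-id mapping built by preprocess can be illustrated with a small standalone sketch. The helper names build_vocab, encode and decode are hypothetical and do not appear in the original file. To keep the sketch free of third-party dependencies, the ids are held in a plain list rather than a NumPy array, and the vocabulary is pickled to an in-memory buffer rather than a file.

```python
import collections
import io
import pickle


def build_vocab(text):
    """Return (chars, vocab): chars ordered by descending frequency,
    vocab mapping each character to its rank (hypothetical helper)."""
    counter = collections.Counter(text)
    count_pairs = sorted(counter.items(), key=lambda x: -x[1])
    chars, _ = zip(*count_pairs)
    return chars, dict(zip(chars, range(len(chars))))


def encode(text, vocab):
    """Turn a string into a list of character ids."""
    return [vocab[c] for c in text]


def decode(ids, chars):
    """Invert encode: the id is simply an index into the chars tuple."""
    return "".join(chars[i] for i in ids)


text = "hello world"
chars, vocab = build_vocab(text)
ids = encode(text, vocab)

# Round-trip the vocabulary through pickle, as preprocess does with vocab_file.
buffer = io.BytesIO()
pickle.dump(chars, buffer)
buffer.seek(0)
restored_chars = pickle.load(buffer)

restored_text = decode(ids, restored_chars)
```

Because only the chars tuple is pickled, a consumer has to rebuild the dictionary with dict(zip(chars, range(len(chars)))) or index into the tuple directly, as decode does here. When the real method is used, note that np.save appends a .npy extension to tensor_file if the name does not already end with it.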