tokenization.py file source code

python

Project: models — Author: bureaucratic-labs
from itertools import islice

from tqdm import tqdm
from sklearn.model_selection import train_test_split

# text2labels and text2features are project-local helpers (defined
# elsewhere in the repository, not shown in this snippet).


def get_train_data(corpus, count=None, **kwargs):
    """Build (features, labels) pairs from `corpus` and split them into
    train/test sets; extra `kwargs` are forwarded to train_test_split."""
    X = []
    y = []

    documents = corpus.iter_documents()
    if count:
        # Consume at most `count` documents from the iterator.
        documents = islice(documents, count)

    for document in tqdm(documents):
        try:
            text = document.raw()
            words = document.words()

            labels = text2labels(text, words)
            features = list(text2features(text))

            X.append(features)
            y.append(labels)
        except Exception:
            # Skip documents that fail to tokenize or label cleanly.
            continue

    return train_test_split(X, y, **kwargs)
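The `count` parameter caps how many documents are consumed, via `itertools.islice`. A minimal stdlib-only sketch of the same accumulation pattern (the mock corpus and the feature/label stand-ins are hypothetical, and a plain ordered split replaces scikit-learn's `train_test_split`):

```python
from itertools import islice


def get_train_data_sketch(documents, count=None, test_size=0.25):
    X, y = [], []
    if count:
        documents = islice(documents, count)  # stop after `count` docs
    for text, words in documents:
        X.append([w.lower() for w in words])  # stand-in for text2features
        y.append(["TOKEN"] * len(words))      # stand-in for text2labels
    split = int(len(X) * (1 - test_size))     # ordered split, no shuffle
    return X[:split], X[split:], y[:split], y[split:]


# Mock corpus: (raw_text, word_list) pairs.
docs = [("doc %d" % i, ["w%d" % i, "x"]) for i in range(8)]
X_train, X_test, y_train, y_test = get_train_data_sketch(docs, count=4)
# With count=4 and test_size=0.25: 3 training samples, 1 test sample.
```

In the real function, `train_test_split(X, y, **kwargs)` additionally shuffles by default and accepts options such as `test_size` and `random_state` through `kwargs`.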