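CoVoST: a multilingual speech-to-text translation corpus released by Facebook
CoVoST is a multilingual speech-to-text translation corpus released by Facebook. It covers speech, transcripts and English translations for 11 languages (French, German, Dutch, Russian, Spanish, Italian, Turkish, Persian, Swedish, Mongolian and Chinese).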
CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus
CoVoST is built on Common Voice (2019-06-12 release). It includes speeches in 11 languages (French, German, Dutch, Russian, Spanish, Italian, Turkish, Persian, Swedish, Mongolian and Chinese), their transcripts and English translations. We also provide an additional out-of-domain evaluation set from Tatoeba for 5 languages (French, German, Dutch, Russian and Spanish).
Please also check out our paper for more details.
What's New
- 2020-02-27: Google Colab example added for exploring CoVoST data with VizSeq
- 2020-02-13: Paper accepted to LREC 2020 (Oral)
- 2020-02-07: CoVoST released
Getting Data
CoVoST
- Download the 2019-06-12 release of Common Voice (NOT the latest 2019-12-10 one from the web page) for speeches and transcripts.
- Download translations for all the 11 languages, where the lines of validated.<lang>_en.en are matched with the transcripts in validated.tsv.
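The matching described above can be sketched in Python as follows. This is a minimal illustration, not part of the CoVoST tooling. It assumes that validated.tsv is a tab-separated file with a header row containing `path` and `sentence` columns, and that line i of validated.<lang>_en.en translates the i-th data row of validated.tsv. The function name `pair_transcripts_with_translations` is hypothetical.

```python
import csv
from typing import List, Tuple


def pair_transcripts_with_translations(
    tsv_path: str, translation_path: str
) -> List[Tuple[str, str, str]]:
    """Return (audio path, transcript, English translation) triples.

    Assumption: the translation file has exactly one line per data row
    of the TSV, in the same order.
    """
    with open(tsv_path, encoding="utf-8", newline="") as f:
        # QUOTE_NONE keeps quote characters inside sentences untouched.
        reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        rows = list(reader)

    with open(translation_path, encoding="utf-8") as f:
        translations = [line.rstrip("\n") for line in f]

    # Fail loudly if the two files are not aligned line by line.
    if len(rows) != len(translations):
        raise ValueError(
            f"{len(rows)} transcripts but {len(translations)} translations"
        )

    return [
        (row["path"], row["sentence"], translation)
        for row, translation in zip(rows, translations)
    ]
```

For French, for example, this would be called with the paths to the French validated.tsv and validated.fr_en.en. The length check is there because a silent off-by-one between the two files would mispair every subsequent sentence.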
Tatoeba Evaluation Data
- Download transcripts and translations and extract files to data/tt/*.
- Download speech data:
  python get_tt_speech.py --root <mp3 download root (default to data/tt/mp3)>
Exploring Data
License
Content | License |
---|---|
CoVoST data | CC0 |
Tatoeba sentences | CC BY 2.0 FR |
Tatoeba speeches | Various CC licenses (please check out data/tt/tatoeba_s2t.<lang>_en.<lang>_lic) |
Anything else | CC BY-NC 4.0 |
Citation
Please cite as
@misc{wang2020covost,
title={CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus},
author={Changhan Wang and Juan Pino and Anne Wu and Jiatao Gu},
year={2020},
eprint={2002.01320},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Contact
Changhan Wang (changhan@fb.com), Juan Miguel Pino (juancarabina@fb.com), Jiatao Gu (jgu@fb.com)