CoVoST: A Multilingual Speech-to-Text Translation Corpus Released by Facebook

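CoVoST is a multilingual speech-to-text translation corpus released by Facebook. It covers speech, transcripts and English translations for 11 languages (French, German, Dutch, Russian, Spanish, Italian, Turkish, Persian, Swedish, Mongolian and Chinese).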




CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus

License: CC0-1.0

CoVoST is built on Common Voice (2019-06-12 release). It includes speech in 11 languages (French, German, Dutch, Russian, Spanish, Italian, Turkish, Persian, Swedish, Mongolian and Chinese), along with transcripts and English translations. We also provide an additional out-of-domain evaluation set from Tatoeba for 5 of these languages (French, German, Dutch, Russian and Spanish).

Please also check out our paper (arXiv:2002.01320) for more details.

CoVoST Statistics

What's New

Getting Data

CoVoST

  1. Download the 2019-06-12 release of Common Voice (NOT the later 2019-12-10 release on the web page) for speech and transcripts:

  2. Download translations for all 11 languages, where each validated.<lang>_en.en file is matched with the transcripts in the corresponding validated.tsv.
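
As a minimal sketch of how the two downloads might be joined, the snippet below pairs each row of validated.tsv with a line of validated.<lang>_en.en. It rests on assumptions that are not confirmed here: that the translation file has exactly one line per data row of validated.tsv (header excluded) in the same order, and that validated.tsv has a header row with "path" and "sentence" columns. The function name load_pairs is hypothetical and not part of this repository.

```python
import csv


def load_pairs(tsv_path, translation_path):
    """Pair each Common Voice row with its English translation.

    Assumptions (not confirmed by the CoVoST README):
      - validated.tsv has a header with "path" and "sentence" columns;
      - the translation file has one line per data row, in the same order.
    """
    with open(tsv_path, encoding="utf-8", newline="") as f:
        # QUOTE_NONE keeps literal quote characters inside sentences intact.
        reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        rows = list(reader)

    with open(translation_path, encoding="utf-8") as f:
        translations = [line.rstrip("\n") for line in f]

    # Fail loudly if the line-by-line assumption does not hold.
    if len(rows) != len(translations):
        raise ValueError(
            f"{len(rows)} transcript rows but {len(translations)} translations"
        )

    return [
        (row["path"], row["sentence"], en)
        for row, en in zip(rows, translations)
    ]
```

For French, a call might look like load_pairs("fr/validated.tsv", "validated.fr_en.en"), where both paths depend on where the archives were extracted. The length check is deliberate: if the alignment assumption is wrong, it is better to stop than to silently pair sentences with the wrong translations.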

Tatoeba Evaluation Data

  1. Download transcripts and translations and extract files to data/tt/*.

  2. Download speech data:

python get_tt_speech.py --root <mp3 download root (default to data/tt/mp3)>
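
After the download finishes, a quick sanity check can confirm that audio files actually landed in the download root. The sketch below is only an illustration: it assumes the clips are stored with an .mp3 extension somewhere under the root (data/tt/mp3 by default, as in the command above), and summarize_mp3_dir is a hypothetical helper, not something shipped with get_tt_speech.py.

```python
from pathlib import Path


def summarize_mp3_dir(root="data/tt/mp3"):
    """Return (number of .mp3 files, list of zero-byte .mp3 files) under root.

    Assumption: downloaded clips use the .mp3 extension.
    """
    files = sorted(Path(root).rglob("*.mp3"))
    # Zero-byte files usually indicate an interrupted download.
    empty = [p for p in files if p.stat().st_size == 0]
    return len(files), empty
```

Any paths returned in the second element are worth downloading again before running an evaluation.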

Exploring Data

Google Colab example

License

CoVoST data: CC0
Tatoeba sentences: CC BY 2.0 FR
Tatoeba speeches: Various CC licenses (please check out data/tt/tatoeba_s2t.<lang>_en.<lang>_lic)
Anything else: CC BY-NC 4.0

Citation

Please cite as

@misc{wang2020covost,
    title={CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus},
    author={Changhan Wang and Juan Pino and Anne Wu and Jiatao Gu},
    year={2020},
    eprint={2002.01320},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

Contact

Changhan Wang (changhan@fb.com), Juan Miguel Pino (juancarabina@fb.com), Jiatao Gu (jgu@fb.com)