MGB Challenge - Arabic Downloads

Arabic Downloads

You will not be able to download any of the MGB-2 data without first having registered and received instructions by email. Any use of this data is bound by the Arabic MGB-2 Challenge data license which must already have been signed and returned to the MGB-challenge team. See the registration page.

Description of Arabic MGB data

The MGB-2 Arabic description describes the data provided for the transcription task of the Arabic MGB-2 challenge.

We will soon add a description paper for the MGB-3 data.

Audio

Clone the GitHub repository, then run the download script from the download directory like this:

sh download.sh <your_username> <your_password> train|dev <your_local directory> wav
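If several subsets need to be fetched, the command above can be assembled programmatically. The following is only a sketch: the helper names build_download_command and download are hypothetical, the argument order is copied from the command line above, and the subset and data-type values are simply the ones that appear on this page. Which subset accepts which data type is not checked.

```python
import subprocess

# Values taken from the commands shown on this page; download.sh itself
# may accept others.
SUBSETS = {"train", "dev", "lmText", "OriginalTranscript"}
DATA_TYPES = {"wav", "xml", "tgz"}


def build_download_command(username, password, subset, local_dir, data_type):
    """Return the argument list for one download.sh invocation."""
    if subset not in SUBSETS:
        raise ValueError(f"unknown subset: {subset!r}")
    if data_type not in DATA_TYPES:
        raise ValueError(f"unknown data type: {data_type!r}")
    return ["sh", "download.sh", username, password, subset, local_dir, data_type]


def download(username, password, subset, local_dir, data_type):
    """Run download.sh from the current directory; raise if it exits non-zero."""
    cmd = build_download_command(username, password, subset, local_dir, data_type)
    subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with passwords or directory names that contain spaces.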

Audio format is

RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, mono 16000 Hz (~108MB per hour)
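A downloaded file can be checked against this format with Python's standard wave module. This is a minimal sketch; the function names are illustrative and not part of the challenge tooling. The last helper reproduces the size arithmetic: 16,000 samples per second at 2 bytes per sample is 115,200,000 bytes of samples per hour (about 110 MiB), which is close to the figure quoted above.

```python
import wave

# Expected header values for the format stated above.
EXPECTED = {"channels": 1, "sample_width_bytes": 2, "sample_rate_hz": 16000}


def wav_properties(path):
    """Read channel count, sample width and sample rate from a WAV header."""
    with wave.open(path, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": wav.getframerate(),
        }


def matches_expected_format(path):
    """True if the file is 16-bit mono PCM sampled at 16 kHz."""
    return wav_properties(path) == EXPECTED


def pcm_bytes_per_hour(sample_rate_hz=16000, sample_width_bytes=2, channels=1):
    """Raw sample bytes in one hour of audio, ignoring the small RIFF header."""
    return sample_rate_hz * sample_width_bytes * channels * 3600
```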

Audio downloads should include checksum files. If that's not the case, here is a set of checksum files for the different data subsets.
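The checksum files can be compared against the downloaded audio with a short script. The sketch below makes two assumptions that this page does not confirm: that the checksums are MD5, and that each line of a checksum file has the md5sum layout of a hex digest followed by a file name. Adjust the algorithm argument and the parser if the files differ.

```python
import hashlib
import os


def file_digest(path, algorithm="md5", chunk_size=1 << 20):
    """Hash a file in chunks so large audio files need not fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_checksum_lines(lines):
    """Parse '<hex digest>  <file name>' lines into a {name: digest} dict."""
    expected = {}
    for line in lines:
        parts = line.split(None, 1)
        if len(parts) == 2:
            digest, name = parts
            # md5sum marks binary mode with a leading '*' on the file name.
            expected[name.strip().lstrip("*")] = digest.lower()
    return expected


def find_mismatches(expected, directory, algorithm="md5"):
    """Return the names of files that are missing or whose digest differs."""
    bad = []
    for name, digest in sorted(expected.items()):
        path = os.path.join(directory, name)
        if not os.path.isfile(path) or file_digest(path, algorithm) != digest:
            bad.append(name)
    return bad
```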

The total data size is 129G for training, 1.4G for dev, and 1.4G for eval.

XML Metadata

You can download the metadata in the same way as audio by replacing wav with xml in the command above. Below is a set of tar gzipped archives that will be faster to download.

For both training and development, we will provide the original UTF-8 text as well as the normalized Buckwalter format.

MGB-2 training data (130 MB).
MGB-2 development data (1.2 MB): this is the latest development data, including overlap speech. The baseline WER can be found here.
MGB-3 adaptation data: the adaptation data is about 5 hours of multi-genre Egyptian data; each program has four different annotations.
MGB-3 development data: the development data is about 5 hours of multi-genre Egyptian data; each program has four different annotations.
MGB-3 dialect identification data and code: this repository has the data and baseline code for dialect identification.
MGB-2 i-vector features using bottleneck features: these were calculated per segment for all the MGB-2 data (may be useful for unsupervised dialect identification).

Evaluation data

Task 1: Speech-to-text transcription of broadcast television, including overlap and non-overlap speech. In 2017, we will use the same test set from MGB-2 for regression evaluation.

baseline segmentation.
audio files.
MGB-3 test data: 5 hours of multi-genre Egyptian data. We release the genre for each program and the overlap speech information. Only non-overlap speech segments will be used for the official score.

Task 2: Arabic dialect identification: We will release two hours for each dialect at evaluation time.

Arabic Pronunciation Dictionary

We suggest using a grapheme-based lexicon for this challenge.

The grapheme-based lexicon is a 1:1 word-to-grapheme mapping. You can find the latest version of the lexicon in the GitHub repository.
Participants are encouraged to improve the lexicon and share it.
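To illustrate what a 1:1 word-to-grapheme mapping looks like, here is a minimal sketch that builds lexicon lines from a word list. It assumes each character of the (Buckwalter-encoded) word is one grapheme and uses the common Kaldi-style layout of a word followed by its units; the lexicon in the repository may treat some characters differently. The function names are illustrative.

```python
def grapheme_entry(word):
    """Return one lexicon line: the word, then its characters separated by spaces."""
    return f"{word} {' '.join(word)}"


def build_lexicon(words):
    """Build sorted, de-duplicated lexicon lines from an iterable of words."""
    return [grapheme_entry(word) for word in sorted(set(words)) if word]


if __name__ == "__main__":
    for line in build_lexicon(["ktAb", "byt", "ktAb"]):
        print(line)
```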

If you prefer to use a phoneme-based lexicon, you can get it from the QCRI web portal here.

Textual Training Data for Language Modelling

Textual training data can be downloaded the same way as audio data by replacing wav with tgz.

For the textual training data, we will provide the original UTF-8 text and the normalized Buckwalter format.

sh download.sh <your_username> <your_password> lmText <your_local directory> tgz

LanguageModelText (600 MB).
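For readers unfamiliar with the Buckwalter format mentioned above, the sketch below applies the standard Buckwalter transliteration table to UTF-8 Arabic text. It covers only the character mapping; whatever additional normalization the challenge applied to produce its "normalized" text is not reproduced here, so output may differ from the distributed files. Characters outside the table are passed through unchanged.

```python
# Standard Buckwalter symbols for two contiguous Unicode ranges.
_LETTERS = "'|>&<}AbptvjHxd*rzs$SDTZEg"  # U+0621 (hamza) .. U+063A (ghain)
_REST = "_fqklmnhwYyFNKaui~o"            # U+0640 (tatweel) .. U+0652 (sukun)

BUCKWALTER = {chr(0x0621 + i): symbol for i, symbol in enumerate(_LETTERS)}
BUCKWALTER.update({chr(0x0640 + i): symbol for i, symbol in enumerate(_REST)})
BUCKWALTER["\u0670"] = "`"  # dagger alif
BUCKWALTER["\u0671"] = "{"  # alif wasla


def to_buckwalter(text):
    """Transliterate Arabic characters; leave everything else untouched."""
    return "".join(BUCKWALTER.get(char, char) for char in text)


if __name__ == "__main__":
    # kaf, teh, alif, beh
    print(to_buckwalter("\u0643\u062A\u0627\u0628"))
```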

Original Transcribed Data (MGB-2)

This data has the original transcription for each program as it is shown on the Aljazeera website.

We encourage participants to use this data to improve the initial alignment, as well as the segmentation, and to share the results back with the community to improve the quality of the data.

sh download.sh <your_username> <your_password> OriginalTranscript <your_local directory> tgz

OriginalTranscription (22 MB).

Manual Transcription Guidelines for Development and Evaluation data

Split on silence according to the shape of the wave
Segment duration should be more than 3 sec.
Verbatim transcription
Mark hesitation and repetition with # (like: today is# is )
Mark segments with overlap speech with "###" at the beginning
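The two markers in these guidelines can be handled with a few lines of code when preparing references for scoring. This is a sketch under assumptions: it takes the "#" to be attached to the end of the marked word, as in the example above, and the "###" to be the first thing in the segment text. The function name parse_segment is illustrative.

```python
def parse_segment(text):
    """Split one segment into an overlap flag, its words, and hesitation flags."""
    stripped = text.strip()
    overlap = stripped.startswith("###")
    if overlap:
        stripped = stripped[3:].strip()
    words, hesitations = [], []
    for token in stripped.split():
        marked = token.endswith("#")
        word = token.rstrip("#")
        if word:  # skip tokens that consist only of marker characters
            words.append(word)
            hesitations.append(marked)
    return overlap, words, hesitations


if __name__ == "__main__":
    print(parse_segment("today is# is"))
    print(parse_segment("### hello there"))
```

Since only non-overlap segments count toward the official score, the overlap flag can be used to filter segments before scoring.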

Scripts and Recipe

A GitHub repository provides a software package that enables research groups not familiar with Arabic ASR to get started quickly. The repository comes with no guarantees or responsibility, but feel free to email us or ask for write access to the repository.

An Arabic Kaldi recipe is available on the Kaldi website. It has a similar architecture to the best MGB-2 system; however, the shared code uses the gale_arabic data. It should be easy to use the shared recipe to reproduce last year's best results.

An n-gram SVM classifier is available as the baseline system for dialect ID.

Scoring Scripts

For speech-to-text scoring, sclite will be used. There is an open-source Global Language Mapping (GLM) file to be used in evaluation. For any issue you wish to raise or share with others, please go to the issue page on GitHub to contribute. Multi-Reference Word Error Rate (MR-WER) will be considered, to leverage the multiple annotations when evaluating dialectal ASR for languages with no orthographic rules. We also welcome ideas for multi-reference ASR evaluation.
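For intuition about the speech-to-text metric, the sketch below computes plain word error rate from a word-level edit distance and then, as a deliberately simplified stand-in for multi-reference scoring, takes the lowest WER across the available references. It is not sclite and it is not the official MR-WER definition; it applies no GLM mapping and is meant only to show how multiple annotations can soften the penalty for spelling variation.

```python
def edit_distance(reference, hypothesis):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # match or substitution
        previous = current
    return previous[-1]


def wer(reference, hypothesis):
    """Word error rate of a hypothesis string against one reference string."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


def best_reference_wer(references, hypothesis):
    """Simplified multi-reference score: the lowest WER over all references."""
    return min(wer(reference, hypothesis) for reference in references)
```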
For the Arabic dialect identification task, precision and recall will be reported, and the overall accuracy will be used for the final evaluation.
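The dialect identification metrics can be computed from paired gold and predicted labels as sketched below. The function name and the label strings in the usage example are hypothetical; the official scoring script may differ in details such as how it treats a dialect that is never predicted (scored as 0.0 precision here).

```python
from collections import Counter


def dialect_id_scores(gold, predicted):
    """Return overall accuracy plus per-dialect (precision, recall)."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and equally long")
    correct = Counter(g for g, p in zip(gold, predicted) if g == p)
    gold_counts, predicted_counts = Counter(gold), Counter(predicted)
    per_dialect = {}
    for label in sorted(set(gold) | set(predicted)):
        n_pred, n_gold = predicted_counts[label], gold_counts[label]
        precision = correct[label] / n_pred if n_pred else 0.0
        recall = correct[label] / n_gold if n_gold else 0.0
        per_dialect[label] = (precision, recall)
    accuracy = sum(correct.values()) / len(gold)
    return accuracy, per_dialect


if __name__ == "__main__":
    # Hypothetical labels, for illustration only.
    gold = ["EGY", "EGY", "GLF", "MSA"]
    predicted = ["EGY", "GLF", "GLF", "MSA"]
    print(dialect_id_scores(gold, predicted))
```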