Arabic Downloads

You will not be able to download any of the MGB-2 data without first having registered and received instructions by email. Any use of this data is bound by the Arabic MGB-2 Challenge data license which must already have been signed and returned to the MGB-challenge team. See the registration page.

Description of Arabic MGB data

The MGB-2 Arabic description describes the data provided for the transcription task of the Arabic MGB-2 challenge.

We will soon add a description paper for the MGB-3 data.

Audio

Clone the GitHub repository, then run the following from the download directory:

sh download.sh <your_username> <your_password> train|dev <your_local directory> wav
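When several subsets have to be fetched, it can help to build the command programmatically. The following is a minimal sketch, not part of the official tooling: `build_command`, `run_download`, and the `ALLOWED` table are hypothetical names, and the table only lists the subset and format pairs that appear in the commands on this page. It assumes `download.sh` sits in `script_dir`.

```python
import subprocess

# Subset/format pairs that appear in the commands on this page (assumed complete).
ALLOWED = {
    "train": {"wav", "xml"},
    "dev": {"wav", "xml"},
    "lmText": {"tgz"},
    "OriginalTranscript": {"tgz"},
}


def build_command(username, password, subset, local_dir, fmt):
    """Return the argument list for one download.sh invocation."""
    if subset not in ALLOWED:
        raise ValueError(f"unknown subset: {subset!r}")
    if fmt not in ALLOWED[subset]:
        raise ValueError(f"format {fmt!r} is not listed for subset {subset!r}")
    return ["sh", "download.sh", username, password, subset, local_dir, fmt]


def run_download(username, password, subset, local_dir, fmt, script_dir="."):
    """Run download.sh from script_dir.

    Raises subprocess.CalledProcessError if the script exits with a
    non-zero status.
    """
    cmd = build_command(username, password, subset, local_dir, fmt)
    subprocess.run(cmd, cwd=script_dir, check=True)
```

For example, `run_download(user, pw, "dev", "/data/mgb2", "wav")` would issue the same command as above for the development audio. Passing the arguments as a list avoids shell quoting problems with passwords that contain special characters.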

The audio format is:

RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, mono 16000 Hz (~108MB per hour)
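A quick way to confirm that a downloaded file matches this format is to read its header with Python's standard `wave` module. This is only a sketch; the function name `matches_mgb_format` is a hypothetical helper, not something shipped with the challenge scripts.

```python
import wave


def matches_mgb_format(path):
    """Return True if the file is 16-bit mono PCM sampled at 16000 Hz."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getnchannels() == 1              # mono
            and wav.getsampwidth() == 2          # 2 bytes = 16 bits per sample
            and wav.getframerate() == 16000      # 16 kHz
            and wav.getcomptype() == "NONE"      # uncompressed PCM
        )
```

Files that are not readable as PCM WAVE at all make `wave.open` raise `wave.Error`, so callers may want to catch that exception and treat it as a mismatch.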

Audio downloads should include checksum files. If that's not the case, here is a set of checksum files for the different data subsets.
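The layout of the checksum files is not described here. As a sketch only, the code below assumes md5sum-style files, with one `<hex digest>  <file name>` pair per line, and an MD5 digest; both are assumptions, so the algorithm is a parameter. The names `file_digest` and `find_mismatches` are hypothetical helpers.

```python
import hashlib
import os


def file_digest(path, algorithm="md5", chunk_size=1 << 20):
    """Hash a file in chunks so large audio files need not fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_mismatches(checksum_file, data_dir, algorithm="md5"):
    """Return names of listed files that are missing or whose digest differs.

    Assumes md5sum-style lines: "<hex digest>  <file name>".
    """
    bad = []
    with open(checksum_file, encoding="utf-8") as f:
        for line in f:
            parts = line.split(None, 1)
            if len(parts) != 2:
                continue  # skip blank or malformed lines
            expected = parts[0].lower()
            name = parts[1].strip().lstrip("*")  # "*" marks binary mode in md5sum output
            path = os.path.join(data_dir, name)
            if not os.path.isfile(path) or file_digest(path, algorithm) != expected:
                bad.append(name)
    return bad
```

An empty list from `find_mismatches` means every listed file was found and matched its digest under these assumptions.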

The total data size is 129G for training, 1.4G for dev, and 1.4G for eval.

XML Metadata

You can download the metadata in the same way as audio by replacing wav with xml in the command above. Below is a set of tar gzipped archives that will be faster to download.

For both training and development, we will provide the original UTF-8 text as well as a normalized Buckwalter format.
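For readers unfamiliar with Buckwalter, it is a one-to-one mapping from Arabic characters to ASCII symbols. The sketch below implements only the standard Buckwalter transliteration table; the normalization steps applied to the challenge data are not specified here and are not reproduced. The names `BUCKWALTER` and `to_buckwalter` are hypothetical.

```python
def _build_buckwalter_table():
    """Map Arabic code points to their standard Buckwalter symbols."""
    table = {}
    # U+0621..U+063A: hamza forms, then alef through ghain.
    for offset, latin in enumerate("'|>&<}AbptvjHxd*rzs$SDTZEg"):
        table[0x0621 + offset] = latin
    # U+0640..U+0652: tatweel, feh through yeh, then the diacritics
    # (tanween, short vowels, shadda, sukun).
    for offset, latin in enumerate("_fqklmnhwYyFNKaui~o"):
        table[0x0640 + offset] = latin
    table[0x0670] = "`"  # superscript (dagger) alef
    table[0x0671] = "{"  # alef wasla
    return table


BUCKWALTER = _build_buckwalter_table()


def to_buckwalter(text):
    """Transliterate Arabic letters and diacritics; other characters pass through."""
    return text.translate(BUCKWALTER)
```

With this table, `to_buckwalter("كتاب")` returns `"ktAb"`.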

Evaluation data

Arabic Pronunciation Dictionary

We suggest using a grapheme-based lexicon for this challenge.

If you prefer to use a phoneme-based lexicon, you can get it from the QCRI web portal here.

Textual Training Data for Language Modelling

Textual training data can be downloaded the same way as audio data by replacing wav with tgz.

For the textual training data, we will provide the original UTF-8 text and the normalized Buckwalter format.

sh download.sh <your_username> <your_password> lmText <your_local directory> tgz

Original Transcribed Data (MGB-2)

This data has the original transcription for each program as it is shown on the Aljazeera website.

We encourage participants to use this data to improve the initial alignment, as well as the segmentation, and to share the results back with the community to improve the quality of the data.

sh download.sh <your_username> <your_password> OriginalTranscript <your_local directory> tgz

Manual Transcription Guidelines for Development and Evaluation data

Scripts and Recipe

A GitHub repository provides a software package that enables research groups not familiar with Arabic ASR to get started quickly. The repository comes with no guarantees or responsibility, but feel free to email us or ask for write access to the repository.

The Arabic Kaldi recipe is available on the Kaldi website. It has a similar architecture to the best MGB-2 system; however, the shared code uses the gale_arabic data. It should be easy to use the shared recipe to reproduce last year's best results.

An n-gram SVM classifier is available as a baseline system for dialectal ID.

Scoring Scripts