Arabic Downloads

You will not be able to download any data without first having registered and received instructions by email. Any use of this data is bound by the Arabic MGB-Challenge data license which must already have been signed and returned to the MGB-challenge team. See the registration page.

Description of Arabic MGB data

MGB_Arabic_description describes the data provided for the transcription task of the Arabic MGB challenge.


Clone the GitHub repository, then run the download script from the download directory as follows:

sh <your_username> <your_password> train|dev <your_local directory> wav

The audio format is:

RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, mono 16000 Hz (~108MB per hour)
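As a quick sanity check after downloading, the format line above can be compared against the header of each file. The following is a minimal sketch using Python's standard `wave` module; the function names are illustrative and not part of the challenge tooling. It also works out the raw PCM payload for one hour of audio at these settings (16000 samples/s × 2 bytes × 3600 s = 115,200,000 bytes), which is in the same range as the approximate per-hour figure quoted above.

```python
import wave

# Expected parameters, taken from the format line above.
EXPECTED_CHANNELS = 1       # mono
EXPECTED_SAMPLE_WIDTH = 2   # 16 bit = 2 bytes per sample
EXPECTED_RATE = 16000       # Hz


def bytes_per_hour(rate=EXPECTED_RATE, width=EXPECTED_SAMPLE_WIDTH,
                   channels=EXPECTED_CHANNELS):
    """Raw PCM payload for one hour of audio, excluding the RIFF header."""
    return rate * width * channels * 3600


def matches_expected_format(path):
    """Return True if the WAV file at `path` is 16-bit mono PCM at 16 kHz.

    The `wave` module only reads uncompressed PCM and raises `wave.Error`
    for anything else.
    """
    with wave.open(path, "rb") as wav:
        return (wav.getnchannels() == EXPECTED_CHANNELS
                and wav.getsampwidth() == EXPECTED_SAMPLE_WIDTH
                and wav.getframerate() == EXPECTED_RATE)
```

Calling `matches_expected_format` on every downloaded file is a cheap way to catch truncated or mislabelled audio before training starts.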

Audio downloads should include checksum files. If that's not the case, here is a set of checksum files for the different data subsets.
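The checksum file format is not specified on this page. The sketch below assumes the common `md5sum` layout, one `<hex digest>  <file name>` pair per line, and an MD5 digest; both are assumptions, so adjust the `algorithm` argument and the parsing if the provided files differ. The function names are hypothetical.

```python
import hashlib
import os


def file_digest(path, algorithm="md5", chunk_size=1 << 20):
    """Hash a file in chunks so large audio files need not fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_mismatches(checksum_file, data_dir, algorithm="md5"):
    """Return the file names that are missing or whose digest differs.

    Assumes md5sum-style lines: "<hex digest>  <file name>".
    """
    bad = []
    with open(checksum_file, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            expected, name = line.split(None, 1)
            name = name.lstrip("*")  # md5sum marks binary mode with '*'
            path = os.path.join(data_dir, name)
            if (not os.path.isfile(path)
                    or file_digest(path, algorithm) != expected.lower()):
                bad.append(name)
    return bad
```

An empty list from `find_mismatches` means every listed file was found and matched; any names returned are candidates for re-downloading.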

The total data size is 129G for training, 1.4G for dev, and 1.4G for eval.

XML Metadata

You can download the metadata in the same way as audio by replacing wav with xml in the command above. Below is a set of tar gzipped archives that will be faster to download.

For both training and development, we will provide the original UTF-8 text as well as the normalized Buckwalter format.

Evaluation data

Arabic Pronunciation Dictionary

We suggest using a grapheme-based lexicon for this challenge.

If you prefer to use a phoneme-based lexicon, you can get it from the QCRI web portal here.

Textual Training Data for Language Modelling

Textual training data can be downloaded the same way as audio data by replacing wav with tgz.

For the textual training data, we will provide the original UTF-8 text and the normalized Buckwalter format.

sh <your_username> <your_password> lmText <your_local directory> tgz

Original Transcribed Data

This data contains the original transcription for each program as it is shown on the Aljazeera website.

We encourage participants to use this data to improve the initial alignment, as well as the segmentation, and to share the results back with the community to improve the quality of the data.

sh <your_username> <your_password> OriginalTranscript <your_local directory> tgz

Manual Transcription Guidelines for Development and Evaluation data

Scripts and Recipe

A GitHub repository provides a software package that enables research groups not familiar with Arabic to get started quickly. The repository comes with no guarantees or responsibility, but feel free to email us or ask for write access to the repository.

Scoring Scripts

For scoring, sclite will be used. We are happy to open-source the GLM file used in evaluation. For any issue you wish to raise or share with others, please go to the issues page on GitHub to contribute.
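The official numbers come from sclite, but for quick experiments it can help to see the word error rate it reports in isolation. The following is a simplified sketch, not a replacement for sclite: it applies no GLM mapping, treats whitespace-separated tokens as words, and the function name is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.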