You will not be able to download any of the MGB-2 data without first having registered and received instructions by email. Any use of this data is bound by the Arabic MGB-2 Challenge data license which must already have been signed and returned to the MGB-challenge team. See the registration page.
The MGB-2 Arabic description describes the data provided for the transcription task of the Arabic MGB-2 challenge.
We will soon add a description paper for the MGB-3 data.
Clone the GitHub repository, then run the download script from the download directory as follows:
sh download.sh <your_username> <your_password> train|dev <your_local directory> wav
The audio format is:
RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, mono 16000 Hz
(~108MB per hour)
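As a sanity check after downloading, the format stated above can be verified with Python's standard `wave` module. This is only a sketch: the function names are illustrative and not part of the challenge tooling. The second helper shows the arithmetic behind the per-hour size: 16000 samples/s × 2 bytes × 3600 s = 115,200,000 bytes, or about 110 MiB, which is in the same range as the figure quoted above.

```python
import wave

EXPECTED_CHANNELS = 1        # mono
EXPECTED_SAMPLE_WIDTH = 2    # 16 bit = 2 bytes per sample
EXPECTED_RATE = 16000        # Hz


def matches_expected_format(path):
    """Return True if the WAV header says 16-bit mono audio at 16 kHz."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getnchannels() == EXPECTED_CHANNELS
            and wav.getsampwidth() == EXPECTED_SAMPLE_WIDTH
            and wav.getframerate() == EXPECTED_RATE
        )


def bytes_per_hour():
    """Raw PCM payload for one hour of audio, ignoring the small RIFF header."""
    return EXPECTED_RATE * EXPECTED_SAMPLE_WIDTH * EXPECTED_CHANNELS * 3600
```

Running `matches_expected_format` over a downloaded subset is a quick way to catch truncated or mis-converted files before training.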
Audio downloads should include checksum files. If that's not the case, here is a set of checksum files for the different data subsets.
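The checksum files could be checked along these lines. This sketch makes two assumptions that the page does not confirm: that the checksums are MD5, and that each line of a checksum file follows the `md5sum` layout (hex digest, whitespace, file name). If the files use another algorithm, pass its `hashlib` name instead.

```python
import hashlib
import os


def file_digest(path, algorithm="md5", chunk_size=1 << 20):
    """Hash a file in chunks so large audio files are not loaded at once."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def parse_checksum_lines(lines):
    """Parse md5sum-style lines (assumed layout) into {file name: digest}."""
    expected = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        digest, name = line.split(None, 1)
        # md5sum marks binary mode with a leading '*' on the file name.
        expected[name.lstrip("*")] = digest.lower()
    return expected


def find_mismatches(expected, directory, algorithm="md5"):
    """Return the names of files that are missing or have a different digest."""
    bad = []
    for name, digest in sorted(expected.items()):
        path = os.path.join(directory, name)
        if not os.path.isfile(path) or file_digest(path, algorithm) != digest:
            bad.append(name)
    return bad
```

An empty list from `find_mismatches` means every listed file is present and matches its recorded digest.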
The total data size is 129G for training, 1.4G for dev, and 1.4G for eval.
You can download the metadata in the same way as audio by replacing wav with xml in the command above. Below is a set of tar gzipped archives that will be faster to download.
For both training and development, we will provide the original UTF-8 text as well as a normalized Buckwalter-format version.
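For readers unfamiliar with Buckwalter transliteration, the sketch below shows the standard character mapping from Arabic script to ASCII. It illustrates the transliteration only; the normalization applied to the challenge data is not described here and may differ from this table, so treat the output as an approximation of the provided files.

```python
# Standard Buckwalter symbols for two contiguous Unicode ranges.
_HAMZA_TO_GHAIN = "'|>&<}AbptvjHxd*rzs$SDTZEg"   # U+0621 .. U+063A
_TATWEEL_TO_SUKUN = "_fqklmnhwYyFNKaui~o"        # U+0640 .. U+0652

BUCKWALTER = {chr(0x0621 + i): c for i, c in enumerate(_HAMZA_TO_GHAIN)}
BUCKWALTER.update({chr(0x0640 + i): c for i, c in enumerate(_TATWEEL_TO_SUKUN)})


def to_buckwalter(text):
    """Transliterate Arabic characters; leave everything else unchanged."""
    return "".join(BUCKWALTER.get(ch, ch) for ch in text)


# Kaf, teh, alef, beh (the word for "book") becomes "ktAb".
print(to_buckwalter("\u0643\u062A\u0627\u0628"))
```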
We suggest using a grapheme-based lexicon for this challenge.
If you prefer to use a phoneme-based lexicon, you can get it from the QCRI web portal here.
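A grapheme-based lexicon simply spells each word as its sequence of characters, so it can be generated directly from the training vocabulary. The sketch below assumes the words are already in the normalized Buckwalter format and writes one "word followed by space-separated units" entry per line, a layout commonly used by ASR toolkits; both the function names and the output layout are assumptions, not the challenge's official lexicon format.

```python
def grapheme_lexicon(words):
    """Map each distinct word to its space-separated characters."""
    return {w: " ".join(w) for w in sorted(set(words)) if w}


def format_lexicon(lexicon):
    """Render the lexicon as one 'word g1 g2 ...' entry per line."""
    return "\n".join(f"{word} {pron}" for word, pron in lexicon.items())


print(format_lexicon(grapheme_lexicon(["ktAb", "qlm", "ktAb"])))
```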
Textual training data can be downloaded the same way as audio data by replacing wav with tgz.
For the textual training data, we will provide the original UTF-8 text and the normalized Buckwalter format.
sh download.sh <your_username> <your_password> lmText <your_local directory> tgz
This data has the original transcription for each program as shown on the Aljazeera website.
We encourage participants to use this data to improve the initial alignment, as well as the segmentation, and to share the results back with the community to improve the quality of the data.
sh download.sh <your_username> <your_password> OriginalTranscript <your_local directory> tgz
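The download commands on this page all share one shape, so a small helper can assemble them and reject combinations that are not listed here. This is a hypothetical convenience wrapper, not part of the repository; the set of subset/format pairs is taken only from the commands shown above, and the script itself may accept others.

```python
import shlex

# Subset/format pairs that appear in the commands on this page.
KNOWN_COMBINATIONS = {
    ("train", "wav"), ("dev", "wav"),
    ("train", "xml"), ("dev", "xml"),
    ("lmText", "tgz"), ("OriginalTranscript", "tgz"),
}


def download_command(username, password, subset, local_dir, fmt):
    """Build the argument list for one download.sh invocation."""
    if (subset, fmt) not in KNOWN_COMBINATIONS:
        raise ValueError(f"combination not listed on this page: {subset} {fmt}")
    return ["sh", "download.sh", username, password, subset, local_dir, fmt]


def as_shell_line(argv):
    """Quote the arguments so the command can be pasted into a shell."""
    return " ".join(shlex.quote(arg) for arg in argv)
```

The returned list can be passed to `subprocess.run(argv, check=True)` with the download directory as the working directory.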
A GitHub repository provides a software package that enables research groups not familiar with Arabic ASR to get started quickly. The repository comes with no guarantees or responsibility, but feel free to email us or ask for write access to the repository.
An Arabic Kaldi recipe is available on the Kaldi website. It has a similar architecture to the best MGB-2 system; however, the shared code uses the gale_arabic data. It should be easy to use the shared recipe to reproduce last year's best results.
An n-gram SVM classifier is available as a baseline system for dialect ID.
For speech-to-text scoring, sclite will be used. An open-source Global Language Mapping (GLM) file is available for use in evaluation. To raise an issue or share it with others, please use the issue page on GitHub. Multi-Reference Word Error Rate (MR-WER) will be considered, to make use of the multiple annotations when evaluating dialectal ASR for a language with no orthographic rules. We also welcome ideas for multi-reference ASR evaluation.
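To illustrate why multiple references matter, the sketch below computes a plain word error rate against each reference and keeps the lowest one. This is a deliberate simplification for intuition only: it is neither what sclite computes after GLM mapping nor the official MR-WER definition, and the function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Edit distance divided by the number of reference words."""
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hypothesis.split()) / len(ref)


def best_reference_wer(references, hypothesis):
    """Simplified multi-reference score: the lowest WER over all references."""
    return min(wer(ref, hypothesis) for ref in references)
```

With a single reference, a hypothesis that uses a different but equally valid spelling is penalized; taking several annotations into account removes that penalty whenever one annotator chose the same spelling.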
For the Arabic dialect identification task, precision and recall will be reported, and the overall accuracy will be used for the final evaluation.
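These dialect identification metrics can be computed from a list of gold labels and a list of predicted labels, as in the sketch below. It assumes precision and recall are reported per dialect class, which the page does not specify, and the label names in the usage line are placeholders rather than the official class set.

```python
from collections import Counter


def dialect_id_scores(gold, predicted):
    """Return ({label: (precision, recall)}, overall accuracy)."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equal length")
    gold_counts = Counter(gold)
    pred_counts = Counter(predicted)
    correct = Counter(g for g, p in zip(gold, predicted) if g == p)

    per_class = {}
    for label in sorted(set(gold) | set(predicted)):
        # A class that was never predicted (or never occurs) scores 0.0.
        precision = correct[label] / pred_counts[label] if pred_counts[label] else 0.0
        recall = correct[label] / gold_counts[label] if gold_counts[label] else 0.0
        per_class[label] = (precision, recall)

    accuracy = sum(correct.values()) / len(gold)
    return per_class, accuracy


per_class, accuracy = dialect_id_scores(
    ["EGY", "EGY", "GLF", "MSA"],   # placeholder gold labels
    ["EGY", "GLF", "GLF", "MSA"],   # placeholder system output
)
```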