MGB Challenge

The challenge

The fifth edition of the Multi-Genre Broadcast Challenge (MGB-5) is an evaluation of speech recognition and of the identification of 17 Arabic dialects, using YouTube recordings in dialectal Arabic.

MGB-5 uses 16 hours of multi-genre data collected from different YouTube channels.

In 2019, the challenge features two new Arabic tracks based on YouTube recordings. It is an official challenge at the 2019 IEEE Automatic Speech Recognition and Understanding Workshop.

Background

The fifth edition of the Multi-Genre Broadcast Challenge (MGB-5) is an evaluation of speech recognition and dialect identification techniques using YouTube recordings. The data is highly diverse, spanning the whole range of YouTube genres. Our aim is to encourage researchers to evaluate the latest research techniques using large quantities of realistic data with immediate real-world applications, and to encourage approaches to adaptation, semi-supervised learning, and unsupervised learning. You can find more details about the MGB-5 here.

In addition to the 1,200 hours from Aljazeera TV programs used in 2016, the MGB-5 explores multi-genre data: comedy, cooking, cultural, environment, family-kids, fashion, movies-drama, sports, and science talks (TEDx).

Moroccan Arabic Automatic Speech Recognition

The MGB-5 Arabic data comprises 14 hours of Moroccan Arabic speech extracted from 93 YouTube videos distributed across seven genres: comedy, cooking, family/children, fashion, drama, sports, and science clips. We assume that the MGB-5 data is not enough by itself to build robust speech recognition systems, but that it could be useful for adaptation and for hyper-parameter tuning of models built using the MGB-2 data. Therefore, we suggest reusing the MGB-2 training data in this challenge and treating the provided in-domain data as (supervised) adaptation data.

Given that dialectal Arabic does not have a clearly defined orthography, different people tend to write the same word in slightly different forms. Therefore, instead of developing strict guidelines to ensure a standardized orthography, variations in spelling are allowed. Thus multiple transcriptions were produced, allowing transcribers to write the transcripts as they deemed correct. Every file has been segmented and transcribed by four different Moroccan annotators.

The 93 YouTube clips have been manually labelled for speech and non-speech segments. About 12 minutes from each program were selected for transcription. The resulting speech segments were then distributed into train, development and test data sets as follows:

Training data: 10.2 hours from 69 programs
Development data: 1.8 hours from 10 programs
Testing data: 2.0 hours from 14 programs

In addition to the 14 transcribed hours, the full programs are also provided, which amount to 48 hours for the 93 programs. This data can be used for in-domain speech or genre adaptation.

You can find samples here: audio, segmentation, transcription in Arabic, and transcription in Buckwalter.

You can find the MGB-5 ASR baseline system here.

If you want to access the MGB-5 Moroccan Dialect corpus, you need to sign the license agreement and email us at: info@arabicspeech.org.

Fine-grained Arabic Dialect Identification (ADI)

The ADI task is to identify the dialect of speech from YouTube as one of 17 dialects (ADI17). Previous studies on Arabic dialect identification from the audio signal were limited to 5 dialect classes by the lack of speech corpora. To enable a fine-grained analysis of Arabic dialectal speech, we collected Arabic dialect data from YouTube.

For the Train set, about 3,000 hours of Arabic dialect speech from 17 countries in the Arab world were collected from YouTube. Since we collected the speech by considering the YouTube channels in a specific country, the dataset may contain some labeling errors. For this reason, we have two sub-tracks for the ADI task: a supervised learning track and an unsupervised track. Thus, the labels of the train set can either be used or not, depending entirely on the choice of participants.

For the Dev and Test sets, about 280 hours of speech data were collected from YouTube. After automatic speaker linking and dialect labeling by human annotators, we selected 57 hours of speech to use as the Dev and Test sets for performance evaluation. The test set is divided into three sub-categories by segment duration: short (under 5 sec), medium (between 5 sec and 20 sec), and long (over 20 sec).
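The duration sub-categories above can be expressed as a small helper for analysing results per bucket. This is only a sketch: the function name is hypothetical, and assigning segments of exactly 5 or 20 seconds to the medium bucket is an assumption, since the description does not say how the boundaries are handled.

```python
def duration_category(seconds: float) -> str:
    """Map a segment duration in seconds to 'short', 'medium' or 'long'.

    Assumption: boundary values (exactly 5 s or 20 s) count as 'medium'.
    """
    if seconds < 0:
        raise ValueError("duration must be non-negative")
    if seconds < 5.0:
        return "short"
    if seconds <= 20.0:
        return "medium"
    return "long"


if __name__ == "__main__":
    for d in (3.2, 12.0, 45.7):
        print(d, duration_category(d))
```

Reporting accuracy separately for each bucket makes it easier to see how much a system degrades on short utterances.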

You can find the ADI17 data here.

You can find the ADI17 baseline system here.

Evaluation tasks

Participants can enter either or both of the two tasks:

Speech-to-text transcription of broadcast data
Fine-grained Arabic Dialect Identification of Arabic audio. For this task, we are releasing about 3,000 hours of data from 17 Arabic countries.

Each task has one primary evaluation condition and up to three contrastive conditions. To enter a task, participants must submit at least one system which fulfils the primary evaluation condition. Note that signing the MGB challenge data license requires you to participate in at least one task.

Rules for all tasks

Only audio data and language model data supplied by the organisers can be used for transcription and alignment tasks. All metadata supplied with training data can be used.
Any lexicon can be used.

Transcription

This is a standard speech transcription task operating on a collection of whole TV shows drawn from diverse genres. Scoring will require ASR output with word-level timings. Segments with overlapping speech will be ignored for scoring purposes (where overlap is defined to minimise the regions removed, at segment level where possible). Speaker labels are not required in the hypothesis for scoring. The usual NIST-style mappings will be used to normalise the reference and hypothesis. In the MGB-3 competition, we will share the multi-reference word error rate (MR-WER) to explore the non-orthographic aspect of dialectal Arabic for scoring.
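To give a feel for scoring against several references, here is a simplified sketch that scores each segment against whichever reference yields the fewest word errors. It is not the official MR-WER computation or the sclite scoring used by the organisers; the function names are hypothetical, and the sample strings are made-up Buckwalter-style tokens for illustration only.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def best_reference_wer(segments):
    """segments: iterable of (list_of_reference_strings, hypothesis_string).

    For each segment, pick the reference with the fewest errors
    (ties go to the shorter reference) and pool errors over all segments.
    """
    total_errors = 0
    total_ref_words = 0
    for references, hypothesis in segments:
        if not references:
            raise ValueError("each segment needs at least one reference")
        hyp = hypothesis.split()
        scored = [(edit_distance(ref.split(), hyp), len(ref.split()))
                  for ref in references]
        errors, ref_len = min(scored)
        total_errors += errors
        total_ref_words += ref_len
    if total_ref_words == 0:
        raise ValueError("references contain no words")
    return total_errors / total_ref_words


if __name__ == "__main__":
    data = [
        (["ktb AlwAld", "ktb Alwld"], "ktb Alwld"),  # matches 2nd reference
        (["hw mzyAn"], "hw zwyn"),                   # one substitution
    ]
    print(best_reference_wer(data))
```

Picking the best reference per segment is the coarsest way to use multiple transcriptions; it does not combine spelling variants from different annotators within a single segment.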

In the MGB-5, we released 5 hours for adaptation and 5 hours for development, to explore using them to get better results on dialectal data such as Egyptian comedy. We assume the MGB-5 data is not enough by itself to build a robust Arabic speech recognition system, but that it could be quite useful for adaptation and for hyper-parameter tuning of models built using the MGB-2 data.

Manual Transcription Guidelines for Development and Evaluation data

Split on silence according to the shape of the wave
Segment duration should be more than 3 sec.
Verbatim transcription
Mark hesitation and repetition with # (like: today is# is )
Mark segments with overlapping speech with "###" at the beginning
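The markers in these guidelines can be handled with a few lines of preprocessing before scoring or training. The sketch below uses hypothetical function names and assumes that segments starting with "###" are dropped and that the "#" marker is removed while the hesitated or repeated words themselves are kept; the official scoring pipeline may treat them differently.

```python
OVERLAP_PREFIX = "###"


def is_overlap(segment_text: str) -> bool:
    """True if the transcript marks the segment as overlapping speech."""
    return segment_text.lstrip().startswith(OVERLAP_PREFIX)


def strip_hesitation_marks(segment_text: str) -> str:
    """Remove '#' markers and normalise whitespace, keeping the words."""
    return " ".join(segment_text.replace("#", " ").split())


def scorable_segments(segments):
    """Drop overlap segments and clean the remaining transcripts."""
    return [strip_hesitation_marks(s) for s in segments if not is_overlap(s)]


if __name__ == "__main__":
    raw = ["today is# is sunny", "### two people talking"]
    print(scorable_segments(raw))
```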

Download Instructions

Dialect identification data can be found in this repository.
Speech-to-text data can be found at this link.

Arabic Pronunciation Dictionary

We suggest using a grapheme-based lexicon for this challenge.

The grapheme-based lexicon is a 1:1 word-to-grapheme mapping. You can find the latest version of the lexicon in the GitHub repository.
Participants are encouraged to improve the lexicon and share it.
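A grapheme-based lexicon of this kind can be generated directly from a word list. The following is a minimal sketch, assuming one entry per word with space-separated graphemes on a single line; the exact format of the lexicon in the repository may differ, and the function names and sample words are hypothetical.

```python
def grapheme_lexicon(words):
    """Map each distinct word to the list of its characters (graphemes)."""
    lexicon = {}
    for word in words:
        word = word.strip()
        if word and word not in lexicon:
            lexicon[word] = list(word)
    return lexicon


def format_lexicon(lexicon):
    """Render entries as 'word g1 g2 ...' lines, sorted by word."""
    return [f"{word} {' '.join(graphemes)}"
            for word, graphemes in sorted(lexicon.items())]


if __name__ == "__main__":
    vocab = ["ktAb", "byt", "ktAb"]  # illustrative Buckwalter-style tokens
    for line in format_lexicon(grapheme_lexicon(vocab)):
        print(line)
```

Because every word maps to exactly one grapheme sequence, new vocabulary can be added without any pronunciation modelling, which suits dialectal text with variable spelling.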

If you prefer to use a phoneme-based lexicon, you can get it from the QCRI web portal here.

Scripts and Recipe

You can find the MGB-5 ASR baseline system here.

Scoring Scripts

For speech-to-text scoring, sclite will be used. There is an open-source Global Language Mapping (GLM) file to be used in evaluation. For any issue you wish to raise or share with others, please go to the issue page on GitHub to contribute. Multi-Reference Word Error Rate (MR-WER) will be considered, to leverage the multiple annotations when evaluating dialectal ASR for a language with no orthographic rules. We also welcome ideas for multi-reference ASR evaluation.
For the Arabic dialect identification task, precision and recall will be reported, and the overall accuracy will be used for the final evaluation.
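For the dialect identification metrics, a minimal sketch of overall accuracy with per-dialect precision and recall is shown here. It is not the official scoring script; the function name is hypothetical and the three-letter labels are illustrative placeholders rather than the actual ADI17 label set.

```python
from collections import Counter


def dialect_metrics(references, predictions):
    """Return (accuracy, {label: (precision, recall)}) for paired label lists."""
    if len(references) != len(predictions):
        raise ValueError("references and predictions must have equal length")
    if not references:
        raise ValueError("no samples to score")

    ref_counts = Counter(references)
    pred_counts = Counter(predictions)
    correct = Counter(r for r, p in zip(references, predictions) if r == p)

    accuracy = sum(correct.values()) / len(references)
    per_dialect = {}
    for label in sorted(set(ref_counts) | set(pred_counts)):
        # Precision: correct / predicted as label; recall: correct / true label.
        precision = correct[label] / pred_counts[label] if pred_counts[label] else 0.0
        recall = correct[label] / ref_counts[label] if ref_counts[label] else 0.0
        per_dialect[label] = (precision, recall)
    return accuracy, per_dialect


if __name__ == "__main__":
    refs = ["EGY", "EGY", "MOR", "QAT"]
    hyps = ["EGY", "MOR", "MOR", "QAT"]
    acc, scores = dialect_metrics(refs, hyps)
    print(acc, scores)
```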

Organizers

Ahmed Ali, Younes Samih, Ahmed Abdel Ali, Hamdy Mubarak (Qatar Computing Research Institute)
Suwon Shon, James Glass (MIT)
Steve Renals, Peter Bell (University of Edinburgh)
Khalid Choukri (ELDA)