MGB Challenge

The challenge

The third edition of the Multi-Genre Broadcast (MGB-3 speech recognition challenge in the wild) is an evaluation of speech recognition and five Arabic dialect identification using youtube recordings in dialectal Arabic.

The MGB-3 is using 16 hours multi-genre data collected from different YouTube channels.

In 2017, the challenge featured two new Arabic tracks based on TV data from Aljazeera as well as youtube recordings. It was an official challenge at the 2017 IEEE Automatic Speech Recognition and Understanding Workshop.

Background

The Arabic track for the 2017 multi-dialect multi-genre evaluation (speech recognition in the wild) is an extension for the 2016 evaluation (MGB-2).

In addition to the 1,200 hours used in 2016 from Aljazeera TV programs, the MGB-3 explores multi-genre data; comedy, cooking, cultural, environment, family-kids, fashion, movies-drama, sports and science talks (TEDX).

The MGB-3 is using 16 hours multi-genre data collected from different YouTube channels. The 16 hours have been manually transcribed. The chosen Arabic dialect for this year is Egyptian. Given that dialectal Arabic has no orthographic rules, each program has been transcribed by four different transcribers using this transcription guidelines. The MGB-3 data is split into three groups; adaptation, development and evaluation data.

Data

Egyptian broadcast data collected from YouTube.

This year, we collected about 80 programs from different YouTube channels. The first 12 minutes from each program has been transcribed and released. This sums up to roughly 16 hours in total divided as follow:

Adaptation: 12 minutes * 24 programs.
Development: 12 minutes * 24 programs .
Evaluation: 12 minutes * 31 programs

All programs have been transcribed by four different annotators to explore the non-orthographic nature of the dialectal Arabic.

Data: Description of the provided data

For each program, we will share the following:

The original raw transcription for the four annotators. The Arabic text in each file will be in UTF8 encoding.
We will also share both the segments and the text file for the transcription in Buckwalter transliteration format. We follow the standard kaldi data format.

A sample of a audio file from the daptation data is available. There is also a corresponding raw UTF-8 transcription, and buckwalter transcription. Finally the segments information.

Evaluation tasks

Particpants can enter any of two tasks:

Speech-to-text transcription of broadcast data
Arabic Dialect Identification of Arabic audio. For this task, we are releasing 10 hours per dialect. We provide data for five Arabic dialects: Egyptian (EGY), Levantine (LAV), Gulf (GLF), North African (NOR), and Modern Standard Arabic (MSA).The data comes from broadcast news.

Tasks are described in more detail below. Each task has one or more primary evaluation conditions and possibly a number of contrastive conditions. To enter a task, participants must submit at least one system which fulfils the primary evaluation conditions. Note that signing the MGB challenge data license requires you to participate in at least one task.

Scoring tools for all tasks will be available on Github repository. We will release the multi-reference word error rate (MR-WER) code to evaluate the MGB-3 using multiple transcriptions.

Rules for all tasks

Only audio data and language model data supplied by the organisers can be used for transcription and alignment tasks. All metadata supplied with training data can be used.
Any lexicon can be used.

Transcription

This is a standard speech transcription task operating on a collection of whole TV shows drawn from diverse genres. Scoring will require ASR output with word-level timings. Segments with overlap speech will be ignored for scoring purposes (where overlap is defined to minimise the regions removed - at segment level where possible). Speaker labels are not required in the hypothesis for scoring. Usual NIST-style mappings will be used to normalise reference/hypothesis. In the MGB-3 competition, we will share the multiple reference word error rate (MR-WER) to explore the non-orthographic aspect in dialectal Arabic for scoring.

In the MGB3, we released 5 hours for adaption and 5 hours for development to explore using them to get better results on dialectal data such as Egyptian comedy. We assume the MGB-3 data is not enough by itself to build robust Arabic Speech recognition system, but could be quite useful for adaptation, and hyper-parameter tuning for models built using the MGB-2 data.

Manual Transcription Guidelines for Development and Evaluation data

Split on silence according to the shape of the wave
Segment duration should be more than 3 sec.
Verbatim transcription
Mark hesitation and repetition with # (like: today is# is )
Mark segments with overlap speech by "###" in the beginning

Arabic Dialect Identification

ADI-5: In this task, participants will be supplied with more than 50 hours labeled for each dialect. This will be divided across the five major Arabic dialects; Egyptian (EGY), Levantine (LAV), Gulf (GLF), North African (NOR), and Modern Standard Arabic (MSA). Participants are encouraged to use the 10 hours per dialect to label more data from both the MGB-2 and MGB-3 data. Dialectal data and baseline code will be shared on QCRI dialect ID Github . The overal accuracy will be used for the evaluation criteria across the five dialects. The test data which will be shared at the evaluation time as shown in the dates section. Participants should specify one dialect for each audio file.

Download Instructions

Dialectal identifcation data can be found in this repository.
speech-to-text data can be found in this link.

You will need to register first to be able to download the MGB-2 corpus.

Arabic Pronunciation Dictionary

We suggest using a grapheme-based lexicon for this challenge.

The grapheme-based lexicon is 1:1 word to grapheme mapping. You can find the latest version of the lexicon in GitHub repository.
Participants are encouraged to improve the lexicon and share it.

If you prefer to use phoneme-based lexicon, you can get it from the QCRI web portal here.

Scripts and Receipe

The following recipe reflects the JHU system for the MGB-2 data.

Scoring Scripts

For speech-to-text scoring, sclite will be used. There is an open-source Global Language Mapping (GLM) file to be used in evaluation. For any issue you wish to raise or share with others, please go to the issue page on github to contribute. Multi-Reference Word Error Rate (MR-WER) will be considered to leverage the multiple annotation for evaluating dialectal ASR for language with no orthographic rules. We also welcome ideas for ASR Multi reference evaluation.
For Arabic dialect identifcation task; preciosn and recall will reported and the overal accuracy will be used for the final evaluation .

Organizers

Ahmed Ali, Stephan Vogel (Qatar Computing Research Institute)
Steve Renals (University of Edinburgh)