The challenge

The second edition of the Multi-Genre Broadcast (MGB-2) Challenge is an evaluation of speech recognition and lightly supervised alignment using TV recordings in Arabic.

The speech data is broad and multi-genre, spanning the whole range of TV output, and represents a challenging task for speech technology.

In 2016, the challenge featured two new Arabic tracks based on TV data from Aljazeera. It was an official challenge at the 2016 IEEE Workshop on Spoken Language Technology.

Background

The MGB-2 data comprises 1,200 hours of Aljazeera TV programs that have been manually captioned, with no timing information. The QCRI Arabic ASR system was used to recognize all programs, and the ASR output was used to align the manual captions and produce speech segments for training speech recognition. In addition, more than 20 hours from 2015 programs have been transcribed verbatim and manually segmented; this data is split into a 10-hour development set and a similar 10-hour evaluation set. Both the development and evaluation data were released in the 2016 MGB challenge.
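
As a rough illustration of this lightly supervised alignment idea (not the actual QCRI pipeline), the sketch below matches time-stamped ASR words against the untimed caption words and keeps long matching runs as training segments; the function name and the `min_run` threshold are illustrative assumptions.

```python
# Illustrative sketch of lightly supervised alignment (not the actual
# MGB-2 pipeline): match time-stamped ASR words against the untimed
# caption words and keep well-matched runs as training segments.
from difflib import SequenceMatcher

def align_captions(asr_words, caption_words, min_run=5):
    """asr_words: list of (word, start, end) tuples from the ASR output.
    caption_words: list of words from the manual caption (no timing).
    Returns (start, end, text) segments where ASR and caption agree."""
    hyp = [w for w, _, _ in asr_words]
    matcher = SequenceMatcher(a=hyp, b=caption_words, autojunk=False)
    segments = []
    for block in matcher.get_matching_blocks():
        if block.size < min_run:        # skip short, unreliable matches
            continue
        run = asr_words[block.a:block.a + block.size]
        text = " ".join(caption_words[block.b:block.b + block.size])
        segments.append((run[0][1], run[-1][2], text))
    return segments
```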

Data

Data provided includes:

Metadata for each program, including title, genre tag, and date/time of transmission. The original set of data for this period contained about 1,500 hours of audio, obtained from all shows; we have removed programmes with damaged aligned transcriptions. The aligned segmented transcription will be shared, as well as the original raw transcription (which has no time information).

Description of the provided data

For each program, we will share the audio, the original raw transcription (with no timing information), and the aligned segmented transcription.

A sample audio file from the training data is available, together with its corresponding raw transcription and aligned segmented transcription.
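
As an illustration only, the following sketch reads an aligned segmented transcription assumed to be XML with `segment` elements carrying `starttime`/`endtime` attributes; these element and attribute names are assumptions for the example and may not match the released schema exactly.

```python
# Hypothetical loader for an aligned segmented transcription. The element
# and attribute names ("segment", "starttime", "endtime") are assumptions
# used only for illustration and may differ from the released XML schema.
import xml.etree.ElementTree as ET

def load_segments(xml_path):
    """Yield (start_sec, end_sec, text) tuples for one program."""
    root = ET.parse(xml_path).getroot()
    for seg in root.iter("segment"):
        start = float(seg.get("starttime"))
        end = float(seg.get("endtime"))
        words = " ".join(seg.itertext()).split()  # all text inside the segment
        yield start, end, " ".join(words)
```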

Evaluation tasks

The MGB-2 challenge has one task: speech-to-text transcription of broadcast data. The task is described in more detail below. It has one primary evaluation condition and possibly a number of contrastive conditions.

Scoring tools for the speech-to-text task are available in a GitHub repository.

Rules for all tasks
Transcription

This is a standard speech transcription task operating on a collection of whole TV shows drawn from diverse genres. Scoring will require ASR output with word-level timings. Segments with overlapping speech will be ignored for scoring purposes (where overlap is defined so as to minimise the regions removed, at segment level where possible). Speaker labels are not required in the hypothesis for scoring. The usual NIST-style mappings will be used to normalise the reference and hypothesis transcripts. In the MGB-3 competition, we will share the multiple reference word error rate (MR-WER) to explore the non-orthographic aspects of dialectal Arabic in scoring.
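
As a rough illustration of the multi-reference idea, the sketch below scores a hypothesis against every available reference and keeps the lowest WER; the official MR-WER definition and text normalisation live in the challenge scoring tools, so treat the functions here as illustrative only.

```python
# A minimal sketch of one interpretation of multi-reference WER: score the
# hypothesis against every reference and keep the lowest WER. The official
# MR-WER definition may differ; see the challenge scoring tools.
def edit_distance(ref, hyp):
    """Levenshtein distance between two word lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, 1):
            cur[j] = min(prev[j] + 1,             # deletion
                         cur[j - 1] + 1,          # insertion
                         prev[j - 1] + (r != h))  # substitution or match
        prev = cur
    return prev[-1]

def mr_wer(references, hypothesis):
    """Lowest WER of the hypothesis against any of the reference strings."""
    return min(edit_distance(r.split(), hypothesis.split()) / len(r.split())
               for r in references)
```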

For the evaluation data, show titles and genre labels will be supplied. Some titles will have appeared in the training data, and some will be new. All genre labels will have been seen in the training data. The supplied title and genre information can be used as much as desired. Other metadata present in the development data will not be supplied for the evaluation data, but this does not preclude, for example, using development-set metadata to infer properties of shows with the same title in the evaluation data.
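
One permissible use of this rule could look like the following sketch, which indexes development-set metadata by show title so that properties observed there can be looked up for evaluation shows with the same title; the dictionary field name ("title") is hypothetical.

```python
# Hypothetical illustration of the metadata rule above: index development-set
# metadata by show title so that properties observed there can be looked up
# for evaluation shows with the same title. Field names are assumptions.
from collections import defaultdict

def index_by_title(dev_metadata):
    """dev_metadata: iterable of dicts, one per development program."""
    by_title = defaultdict(list)
    for program in dev_metadata:
        by_title[program["title"]].append(program)
    return by_title
```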

There will be shared speakers across training and evaluation data. It is possible for participants to automatically identify these themselves and make use of the information. However, each program in the evaluation set should be processed independently.

Download Instructions

You will need to register and receive instructions by email before you can download the MGB-2 corpus. Any use of this data is bound by the QCRI-Aljazeera data license agreement, which must be signed. See the ArabicSpeech MGB-2 page for details of how to download the data.

Arabic Pronunciation Dictionary

We suggest using a grapheme-based lexicon for this challenge.

If you prefer to use a phoneme-based lexicon, you can obtain it from the QCRI web portal here.
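
For the suggested grapheme-based option, a lexicon can be generated directly from the training vocabulary, since each word's "pronunciation" is simply its character sequence. The sketch below illustrates this; the optional diacritic stripping is an assumption for the example, not a challenge requirement.

```python
# A minimal sketch of a grapheme-based lexicon: each word is mapped to its
# own character sequence, so no phonetic dictionary is needed. Stripping
# diacritics is shown as an optional assumption, not a challenge requirement.
ARABIC_DIACRITICS = set("\u064b\u064c\u064d\u064e\u064f\u0650\u0651\u0652")

def grapheme_lexicon(words):
    """Map each word to a space-separated list of its graphemes."""
    lexicon = {}
    for word in words:
        letters = [c for c in word if c not in ARABIC_DIACRITICS]
        lexicon[word] = " ".join(letters)
    return lexicon

# Example: print Kaldi-style lexicon lines (word followed by its graphemes).
if __name__ == "__main__":
    for word, pron in grapheme_lexicon(["كتاب", "مدرسة"]).items():
        print(f"{word} {pron}")
```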

Scripts and Recipe

Access to a GitHub repository with a software package is provided so that research groups not familiar with Arabic ASR can get started quickly. The repository comes with no guarantees or responsibility, but feel free to email us or ask for write access to the repository.

An Arabic Kaldi recipe is available on the Kaldi website. It has a similar architecture to the best MGB-2 system; however, the shared code uses the gale_arabic data. It should be easy to adapt the shared recipe to reproduce last year's best results.

The following recipe reflects the JHU system for the MGB-2 data.

Scoring Scripts

For speech-to-text scoring, sclite will be used. An open-source Global Language Mapping (GLM) file will be used in the evaluation.
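
As a minimal sketch (assuming sclite is on your PATH and the reference and hypothesis are in NIST STM and CTM formats), scoring could be driven as follows; the file names are placeholders, and any GLM normalisation is assumed to have been applied beforehand, e.g. via the challenge scoring tools.

```python
# Minimal sketch of driving sclite from Python. Assumes sclite is installed
# and on PATH; ref.stm / hyp.ctm are placeholder file names, and the GLM
# normalisation is assumed to have been applied before scoring.
import subprocess

def score(ref_stm, hyp_ctm):
    """Run sclite on an STM reference and CTM hypothesis, return the report."""
    result = subprocess.run(
        ["sclite", "-r", ref_stm, "stm", "-h", hyp_ctm, "ctm",
         "-o", "sum", "stdout"],
        check=True, capture_output=True, text=True)
    return result.stdout

if __name__ == "__main__":
    print(score("ref.stm", "hyp.ctm"))
```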

Organizers