Background

The MGB Challenge Arabic track is new for the 2016 evaluation.

The Arabic audio data is taken from Aljazeera TV programmes broadcast between 2005 and 2015. All programmes have been manually captioned. The total amount of released data is about 1,200 hours.

The manual captions carry no timing information. The QCRI Arabic ASR system was used to recognise all programmes, and its output was aligned with the manual captions to produce speech segments suitable for training speech recognition systems. In addition, more than 20 hours from 2015 programmes have been transcribed verbatim and manually segmented. This data is split into a 10-hour development set, available now, and a similar 10-hour evaluation set whose audio will be released during the evaluation period.
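
The alignment step can be pictured as matching the timed ASR hypothesis against the untimed caption text and transferring timings where the two agree. The sketch below illustrates this idea with Python's difflib; it is an illustration only, not the QCRI pipeline, and all data in it is made up.

```python
import difflib

def lightly_supervised_align(asr_words, caption_words):
    """Transfer ASR word timings onto caption tokens where the two agree.

    asr_words:     list of (word, start, end) tuples from the recogniser
    caption_words: list of caption tokens (no timing information)
    Returns (word, start, end) tuples for caption tokens the ASR confirmed.
    """
    matcher = difflib.SequenceMatcher(
        a=[w for w, _, _ in asr_words], b=caption_words, autojunk=False)
    aligned = []
    for block in matcher.get_matching_blocks():
        for i in range(block.size):
            _, start, end = asr_words[block.a + i]
            aligned.append((caption_words[block.b + i], start, end))
    return aligned

# Illustrative input: the ASR output anchors timings, the caption supplies words.
asr = [("kalima1", 0.00, 0.35), ("kalima2", 0.35, 0.80), ("kalima3", 0.80, 1.20)]
caption = ["kalima1", "kalima2", "kalima3"]
print(lightly_supervised_align(asr, caption))
```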

Data

Data provided includes:
  • Approximately 1,200 hours of Arabic broadcast data, obtained from about 4,000 programmes broadcast on the Aljazeera Arabic TV channel over a span of ten years, from 2005 until September 2015.
  • Time-aligned transcriptions produced by lightly supervised alignment; the quality of the underlying human transcription varies from episode to episode.
  • More than 110 million words collected from the Aljazeera.net website between 2004 and 2011.

Metadata for each programme includes title, genre tag, and date/time of transmission. The original set of data for this period contained about 1,500 hours of audio, obtained from all shows; we have removed programmes with damaged aligned transcriptions. Both the aligned segmented transcription and the original raw transcription (which has no time information) will be shared.

Description of the provided data

For each programme, we will share the following:

  • The original raw transcription from Aljazeera, as shown on the Aljazeera website. The Arabic text in each file is in UTF-8 encoding.
  • An XML file including time information for each segment, as well as the title, genre tag, and date/time of transmission of the programme, in Buckwalter transliteration format.

A sample audio file from the training data is available, together with its corresponding raw transcription and aligned segmented transcription.
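
As an illustration of how such a file might be consumed, the sketch below walks the segments with Python's standard ElementTree. The element and attribute names (segment, starttime, endtime) are assumptions made for illustration; the authoritative schema is the one in the released XML files.

```python
import xml.etree.ElementTree as ET

# Element/attribute names here are assumptions; consult the released
# XML files for the actual schema.
tree = ET.parse("programme.xml")
for seg in tree.getroot().iter("segment"):
    start = float(seg.get("starttime"))
    end = float(seg.get("endtime"))
    # Segment text is in Buckwalter transliteration, per the description above.
    text = (seg.text or "").strip()
    print(f"{start:8.2f} {end:8.2f} {text}")
```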

Evaluation tasks

Participants can enter either or both of the following tasks:

  1. Speech-to-text transcription of broadcast television
  2. Alignment of broadcast audio to a subtitle file, i.e. lightly supervised alignment, at the level of a whole show.

Tasks are described in more detail below. Each task has one or more primary evaluation conditions and possibly a number of contrastive conditions. To enter a task, participants must submit at least one system which fulfils the primary evaluation conditions. Note that signing the MGB Challenge data licence requires you to participate in at least one task.

Scoring tools for all tasks will be available in a GitHub repository.

Rules for all tasks
  • Only audio data and language model data supplied by the organisers can be used for transcription and alignment tasks. All metadata supplied with training data can be used.

  • Any lexicon can be used.

Transcription

This is a standard speech transcription task operating on a collection of whole TV shows drawn from diverse genres. Scoring will require ASR output with word-level timings. Segments with overlapping speech will be ignored for scoring purposes (where overlap is defined so as to minimise the regions removed, at segment level where possible). Speaker labels are not required in the hypothesis for scoring. The usual NIST-style mappings will be used to normalise the reference and hypothesis.
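
The text above does not fix an output file format, but NIST-style scoring with word-level timings is conventionally carried in CTM files (one word per line: recording id, channel, start time, duration, word). A minimal sketch of writing such a file, assuming hypotheses arrive as (word, start, duration) tuples; participants should of course follow whatever format the released scoring tools expect.

```python
def write_ctm(path, recording_id, words, channel="1"):
    """Write word hypotheses in the standard NIST CTM layout:
    <recording-id> <channel> <start-time> <duration> <word>
    """
    with open(path, "w", encoding="utf-8") as out:
        for word, start, dur in sorted(words, key=lambda w: w[1]):
            out.write(f"{recording_id} {channel} {start:.2f} {dur:.2f} {word}\n")

# Illustrative hypothesis for one show (made-up words and timings).
write_ctm("show1.ctm", "show1", [("kalima1", 0.00, 0.35), ("kalima2", 0.35, 0.45)])
```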

For the evaluation data, show titles and genre labels will be supplied. Some titles will have appeared in the training data, some will be new. All genre labels will have been seen in the training data. The supplied title and genre information can be used as much as desired. Other metadata present in the development data will not be supplied for the evaluation data, but this does not preclude, for example, using development-set metadata to infer properties of shows with the same title in the evaluation data.
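
Carrying development-set properties over to evaluation shows with the same title, as mentioned above, could be as simple as the following sketch. The records here are made-up dictionaries; real metadata comes from the released XML.

```python
from collections import defaultdict

def index_by_title(dev_metadata):
    """Group development-set metadata records by show title."""
    by_title = defaultdict(list)
    for record in dev_metadata:
        by_title[record["title"]].append(record)
    return by_title

# Made-up records for illustration only.
dev = [{"title": "ProgrammeA", "genre": "news"},
       {"title": "ProgrammeB", "genre": "debate"}]
lookup = index_by_title(dev)
print(lookup.get("ProgrammeA", []))  # properties reusable for an eval show of the same title
```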

There will be shared speakers across training and evaluation data. It is possible for participants to automatically identify these themselves and make use of the information. However, each programme in the evaluation set should be processed independently.

Alignment

In this task, participants will be supplied with a tokenised version of the speech segments as they have been manually reviewed (this appears in the XML as developmentData in the metadata download). The task is to align these to the spoken audio at word level, where possible. Scoring will be performed by a script (to be supplied to participants shortly) that calculates a precision/recall measure for each spoken word, derived from a careful manual transcription. Participants should supply output for exactly the words in the speech segments, with word timings added.
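
Until the official script is released, the flavour of such a measure can be sketched as below. The matching criterion used here (word identity plus boundaries within a small collar) is an assumption for illustration only; the supplied scorer defines the real metric.

```python
def word_alignment_pr(hyp, ref, collar=0.1):
    """Toy precision/recall over (word, start, end) alignments.

    A hypothesis word counts as a hit if the same word appears in the
    reference with both boundaries within `collar` seconds. This criterion
    is illustrative; the official MGB scorer defines the actual metric.
    """
    unmatched = list(ref)
    hits = 0
    for word, start, end in hyp:
        for i, (rword, rstart, rend) in enumerate(unmatched):
            if word == rword and abs(start - rstart) <= collar and abs(end - rend) <= collar:
                hits += 1
                del unmatched[i]  # each reference word can be matched once
                break
    precision = hits / len(hyp) if hyp else 0.0
    recall = hits / len(ref) if ref else 0.0
    return precision, recall
```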