Data

In-domain audio and text data for training acoustic and language models will be provided to each participant in the challenge. At the BBC's discretion, participants will be granted a licence to use the data for non-commercial purposes following completion of the challenge.

Data provided includes:

  • Approximately 500 hours of broadcast audio taken from seven weeks of BBC output across all TV channels
  • Captions as originally broadcast on TV, accompanied by baseline lightly-supervised alignments using an ASR system, with confidence measures
  • Several hundred million words of subtitle text from BBC TV output collected over a 15-year period
  • A hand-compiled British English lexicon derived from Combilex

For each evaluation condition, a hand-transcribed development set will be provided, containing 8-15 hours of speech, along with scoring scripts, and, for the speech-to-text transcription task, a Kaldi recipe. In addition to the full seven-week acoustic model training set, a smaller one-week set is defined for quick turnaround of experiments.

The data is available to challenge participants only and subject to the terms of a licence agreement with the BBC.

Evaluation tasks

Participants in the challenge can enter any of the following tasks:

  1. Speech-to-text transcription of broadcast television
  2. Alignment of broadcast audio to a subtitle file, i.e. lightly supervised alignment

Tasks are described in more detail below. Each task has one or more primary evaluation conditions and possibly a number of contrastive conditions. To enter a task, participants must submit at least one system which fulfils the primary evaluation conditions. Note that signing the MGB challenge data licence requires you to participate in at least one task.

Scoring tools for all tasks will be available shortly.

Rules for all tasks

  • Only audio data (train.full) and language model data (mgb.stripped.lm, mgb.normalised.lm) supplied by the organisers can be used for transcription and alignment tasks. All metadata supplied with training data can be used.

  • Any lexicon can be used.

Transcription

This is a standard speech transcription task operating on a collection of whole TV shows drawn from diverse genres. Scoring will require ASR output with word-level timings. Segments with overlapping speech will be ignored for scoring purposes (overlap is defined so as to minimise the regions removed, at word level where possible). Speaker labels are not required in the hypothesis for scoring. The usual NIST-style mappings will be used to normalise the reference and hypothesis.
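
As an illustration only, the sketch below writes hypothesis words with word-level timings in the NIST CTM format (file, channel, start time, duration, word, optional confidence), a common representation for scorable ASR output. The show identifier, channel and values are placeholders, and the exact format expected by the official scoring scripts is an assumption until they are released.

    # Sketch: write ASR hypothesis words with word-level timings in CTM format.
    # Each line is: <file-id> <channel> <start-seconds> <duration-seconds> <word> [<confidence>]
    # "show1" and channel "1" are placeholder identifiers, not challenge-defined names.
    hyp_words = [
        (12.34, 0.41, "good", 0.93),     # (start, duration, word, confidence) -- illustrative values
        (12.75, 0.52, "evening", 0.88),
    ]
    with open("show1.ctm", "w") as f:
        for start, dur, word, conf in hyp_words:
            f.write(f"show1 1 {start:.2f} {dur:.2f} {word} {conf:.2f}\n")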

For the evaluation data, show titles and genre labels will be supplied. Some titles will have appeared in the training data, some will be new. All genre labels will have been seen in the training data. The supplied title and genre information can be used as much as desired. Other metadata present in the development data will not be supplied for the evaluation data, but this does not preclude, for example, using metadata from the development set to infer properties of shows with the same title in the evaluation data.

There may be shared speakers across the training and evaluation data. Participants may automatically identify these themselves and make use of the information. However, each show in the evaluation set should be processed independently, i.e. it will not be possible to link speakers across shows.

Systems for speech/silence segmentation must be trained only on the official training set. A baseline speech/silence segmentation and speaker clustering for the evaluation data will be supplied for participants who do not wish to build these systems. Any speaker clustering supplied will not link speakers between the training, development and evaluation sets.

Alignment

In this task, participants will be supplied with a tokenised version of the subtitles as they originally appeared on TV (this appears as transcript_orig in the XML metadata download). The task is to align these to the spoken audio at word level, where possible. Scoring will be performed by a script (to be supplied to participants shortly) that calculates a precision/recall measure for each spoken word, derived from a careful manual transcription. Participants should supply output containing exactly the words in the subtitles, with word timings added.
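
To make the measure concrete, here is a rough sketch of how a precision/recall score over aligned words might be computed, assuming each word is represented as (token, start, end) in seconds and a hypothesis word counts as correct when the same token overlaps an unmatched reference word in time. The matching criterion used by the official scoring script may differ; this only illustrates the shape of the metric.

    # Sketch: precision/recall over word alignments.
    # hyp and ref are lists of (token, start_seconds, end_seconds).
    # A hypothesis word is counted as correct if the same token overlaps an
    # as-yet-unmatched reference word in time (an assumed criterion; the
    # official script defines the real one).
    def overlaps(a, b):
        return max(a[1], b[1]) < min(a[2], b[2])

    def precision_recall(hyp, ref):
        matched_ref = set()
        correct = 0
        for h in hyp:
            for i, r in enumerate(ref):
                if i not in matched_ref and h[0] == r[0] and overlaps(h, r):
                    matched_ref.add(i)
                    correct += 1
                    break
        precision = correct / len(hyp) if hyp else 0.0
        recall = len(matched_ref) / len(ref) if ref else 0.0
        return precision, recall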

It should be noted that TV captioning often differs from the actual spoken words for a variety of reasons: there may be edits to enhance clarity, paraphrasing, and deletions where the speech is too fast. There may also be words in the captions that do not appear in the reference. It will be possible for participants to flag words they have identified as not actually spoken. This will not affect the scoring explicitly, since scoring uses precision/recall over words actually present in the reference.

As in the transcription task, it will be possible to make use of the show title and genre labels, and of any automatic speaker labelling across shows that participants choose to generate. Speaker change information from the subtitles will also be supplied.