This page describes the evaluation conditions for the 2015 MGB Challenge for English TV. It will be updated for the 2017 challenge.

Data

In-domain audio and text data for training acoustic and language models will be provided to each participant in the challenge. Subject to the BBC's discretion, participants will be granted a license to use the data for non-commercial purposes following completion of the challenge.

Data provided includes:

  • Approximately 1,600 hours of broadcast audio taken from seven weeks of BBC output across all TV channels
  • Captions as originally broadcast on TV, accompanied by baseline lightly-supervised alignments using an ASR system, with confidence measures.
  • Several hundred million words of subtitle text from BBC TV output collected over a 15-year period.
  • A hand-compiled British English lexicon derived from Combilex
  • New for the 2016 Challenge: 10 hours of training data manually annotated with unique speaker labels, intended as seed data for training i-vector systems, for example

For each evaluation condition, a hand-transcribed development set will be provided, containing 8-15 hours of speech, along with scoring scripts, and, for the speech-to-text transcription task, a Kaldi recipe. In addition to the full seven-week acoustic model training set, a smaller one-week set is defined for quick turnaround of experiments.

The data is available to challenge participants only and subject to the terms of a licence agreement with the BBC.

Evaluation tasks

Participants in the challenge can enter any of the four tasks:

  1. Speech-to-text transcription of broadcast television
  2. Alignment of broadcast audio to a subtitle file, i.e. lightly supervised alignment
  3. Longitudinal speech-to-text transcription of a sequence of episodes from the same series of shows
  4. Longitudinal speaker diarization and linking, requiring the identification of common speakers across multiple recordings

Tasks are described in more detail below. Each task has one or more primary evaluation conditions and possibly a number of contrastive conditions. To enter a task, participants must submit at least one system which fulfils the primary evaluation conditions. Note that signing the MGB challenge data license requires you to participate in at least one task.

Scoring tools for all tasks will be available shortly.

Rules for all tasks

  • Only audio data (train.full) and language model data (mgb.stripped.lm, mgb.normalised.lm) supplied by the organisers can be used for transcription and alignment tasks. All metadata supplied with training data can be used.
  • Any lexicon can be used.

Transcription

This is a standard speech transcription task operating on a collection of whole TV shows drawn from diverse genres. Scoring will require ASR output with word-level timings. Segments with overlap will be ignored for scoring purposes (where overlap is defined to minimise the regions removed - at word level where possible). Speaker labels are not required in the hypothesis for scoring. Usual NIST-style mappings will be used to normalise reference/hypothesis.
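
As a rough illustration of the kind of measure involved, the following is a minimal sketch of a word error rate computation. It is not the official scoring script: the function name is hypothetical, lowercasing stands in for the NIST-style normalisation mappings, and the exclusion of overlapped regions and the use of word timings are not modelled.

```python
def word_error_rate(reference, hypothesis):
    """Return (substitutions + deletions + insertions) / number of reference words.

    Assumption: both inputs are plain strings; lowercasing is a simple
    stand-in for the NIST-style normalisation used in real scoring.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[len(hyp)] / len(ref)
```

For example, scoring the hypothesis "the cat sit on mat" against the reference "the cat sat on the mat" gives one substitution and one deletion over six reference words.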

For the evaluation data, show titles and genre labels will be supplied. Some titles will have appeared in the training data, some will be new. All genre labels will have been seen in the training data. The supplied title and genre information can be used as much as desired. Other metadata present in the development data will not be supplied for the evaluation data, but this does not preclude, for example, the use of metadata for the development set to infer properties of shows with the same title in the evaluation data.

There may be shared speakers across training and evaluation data. It is possible for participants to automatically identify these themselves and make use of the information. However, each show in the evaluation set should be processed independently, i.e. it will not be possible to link speakers across shows.

Systems for speech/silence segmentation must be trained only on the official training set. A baseline speech/silence segmentation and speaker clustering for the evaluation data will be supplied for participants who do not wish to build these systems. Any speaker clustering supplied will not link speakers between training/dev/eval sets.

Alignment

In this task, participants will be supplied with a tokenised version of the subtitles as they originally appeared on TV (this appears in the XML as transcript_orig in the metadata download). The task is to align these to the spoken audio at word level, where possible. Scoring will be performed by a script (to be supplied to participants shortly) that calculates a precision/recall measure for each spoken word, derived from a careful manual transcription. Participants should supply output consisting of exactly the words in the subtitles, with word timings added.

It should be noted that TV captioning often differs from the actual spoken words for a variety of reasons: there may be edits to enhance clarity, paraphrasing, and deletions where the speech is too fast. There may be words in the captions not appearing in the reference. It will be possible for participants to indicate words that have been identified as not actually spoken. This won't affect the scoring explicitly, as scoring will use precision/recall of words actually present in the reference.
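
To make the precision/recall idea concrete, here is a minimal sketch under stated assumptions. The matching criterion (same word, both boundaries within a fixed tolerance), the 0.1 second default and the function name are all hypothetical; the official scoring script defines the actual criterion.

```python
def alignment_precision_recall(reference, hypothesis, tolerance=0.1):
    """Score aligned words against a manually timed reference.

    reference, hypothesis: lists of (word, start, end) tuples, times in seconds.
    Assumption: a hypothesis word is correct if an as-yet-unmatched reference
    word has the same identity and both boundaries lie within `tolerance`.
    """
    unmatched = list(reference)
    matches = 0
    for word, start, end in hypothesis:
        for k, (ref_word, ref_start, ref_end) in enumerate(unmatched):
            if (word == ref_word
                    and abs(start - ref_start) <= tolerance
                    and abs(end - ref_end) <= tolerance):
                matches += 1
                del unmatched[k]  # each reference word can be matched once
                break
    precision = matches / len(hypothesis) if hypothesis else 0.0
    recall = matches / len(reference) if reference else 0.0
    return precision, recall
```

In this sketch a caption word that was never spoken (and so has no reference entry) lowers precision if it is given a timing, which is one reason a system might choose to flag such words rather than align them.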

As in the transcription task, it will be possible to make use of the show title and genre labels, and any automatic speaker labelling across shows that participants choose to generate. Speaker change markers from the subtitles will also be supplied.

Longitudinal transcription

This task aims to evaluate ASR in a realistic longitudinal setting – processing complete TV series, where the output from shows broadcast earlier may be used to adapt and enhance the performance of later shows. The evaluation data will consist of a collection of TV series with title and genre labels.

Initial models should be trained on the same data as for the standard transcription task. Systems must process each series in strict broadcast order, producing output for each show using only the initial models, and optionally, adaptation data from shows that have gone before. Transcriptions for all shows in the series should be submitted.
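
The ordering constraint can be sketched as a simple processing loop. This is only an illustration: `decode` and `adapt` are hypothetical callables standing in for a participant's own recogniser and adaptation procedure, and the `id` and `broadcast_time` fields are assumed names, not part of the supplied metadata format.

```python
def transcribe_series(shows, initial_model, decode, adapt):
    """Process one series in strict broadcast order.

    shows: list of dicts with (assumed) keys 'id' and 'broadcast_time'.
    decode(model, show) -> transcript          (hypothetical recogniser)
    adapt(initial_model, history) -> model     (hypothetical adaptation step)
    """
    ordered = sorted(shows, key=lambda s: s["broadcast_time"])
    history = []   # (show, transcript) pairs from earlier shows only
    outputs = {}
    for show in ordered:
        # Adaptation may only see shows that were broadcast before this one.
        model = adapt(initial_model, history) if history else initial_model
        transcript = decode(model, show)
        outputs[show["id"]] = transcript
        history.append((show, transcript))
    return outputs
```

The point of the structure is that the transcript for a given show never depends on any show broadcast after it, while a transcript is still produced for every show in the series.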

Speaker diarization and linking

This task evaluates speaker diarization in a longitudinal setting, across multiple shows from the same series. Systems should aim to label speakers uniquely across the whole series. Speaker labels for each show should be obtained using only material from the show in question, and those broadcast earlier in time.
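
One possible way to respect this constraint is sketched below. It assumes that per-show diarization has already produced one embedding vector per local speaker (for instance an averaged i-vector), and links each local speaker to a series-wide label by cosine similarity against speakers seen in earlier shows. The function names, the label format and the 0.8 threshold are hypothetical choices for illustration, not part of the challenge definition.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def link_speakers(shows, threshold=0.8):
    """Assign series-wide speaker labels, show by show, in broadcast order.

    shows: list (already in broadcast order) of dicts mapping a local
    speaker label to an embedding (list of floats).
    Returns one dict per show mapping local label -> series-wide label.
    """
    known = {}    # series-wide label -> embedding, from earlier shows only
    results = []
    for clusters in shows:
        mapping = {}
        new = {}  # speakers first seen in the current show
        for local, emb in clusters.items():
            best, best_score = None, threshold
            for label, ref in known.items():
                score = cosine(emb, ref)
                # A series-wide label is used at most once within a show.
                if score >= best_score and label not in mapping.values():
                    best, best_score = label, score
            if best is None:
                best = f"spk{len(known) + len(new) + 1}"
                new[best] = emb
            mapping[local] = best
        known.update(new)  # only now visible to later shows
        results.append(mapping)
    return results
```

Because `known` is updated only after a show has been fully labelled, the labels for each show depend solely on that show and on those broadcast before it.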

We propose that participants should not be able to use external sources of training data in their diarization systems (for example, for building i-vector extractors). However, this is open for consultation, so get in touch if you have an opinion on this question.

Scoring: please refer to this document for the definition of the scoring setup for diarization; this zip contains v1.0 of the diarization scoring script.