This page describes the evaluation conditions for the 2015 MGB Challenge for English TV. It will be updated for the 2017 challenge.
In-domain audio and text data for training acoustic and language models will be provided to each participant in the challenge. Subject to the BBC's discretion, participants will be granted a license to use the data for non-commercial purposes following completion of the challenge.
Data provided includes:
For each evaluation condition, a hand-transcribed development set will be provided, containing 8-15 hours of speech, along with scoring scripts, and, for the speech-to-text transcription task, a Kaldi recipe. In addition to the full seven-week acoustic model training set, a smaller one-week set is defined for quick turnaround of experiments.
The data is available to challenge participants only and is subject to the terms of a licence agreement with the BBC.
Participants in the challenge can enter any of the four tasks:
Scoring tools for all tasks will be available shortly.
This is a standard speech transcription task operating on a collection of whole TV shows drawn from diverse genres. Scoring will require ASR output with word-level timings. Segments with overlap will be ignored for scoring purposes (where overlap is defined to minimise the regions removed - at word level where possible). Speaker labels are not required in the hypothesis for scoring. Usual NIST-style mappings will be used to normalise reference/hypothesis.
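Since scoring requires ASR output with word-level timings, output would typically be supplied in something like the NIST CTM format (one word per line, with start time and duration). A minimal sketch follows; the show identifier, channel number, and timings are illustrative placeholders, not the official submission format.

```python
# Sketch: writing ASR hypotheses with word-level timings in a
# CTM-style format: "file channel start duration word" per line.
# Show ID, channel and times below are illustrative only.

def write_ctm(hypotheses, path):
    """hypotheses: list of (show_id, start_sec, duration_sec, word)."""
    with open(path, "w") as f:
        for show_id, start, dur, word in hypotheses:
            f.write(f"{show_id} 1 {start:.2f} {dur:.2f} {word}\n")

write_ctm([("show_abc", 0.45, 0.30, "hello"),
           ("show_abc", 0.80, 0.55, "world")], "hyp.ctm")
```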
For the evaluation data, show titles and genre labels will be supplied. Some titles will have appeared in the training data; some will be new. All genre labels will have been seen in the training data. The supplied title and genre information can be used as much as desired. Other metadata present in the development data will not be supplied for the evaluation data, but this does not preclude, for example, using metadata from the development set to infer properties of shows with the same title in the evaluation data.
There may be shared speakers across training and evaluation data. Participants may automatically identify these themselves and make use of the information. However, each show in the evaluation set should be processed independently, i.e. it will not be possible to link speakers across shows.
Systems for speech/silence segmentation must be trained only on the official training set. A baseline speech/silence segmentation and speaker clustering for the evaluation data will be supplied for participants who do not wish to build these systems. Any speaker clustering supplied will not link speakers between the training, development and evaluation sets.
In this task, participants will be supplied with a tokenised version of the subtitles as they originally appeared on TV (this appears in the XML as transcript_orig in the metadata download). The task is to align these to the spoken audio at word level, where possible. Scoring will be performed by a script (supplied to participants shortly) that calculates a precision/recall measure for each spoken word, derived from a careful manual transcription. Participants should supply output containing exactly the words in the subtitles, with word timings added.
It should be noted that TV captioning often differs from the actual spoken words for a variety of reasons: there may be edits to enhance clarity, paraphrasing, and deletions where the speech is too fast. There may be words in the captions that do not appear in the reference. It will be possible for participants to flag words that they have identified as not actually having been spoken. This will not affect the scoring explicitly, as scoring will use precision/recall of words actually present in the reference.
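The precision/recall scoring described above could be sketched as follows. This is not the official scoring script: the timing tolerance, matching rule and tuple layout here are all assumptions made for illustration.

```python
# Sketch: precision/recall for word alignment. A hypothesised word
# counts as correct if it matches an unused reference word and its
# start/end times fall within an assumed tolerance. The tolerance
# value and data layout are illustrative, not the official metric.

TOL = 0.1  # seconds; assumed timing tolerance

def align_score(hyp, ref, tol=TOL):
    """hyp, ref: lists of (word, start_sec, end_sec) tuples."""
    matched = 0
    used = set()
    for w, s, e in hyp:
        for i, (rw, rs, re_) in enumerate(ref):
            if i in used:
                continue
            if w == rw and abs(s - rs) <= tol and abs(e - re_) <= tol:
                matched += 1
                used.add(i)
                break
    precision = matched / len(hyp) if hyp else 0.0
    recall = matched / len(ref) if ref else 0.0
    return precision, recall
```

Words present in the captions but absent from the reference simply lower precision under this scheme, which mirrors the statement above that flagged words do not affect scoring explicitly.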
As in the transcription task, it will be possible to make use of the show title and genre labels, and any automatic speaker labelling across shows that participants choose to generate. Speaker change markers present in the subtitles will also be supplied.
This task aims to evaluate ASR in a realistic longitudinal setting – processing complete TV series, where the output from shows broadcast earlier may be used to adapt systems and improve performance on later shows. The evaluation data will consist of a collection of TV series with title and genre labels.
Initial models should be trained on the same data as for the standard transcription task. Systems must process each series in strict broadcast order, producing output for each show using only the initial models, and optionally, adaptation data from shows that have gone before. Transcriptions for all shows in the series should be submitted.
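The strict broadcast-order constraint can be sketched as a simple processing loop: each show is decoded with the current model, and the model may then be adapted using that show's output before the next one is processed. The `transcribe` and `adapt` functions below are hypothetical placeholders for a participant's own system.

```python
# Sketch: longitudinal processing of a series in strict broadcast
# order. Each show is decoded before any adaptation on its output,
# so no show ever benefits from material broadcast later.
# transcribe/adapt are hypothetical stand-ins for a real system.

def process_series(shows, initial_model, transcribe, adapt):
    """shows: show identifiers sorted by broadcast date."""
    model = initial_model
    outputs = []
    for show in shows:
        hyp = transcribe(model, show)    # decode with current model
        outputs.append(hyp)
        model = adapt(model, show, hyp)  # adapt on this show's output
    return outputs
```

Transcriptions for every show in the series are collected and would all be submitted, as required above.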
This task evaluates speaker diarization in a longitudinal setting, across multiple shows from the same series. Systems should aim to label speakers uniquely across the whole series. Speaker labels for each show should be obtained using only material from the show in question, and those broadcast earlier in time.
We propose that participants should not be able to use external sources of training data in their diarization systems (for example, for building i-vector extractors). However, this is open for consultation, so get in touch if you have an opinion on this question.