The first edition of the Multi-Genre Broadcast (MGB-1) Challenge is an evaluation of speech recognition, speaker diarization, and lightly supervised alignment using TV recordings in English.
The speech data is broad and multi-genre, spanning the whole range of TV output, and represents a challenging task for speech technology.
In 2015, the challenge used data from the British Broadcasting Corporation (BBC). It was an official challenge of the 2015 IEEE Automatic Speech Recognition and Understanding Workshop.
In-domain audio and text data for training acoustic and language models will be provided to each participant in the challenge. Subject to the BBC's discretion, participants will be granted a license to use the data for non-commercial purposes following completion of the challenge.
Data provided includes:
For each evaluation condition, a hand-transcribed development set will be provided, containing 8-15 hours of speech, along with scoring scripts, and, for the speech-to-text transcription task, a Kaldi recipe. In addition to the full seven-week acoustic model training set, a smaller one-week set is defined for quick turnaround of experiments.
The data is available to challenge participants only and is subject to the terms of a licence agreement with the BBC.
Participants in the challenge can enter any of the four tasks:
Scoring tools for all tasks will be available shortly.
This is a standard speech transcription task operating on a collection of whole TV shows drawn from diverse genres. Scoring will require ASR output with word-level timings. Segments with overlap will be ignored for scoring purposes (where overlap is defined to minimise the regions removed - at word level where possible). Speaker labels are not required in the hypothesis for scoring. Usual NIST-style mappings will be used to normalise reference/hypothesis.
For the evaluation data, show titles and genre labels will be supplied. Some titles will have appeared in the training data, some will be new. All genre labels will have been seen in the training data. The supplied title and genre information can be used as much as desired. Other metadata present in the development data will not be supplied for the evaluation data, but this does not preclude, for example, the use of metadata for the development set to infer properties of shows with the same title in the evaluation data.
There may be shared speakers across training and evaluation data. It is possible for participants to automatically identify these themselves and make use of the information. However, each show in the evaluation set should be processed independently, i.e. it will not be possible to link speakers across shows.
Systems for speech/silence segmentation must be trained only on the official training set. A baseline speech/silence segmentation and speaker clustering for the evaluation data will be supplied for participants who do not wish to build these systems. Any speaker clustering supplied will not link speakers between training/dev/eval sets.
In this task, participants will be supplied with a tokenised version of the subtitles as they originally appeared on TV (this appears in the XML as transcript_orig in the metadata download). The task is to align these to the spoken audio at word level, where possible. Scoring will be performed by a script (supplied to participants shortly) that calculates a precision/recall measure for each spoken word, derived from a careful manual transcription. Participants should supply output containing exactly the words in the subtitles, with word timings added.
It should be noted that TV captioning often differs from the actual spoken words for a variety of reasons: there may be edits to enhance clarity, paraphrasing, and deletions where the speech is too fast. There may be words in the captions that do not appear in the reference. It will be possible for participants to indicate words that they have identified as not actually being spoken. This will not affect the scoring explicitly, as scoring will use precision/recall of words actually present in the reference.
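For intuition only, the precision/recall idea can be sketched in shell with awk. This is not the official scoring script. The file layout (one "word start end" triple per line, times in seconds), the function name score_alignment, and the matching rule (same word, with start and end times both within a tolerance, each reference word matched at most once) are all assumptions made for illustration.

```sh
# Hypothetical helper: score_alignment REF HYP [TOLERANCE_SECONDS]
# REF and HYP are text files with one "word start end" triple per line.
# Assumes REF is non-empty. The matching rule is an illustrative guess,
# not the rule used by the challenge scoring script.
score_alignment() {
    ref=$1
    hyp=$2
    tol=${3:-0.1}
    awk -v tol="$tol" '
        function abs(x) { return x < 0 ? -x : x }
        # First file: store the reference words and their timings.
        NR == FNR { n++; rw[n] = tolower($1); rs[n] = $2; re[n] = $3; next }
        # Second file: look for an unused reference word that matches.
        {
            nh++
            for (i = 1; i <= n; i++) {
                if (!used[i] && rw[i] == tolower($1) &&
                    abs(rs[i] - $2) <= tol && abs(re[i] - $3) <= tol) {
                    used[i] = 1
                    hits++
                    break
                }
            }
        }
        END {
            printf "precision=%.3f recall=%.3f\n",
                   (nh ? hits / nh : 0), (n ? hits / n : 0)
        }
    ' "$ref" "$hyp"
}
```

Under this rule, caption words that were never spoken lower precision only, which is consistent with the note above that marking them does not change the score directly.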
As in the transcription task, it will be possible to make use of the show title and genre labels, and any automatic speaker labelling across shows that participants choose to generate. Speaker change information from the subtitles will also be supplied.
You will not be able to download any data without first having registered and received instructions by email. Any use of this data is bound by the MGB-Challenge data license agreement which must already have been signed and returned to the MGB-challenge team. See the registration page for details of how to receive an agreement form.
Unzip the download script and use it like this:
sh download.sh <your_username> <your_password> train18.list|dev18.list <your_local directory> wav
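If you want several subsets in one go, the command above can be wrapped in a small loop. The following is a minimal sketch: the function name fetch_subsets and the DOWNLOAD_SCRIPT variable are hypothetical, and the argument order simply mirrors the command shown above. It requests both wav and xml, since metadata is fetched the same way as audio.

```sh
# Hypothetical wrapper: fetch_subsets USER PASSWORD DEST_DIR LIST...
# Calls the download script once per list file and per data type.
# DOWNLOAD_SCRIPT is an assumed variable; it defaults to ./download.sh.
fetch_subsets() {
    user=$1
    pass=$2
    dest=$3
    shift 3
    mkdir -p "$dest" || return 1
    for list in "$@"; do
        for kind in wav xml; do
            sh "${DOWNLOAD_SCRIPT:-download.sh}" \
                "$user" "$pass" "$list" "$dest" "$kind" || return 1
        done
    done
}

# Example (not run here):
#   fetch_subsets <your_username> <your_password> data train18.list dev18.list
```

The loop stops at the first failed download, so a partial run is easy to spot and repeat.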
Audio format is
RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, mono 16000 Hz (~108 MB per hour)
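To estimate disk space before downloading, the raw PCM size follows directly from the format parameters. A minimal sketch, with pcm_bytes_per_hour as a hypothetical helper name, ignoring the small WAV header:

```sh
# Hypothetical helper: pcm_bytes_per_hour SAMPLE_RATE BITS_PER_SAMPLE CHANNELS
# Size in bytes of one hour of uncompressed PCM audio (header not included).
pcm_bytes_per_hour() {
    rate=$1
    bits=$2
    channels=$3
    echo $(( rate * bits / 8 * channels * 3600 ))
}

# 16 kHz, 16 bit, mono:
pcm_bytes_per_hour 16000 16 1   # prints 115200000
```

That is about 115 MB in decimal units, or roughly 110 MiB, which is in the same ballpark as the approximate per-hour figure quoted above.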
Audio downloads should include checksum files. If that's not the case, here is a set of checksum files for the different data subsets.
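The checksum format is not stated here. The sketch below assumes md5sum-style files (one hash and file name per line) stored in the same directory as the audio; if the files use another tool, substitute it accordingly. The function name verify_checksums is hypothetical.

```sh
# Hypothetical helper: verify_checksums DIR CHECKSUM_FILE
# CHECKSUM_FILE is a name relative to DIR.
# Assumes md5sum-format checksum files, which this page does not confirm.
# Returns 0 if every listed file matches, non-zero otherwise.
verify_checksums() {
    dir=$1
    sums=$2
    ( cd "$dir" && md5sum -c "$sums" > /dev/null 2>&1 )
}

# Example (not run here), with a placeholder checksum file name:
#   verify_checksums <your_local_directory> <checksum_file> || echo "re-download needed"
```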
By downloading this Combilex British English Lexicon, you are agreeing to the terms of a limited research license for the MGB Challenge, which may be viewed here.
You can download the metadata in the same way as audio by replacing wav with xml in the command above. Below is a set of zipped archives that will be faster to download.
All XML files contain at least three versions of the transcript:
XML files conform to this schema and can be used with any XML-aware software. The examples here use XMLStarlet. Each example works on a single XML file, but you can run it over all files in a particular data set using a command like:
for file in `cat dev.short` ; do <some action on $file.xml> ; done
All the data subsets are part of the download script zip.
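A slightly more defensive version of that loop is sketched below. It assumes the list file holds one show identifier per line and that the XML files sit in a single directory; for_each_xml is a hypothetical helper name.

```sh
# Hypothetical helper: for_each_xml LIST_FILE XML_DIR COMMAND [ARGS...]
# Runs COMMAND [ARGS...] <XML_DIR>/<id>.xml for every id in LIST_FILE.
# Missing files are reported on stderr and skipped.
for_each_xml() {
    list=$1
    dir=$2
    shift 2
    while IFS= read -r id || [ -n "$id" ]; do
        [ -n "$id" ] || continue          # skip blank lines
        f="$dir/$id.xml"
        if [ -f "$f" ]; then
            # Redirect stdin so COMMAND cannot consume the list itself.
            "$@" "$f" < /dev/null
        else
            echo "missing: $f" >&2
        fi
    done < "$list"
}

# Example (not run here): count elements in every file of a subset
#   for_each_xml dev.short <your_local_directory> xml el
```

Reading the list with "while read" rather than backticks keeps identifiers intact even if they contain unusual characters.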
xml sel -t -m "//segments[@annotation_id='transcript_orig']" -m "segment" -n -v "concat(//speaker[@id=current()/@who]/@name,'|',text())" 20080522_190000_bbctwo_jonathan_meades_magnetic_north.xml
xml sel -t -m "//segments[@annotation_id='transcript_align']" -m "segment" -n -v "concat(//speaker[@id=current()/@who]/@name,'|',@starttime,'|',@endtime,'|',@PMER,'|',@WMER,'|',@AWD,'|')" -m "element" -v "concat(text(),' ')" 20080505_183000_bbcfour_the_sky_at_night.xml
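The pipe-separated output of the transcript_align command above can be post-processed with standard tools. A minimal sketch follows, assuming WMER (the fifth field) is a numeric mismatch score for which lower values mean closer agreement between subtitles and audio; this page does not define it, so check that assumption against the data. The helper name filter_by_wmer is hypothetical.

```sh
# Hypothetical helper: filter_by_wmer MAX_WMER
# Reads lines of the form
#   speaker|starttime|endtime|PMER|WMER|AWD|words...
# on stdin and prints only those whose WMER is at most MAX_WMER.
# Blank or incomplete lines are dropped.
filter_by_wmer() {
    awk -F'|' -v max="$1" 'NF >= 7 && $5 != "" && ($5 + 0) <= (max + 0)'
}

# Example (not run here): pipe the transcript_align command above into
#   ... | filter_by_wmer 40
```

The threshold is in whatever unit the WMER attribute uses, so inspect a few segments before choosing a value.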
These guidelines were used for manual transcription of the development and evaluation sets.