The challenge

The first edition of the Multi-Genre Broadcast (MGB-1) Challenge is an evaluation of speech recognition, speaker diarization, and lightly supervised alignment using TV recordings in English.

The speech data is broad and multi-genre, spanning the whole range of TV output, and represents a challenging task for speech technology.

In 2015, the challenge used data from the British Broadcasting Corporation (BBC). It was an official challenge of the 2015 IEEE Automatic Speech Recognition and Understanding Workshop.

Data

In-domain audio and text data for training acoustic and language models will be provided to each participant in the challenge. Subject to the BBC's discretion, participants will be granted a license to use the data for non-commercial purposes following completion of the challenge.

Data provided includes:

  • Approximately 500 hours of broadcast audio taken from seven weeks of BBC output across all TV channels.
  • Captions as originally broadcast on TV, accompanied by baseline lightly supervised alignments produced using an ASR system, with confidence measures.
  • Several hundred million words of subtitle text from BBC TV output collected over a 15-year period.
  • A hand-compiled British English lexicon derived from Combilex.

For each evaluation condition, a hand-transcribed development set will be provided, containing 8-15 hours of speech, along with scoring scripts, and, for the speech-to-text transcription task, a Kaldi recipe. In addition to the full seven-week acoustic model training set, a smaller one-week set is defined for quick turnaround of experiments.

The data is available to challenge participants only and is subject to the terms of a licence agreement with the BBC.

Evaluation tasks

Participants in the challenge can enter any of the four tasks:

  1. Speech-to-text transcription of broadcast television
  2. Alignment of broadcast audio to a subtitle file, i.e. lightly supervised alignment
  3. Longitudinal speech-to-text transcription of a sequence of episodes of the same programme
  4. Longitudinal speaker diarization and linking across a sequence of episodes of the same programme

Tasks are described in more detail below. Each task has one or more primary evaluation conditions and possibly a number of contrastive conditions. To enter a task, participants must submit at least one system which fulfils the primary evaluation conditions. Note that signing the MGB challenge data license requires you to participate in at least one task.

Scoring tools for all tasks will be available shortly.

Rules for all tasks

  • Only audio data (train.full) and language model data (mgb.stripped.lm, mgb.normalised.lm) supplied by the organisers can be used for transcription and alignment tasks. All metadata supplied with training data can be used.

  • Any lexicon can be used.

Transcription

This is a standard speech transcription task operating on a collection of whole TV shows drawn from diverse genres. Scoring will require ASR output with word-level timings. Segments with overlapping speech will be ignored for scoring purposes, with overlap regions defined so as to minimise the amount of speech removed (at word level where possible). Speaker labels are not required in the hypothesis for scoring. The usual NIST-style mappings will be used to normalise the reference and hypothesis.
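
The precise submission format will be fixed when the scoring scripts are released; purely as an illustration, NIST CTM-style output (recording ID, channel, start time, duration, word, optional confidence) for one of the shows might look like this, with all timings and words invented:

   20080522_190000_bbctwo_jonathan_meades_magnetic_north 1 12.34 0.31 magnetic 0.97
   20080522_190000_bbctwo_jonathan_meades_magnetic_north 1 12.65 0.28 north 0.95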

For the evaluation data, show titles and genre labels will be supplied. Some titles will have appeared in the training data; some will be new. All genre labels will have been seen in the training data. The supplied title and genre information can be used as much as desired. Other metadata present in the development data will not be supplied for the evaluation data, but this does not preclude, for example, the use of development-set metadata to infer properties of shows with the same title in the evaluation data.

There may be shared speakers across training and evaluation data. Participants may automatically identify these themselves and make use of the information. However, each show in the evaluation set should be processed independently, i.e. it will not be possible to link speakers across shows.

Systems for speech/silence segmentation must be trained only on the official training set. A baseline speech/silence segmentation and speaker clustering for the evaluation data will be supplied for participants who do not wish to build these systems. Any speaker clustering supplied will not link speakers between the training, development and evaluation sets.

Alignment

In this task, participants will be supplied with a tokenised version of the subtitles as they originally appeared on TV (this appears in the XML as transcript_orig in the metadata download). The task is to align these to the spoken audio at word level, where possible. Scoring will be performed by a script (to be supplied to participants shortly) that calculates a precision/recall measure for each spoken word, derived from a careful manual transcription. Participants should supply output containing exactly the words in the subtitles, with word timings added.

It should be noted that TV captioning often differs from the actual spoken words for a variety of reasons: there may be edits to enhance clarity, paraphrasing, and deletions where the speech is too fast. There may therefore be words in the captions that do not appear in the reference. Participants will be able to flag words they have identified as not actually appearing in the audio. This will not affect the scoring explicitly, as scoring will use the precision/recall of words actually present in the reference.
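
As a sketch of how the measure works (the exact matching criterion, such as the timing tolerance within which an output word counts as correctly placed, is defined by the scoring script):

   precision = (# output words matched to a reference word) / (# words in the output)
   recall    = (# reference words matched by an output word) / (# words in the reference)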

As in the transcription task, it will be possible to make use of the show title and genre labels, and of any automatic speaker labelling across shows that participants choose to generate. Speaker change information present in the subtitles will also be supplied.

Download Instructions

You will not be able to download any data without first having registered and received instructions by email. Any use of this data is bound by the MGB-Challenge data license agreement which must already have been signed and returned to the MGB-challenge team. See the registration page for details of how to receive an agreement form.

Audio

Unzip the download script and use it like this:

sh download.sh <your_username> <your_password> train18.list|dev18.list <your_local directory> wav
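
For example, to fetch the training audio into a local directory (the username, password and path below are placeholders):

sh download.sh jsmith mypassword train18.list /data/mgb wav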

The audio format is:

RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, mono 16000 Hz
(~108MB per hour)

Audio downloads should include checksum files. If that is not the case, here is a set of checksum files for the different data subsets.
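
Assuming the checksums are MD5 (the checksum file name below is a placeholder), a completed download can then be verified with:

md5sum -c train18.md5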

British English Lexicon

By downloading this Combilex British English Lexicon, you are agreeing to the terms of a limited research license for the MGB Challenge, which may be viewed here.


XML Metadata

You can download the metadata in the same way as audio by replacing wav with xml in the command above. Below is a set of zipped archives that will be faster to download.

Language Model

Content of XML files

All XML files contain at least three versions of the transcript:

  • transcript_orig is a normalised (tokenised) version of the original transcription provided by the BBC, including colour-based speaker-change markers
  • transcript_align is the lightly supervised alignment, with associated PMER (phone-level MER), WMER (word-level MER) and AWD (average word duration)
  • transcript_lsdecode is the output of the lightly supervised decoding, with associated confidence scores (CS)

Dev and eval sets will also contain a fourth transcript:

  • transcript_human: the manual transcript. This is currently not normalised and will contain characters as per the transcription guidelines. Scoring will not be done against this version; a normalised version will be provided for that purpose.

The dev set alone will contain a fifth version:

  • segmentation_only: a baseline segmentation for experiments
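
As an illustration of the structure these annotations share, a transcript_align segment might look like the sketch below; the attribute values are invented, and the schema linked in the next section is definitive:

   <speaker id="S1" name="JONATHAN MEADES"/>
   <segments annotation_id="transcript_align">
     <segment who="S1" starttime="12.34" endtime="15.10" PMER="5.2" WMER="8.1" AWD="0.31">
       <element>magnetic</element>
       <element>north</element>
     </segment>
   </segments>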

Extracting information from XML files

XML files conform to this schema and can be used with any XML-aware software. The examples here use XMLStarlet. These examples work on a single XML file, but you can automate over all files in a particular data set using a command like:

for file in $(cat dev.short) ; do <some action on $file.xml> ; done

All the data subsets are part of the download script zip.
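
For instance, to count the segments in each file of the dev.short subset (the count action is just an illustrative choice):

for file in $(cat dev.short) ; do xml sel -t -v "count(//segment)" -n $file.xml ; done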

  • Extract the speaker colours and captions from the original BBC subtitles:

     xml sel -t -m "//segments[@annotation_id='transcript_orig']" -m "segment" \
        -n -v "concat(//speaker[@id=current()/@who]/@name,'|',text())" \
        20080522_190000_bbctwo_jonathan_meades_magnetic_north.xml
  • Find the PMER (phone-level MER), WMER (word-level MER) and AWD (average word duration) for each segment of the lightly supervised alignment - the first element in the concat clause finds the original speaker name instead of the generated ID:

     xml sel -t -m "//segments[@annotation_id='transcript_align']" -m "segment" -n \
        -v "concat(//speaker[@id=current()/@who]/@name,'|',@starttime,'|',@endtime,'|',@PMER,'|',@WMER,'|',@AWD,'|')" \
        -m "element" -v "concat(text(),' ')" 20080505_183000_bbcfour_the_sky_at_night.xml

Manual Transcription Guidelines

These guidelines were used for manual transcription of the development and evaluation sets.

Scoring Scripts

Organizers

  • P Bell, J Kilgour, M Wester, S Renals (University of Edinburgh)
  • P Lanchantin, X Liu, MJF Gales, PC Woodland (Cambridge University)
  • O Saz, T Hain (University of Sheffield)
  • A McParland (BBC R&D)