Download Instructions – English data

You cannot download any data until you have registered and received instructions by email. Any use of this data is bound by the MGB Challenge data license, which must already have been signed and returned to the MGB Challenge team. See the registration page.

Audio

Unzip the download script and use it like this:

sh download.sh <your_username> <your_password> train.full|train.short|dev.full|dev.short|dev.longitudinal <your_local directory> wav

Audio format is

RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, mono 16000 Hz
(~108MB per hour)
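
If you want to confirm that a downloaded file matches the format above, a minimal Python sketch along the following lines may help. It uses only the standard-library wave module; the function name check_wav and the dictionary keys are illustrative choices, not part of the MGB tools.

```python
import wave

# Expected parameters, taken from the audio format line above:
# 16-bit (2 bytes per sample), mono, 16000 Hz.
EXPECTED = {"channels": 1, "sample_width_bytes": 2, "sample_rate_hz": 16000}


def check_wav(path):
    """Return a list of mismatches between a WAV file and the expected format.

    An empty list means the header matches. check_wav is a hypothetical
    helper written for this example.
    """
    with wave.open(path, "rb") as wav:
        actual = {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": wav.getframerate(),
        }
    return [
        f"{key}: expected {want}, found {actual[key]}"
        for key, want in EXPECTED.items()
        if actual[key] != want
    ]
```

Calling check_wav on one of your downloaded .wav files should return an empty list when the header is as described.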

Audio downloads should include checksum files. If yours do not, here is a set of checksum files for the different data subsets.
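
The layout of the checksum files is not described here, so the following Python sketch rests on an assumption: that each line has the md5sum-style form "<hex digest>  <file name>". The function name verify_checksums and the default algorithm "md5" are also assumptions; adjust both to match the files you actually receive.

```python
import hashlib
from pathlib import Path


def verify_checksums(checksum_file, data_dir, algorithm="md5"):
    """Return the names of files that are missing or whose digest differs.

    ASSUMPTION: each non-empty line looks like '<hex digest>  <file name>'
    (md5sum/sha1sum style). The real MGB checksum files may differ.
    """
    failures = []
    for line in Path(checksum_file).read_text().splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        expected, name = parts[0].lower(), parts[1].strip().lstrip("*")
        target = Path(data_dir) / name
        if not target.is_file():
            failures.append(name)
            continue
        digest = hashlib.new(algorithm)
        with target.open("rb") as fh:
            # Read in 1 MiB chunks so large audio files are not loaded whole.
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                digest.update(chunk)
        if digest.hexdigest() != expected:
            failures.append(name)
    return failures
```

An empty result means every listed file was found and matched its digest.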

British English Lexicon

By downloading this Combilex British English Lexicon, you are agreeing to the terms of a limited research license for the MGB Challenge, which may be viewed here.

I have read and agree to the terms of the Combilex research license.

XML Metadata

You can download the metadata in the same way as audio by replacing wav with xml in the command above. Below is a set of zipped archives that will be faster to download.

Evaluation metadata

To extract the speaker colours and captions for alignment in Task 2, you can use this XMLStarlet example command:

xml sel -t -m "//segments[@annotation_id='transcript_orig']" -m "segment" \
    -n -v "concat(//speaker[@id=current()/@who]/@name,'|',text())" 20080520_200000_bbcone_holby_city.xml

Note that this is very slightly different from the command used to extract the equivalent data from the development set.
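
If XMLStarlet is not available, roughly the same extraction can be sketched in Python with the standard-library xml.etree.ElementTree module. This is an approximation rather than an official tool: it assumes the element and attribute names used in the command above (segments, segment, speaker, annotation_id, who, id, name) and that the files use no XML namespace. The function name speaker_captions is hypothetical.

```python
import xml.etree.ElementTree as ET


def speaker_captions(xml_path):
    """Return 'speaker name|caption text' strings, one per segment.

    Mirrors the XMLStarlet command above under the stated assumptions:
    only segments inside <segments annotation_id="transcript_orig"> are
    used, and each segment's 'who' is looked up among <speaker id=...>.
    """
    root = ET.parse(xml_path).getroot()

    # Map speaker id -> name, keeping the first occurrence of each id.
    names = {}
    for speaker in root.iter("speaker"):
        speaker_id = speaker.get("id")
        if speaker_id is not None:
            names.setdefault(speaker_id, speaker.get("name", ""))

    lines = []
    for segments in root.iter("segments"):
        if segments.get("annotation_id") != "transcript_orig":
            continue
        for segment in segments.findall("segment"):
            name = names.get(segment.get("who"), "")
            # segment.text is the text before any child element,
            # like the first text() node in the XPath expression.
            lines.append(f"{name}|{segment.text or ''}")
    return lines
```

Printing the returned list one item per line gives output comparable to the command above, apart from the leading blank line that -n produces.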

Language Model

Original BBC XML Subtitle files

You can download the original BBC subtitle files (XML format) by replacing wav with bbc.xml in the command above. Note that transcript_orig in the XML metadata files is a normalized version of the BBC subtitles.

 

Further Information

Description of MGB metadata

MGB_metadata_description.pdf describes the metadata provided for the transcription task of the MGB challenge.

Description of data sets

MGB_showinfo.xlsx has statistics on each of the training and dev set programmes. Summary statistics for the data sets are below; each .short data set is a strict subset of the corresponding .full set:

Data set          Programmes  Total duration (h)  Aligned speech (h)  Aligned segments  Words
train.full        2193        1580                1197                635827            10566560
dev.full          47          28                  20                  13165             183811
train.short       274         199                 152                 81027             1373913
dev.short         12          8                   6                   3583              51466
dev.longitudinal  19          12                  8.5                 5962              72884
eval.task1        16          11
eval.task3        19          14
eval.task4        19          14

Tasks 3 and 4 have identical eval sets.

Content of XML files

All XML files contain at least three versions of the transcript:

Dev and eval sets will also contain a fourth transcript:

The dev set alone will contain a fifth version:

Extracting information from XML files

XML files conform to this schema and can be used with any XML-aware software. The examples here use XMLStarlet. Each example works on a single XML file, but you can automate it for all files in a particular data set using a command like:

for file in `cat dev.short` ; do <some action on $file.xml> ; done

The lists for all the data subsets (such as dev.short) are part of the download script zip.
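
The same iteration can be sketched in Python for scripts that process the XML directly. Like the shell loop, it assumes that a subset list such as dev.short contains whitespace-separated programme names without the .xml extension; the function name xml_files and the directory argument are illustrative.

```python
from pathlib import Path


def xml_files(list_path, xml_dir):
    """Yield the XML path for every programme name in a subset list file.

    ASSUMPTION: the list holds whitespace-separated names with no file
    extension, which is what the shell loop above implies.
    """
    for name in Path(list_path).read_text().split():
        yield Path(xml_dir) / f"{name}.xml"
```

For example, iterating over xml_files("dev.short", "xml") yields one path per programme, which you can then pass to whatever processing you need.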

Manual Transcription Guidelines

These guidelines were used for manual transcription of the development and evaluation sets.

Scoring Scripts

Scoring References

A zip of all scoring references is here. Alternatively, you can download the individual files you need below.

Further Manual Annotation

Speaker ID

10/12/2015 - 10 hours of data from the train.short set has been manually annotated with accurate speaker IDs. Speaker IDs are unique across shows. Where more than two voices are heard simultaneously and cannot be distinguished, the associated speaker element will have the attribute multiple set to true. Where two speakers overlap, the individual speakers will always be distinguished. Extract as RTTM using: