Download Instructions – English data

You will not be able to download any data without first having registered and received instructions by email. Any use of this data is bound by the MGB Challenge data license agreement, which must already have been signed and returned to the MGB Challenge team. See the registration page for details of how to receive an agreement form.


Unzip the download script and use it like this:

sh <your_username> <your_password> train17a.list|dev17a.list <your_local directory> wav

The audio format is:

RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, mono 16000 Hz
(~108MB per hour)
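
The per-hour figure can be checked from the format itself: 16000 samples per second, 2 bytes per sample, one channel. The following is a minimal sketch for estimating disk space before downloading. The function names are illustrative and not part of the download script, and the small WAV header on each file is ignored.

```sh
#!/bin/sh
# Estimate the size of 16-bit mono 16 kHz PCM audio (WAV headers ignored).
SAMPLE_RATE=16000      # samples per second
BYTES_PER_SAMPLE=2     # 16 bit
CHANNELS=1             # mono

# wav_bytes_for_hours HOURS
# Print the number of bytes of raw audio for a whole number of hours.
wav_bytes_for_hours() {
    echo $(( $1 * 3600 * SAMPLE_RATE * BYTES_PER_SAMPLE * CHANNELS ))
}

# wav_mib_for_hours HOURS
# Same estimate in MiB (1048576 bytes), rounded down.
wav_mib_for_hours() {
    echo $(( $(wav_bytes_for_hours "$1") / 1048576 ))
}
```

For example, `wav_bytes_for_hours 1` prints 115200000, which is about 110 MiB and close to the approximate figure quoted above.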

Audio downloads should include checksum files. If they do not, here is a set of checksum files for the different data subsets.
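
Once the audio and a checksum file are in the same directory, the download can be verified before use. The sketch below assumes the checksum files are in the format written by md5sum (one hash and file name per line). That is an assumption, not something this page states. If the files use another algorithm, pass the matching tool (for example sha1sum) as the third argument. The function name is illustrative.

```sh
#!/bin/sh
# verify_checksums DIR CHECKSUM_FILE [TOOL]
# Check every file listed in CHECKSUM_FILE against its recorded hash.
# CHECKSUM_FILE is given relative to DIR (or as an absolute path).
# TOOL defaults to md5sum; this default is an assumption about the
# checksum format, not something documented by the data providers.
verify_checksums() {
    dir=$1
    sums=$2
    tool=${3:-md5sum}
    # File names inside the checksum file are taken relative to DIR,
    # so run the check from there, in a subshell to keep the caller's cwd.
    ( cd "$dir" && "$tool" -c "$sums" )
}
```

The function returns a non-zero status if any listed file is missing or does not match its hash, so it can be used directly in an `if` or followed by `|| exit 1`.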

British English Lexicon

By downloading this Combilex British English Lexicon, you are agreeing to the terms of a limited research license for the MGB Challenge, which may be viewed here.

XML Metadata

You can download the metadata in the same way as audio by replacing wav with xml in the command above. Below is a set of zipped archives that will be faster to download.

Language Model


Content of XML files

All XML files contain at least three versions of the transcript:

Dev and eval sets will also contain a fourth transcript:

The dev set alone will contain a fifth version:

Extracting information from XML files

XML files conform to this schema and can be used with any XML-aware software. The examples here use XMLStarlet. Each example works on a single XML file, but you can run it over all files in a particular data set using a command like:

for file in `cat dev.short` ; do <some action on $file.xml> ; done

The list files for all data subsets (such as dev.short) are included in the download script zip.
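
The loop above can also be written as a small reusable function that reads the list file line by line instead of through `cat`. This is only a sketch: the function name `for_each_xml` and the directory argument are illustrative, and it assumes that each line of the list file is a file name without the .xml extension, as in the loop above.

```sh
#!/bin/sh
# for_each_xml LIST_FILE XML_DIR COMMAND [ARGS...]
# Run COMMAND [ARGS...] once per entry in LIST_FILE, appending the path
# XML_DIR/<entry>.xml as the last argument. Stops at the first failure.
for_each_xml() {
    list=$1
    xml_dir=$2
    shift 2
    # The "|| [ -n ... ]" part also handles a last line with no newline.
    while IFS= read -r name || [ -n "$name" ]; do
        # Skip empty lines in the list file.
        [ -n "$name" ] || continue
        # Redirect stdin so COMMAND cannot swallow the rest of the list.
        "$@" "$xml_dir/$name.xml" < /dev/null || return 1
    done < "$list"
}
```

For example, `for_each_xml dev.short ./xml xmlstarlet val -w` would check each file for well-formedness, and `for_each_xml dev.short ./xml xmlstarlet el -u` would list the element paths that occur in each file. Here `./xml` stands for whatever local directory the metadata was downloaded to, and on some systems the XMLStarlet binary is installed as `xml` rather than `xmlstarlet`.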

Manual Transcription Guidelines

These guidelines were used for manual transcription of the development and evaluation sets.

Scoring Scripts