You will not be able to download any data without first having registered and received instructions by email. Any use of this data is bound by the MGB-Challenge data license agreement, which must already have been signed and returned to the MGB-Challenge team. See the registration page for details of how to receive an agreement form.
Unzip the download script and use it like this:
sh download.sh <your_username> <your_password> train18.list|dev18.list <your_local_directory> wav
Audio format is:
RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, mono 16000 Hz (~108MB per hour)
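As a quick sanity check on a downloaded file, the header fields can be read directly. The following is a minimal sketch, not part of the official tooling: it assumes the canonical 44-byte WAV header (the `fmt ` chunk immediately after `RIFF....WAVE`) and a little-endian host, and the function name `check_wav_format` is hypothetical.

```sh
# Hypothetical helper: check that a WAV file is PCM, mono, 16 kHz, 16 bit.
# Assumes a canonical 44-byte header and a little-endian host (od reads
# multi-byte integers in host byte order).
check_wav_format() {
    wav=$1
    # Read an unsigned integer of $2 bytes at byte offset $1.
    field() { od -An -t "u$2" -j "$1" -N "$2" "$wav" | tr -d ' \n'; }

    codec=$(field 20 2)      # 1 = Microsoft PCM
    channels=$(field 22 2)   # 1 = mono
    rate=$(field 24 4)       # samples per second
    bits=$(field 34 2)       # bits per sample

    echo "codec=$codec channels=$channels rate=$rate bits=$bits"
    [ "$codec" = 1 ] && [ "$channels" = 1 ] && [ "$rate" = 16000 ] && [ "$bits" = 16 ]
}
```

The function prints the fields it read and returns a non-zero status if any of them differs from the format stated above, so it can be used in a condition such as `check_wav_format some_file.wav || echo "unexpected format"`.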
Audio downloads should include checksum files. If they do not, here is a set of checksum files for the different data subsets.
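Once the audio and a checksum file are in the same directory, the files can be verified in one pass. This is only a sketch: the page does not state which checksum algorithm is used, so the example assumes `md5sum`-style files (lines of `<hash>  <filename>`), and the function name `verify_subset` is hypothetical.

```sh
# Hypothetical helper: verify downloaded files against a checksum file.
# Assumes md5sum-style lines ("<hash>  <filename>") with paths relative
# to the download directory; swap in sha1sum or sha256sum if the
# checksum files turn out to use a different algorithm.
verify_subset() {
    dir=$1        # directory holding the downloaded files
    sums=$2       # checksum file (absolute path, or relative to $dir)
    # Prints one line per file; the exit status is non-zero on any mismatch.
    ( cd "$dir" && md5sum -c "$sums" )
}
```

A non-zero exit status indicates that at least one file is missing or corrupted and should be downloaded again.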
By downloading this Combilex British English Lexicon, you are agreeing to the terms of a limited research license for the MGB Challenge, which may be viewed here.
I have read and agree to the terms of the Combilex research license.
You can download the metadata in the same way as audio by replacing wav with xml in the command above. Below is a set of zipped archives that will be faster to download.
All XML files contain at least three versions of the transcript:
XML files conform to this schema and can be used with any XML-aware software. The examples here use XMLStarlet. Each example works on a single XML file, but you can run it over all files in a particular data set using a command like:
for file in `cat dev.short` ; do <some action on $file.xml> ; done
All the data subsets are part of the download script zip.
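A slightly more defensive version of that loop is sketched below. It reads the list line by line, skips blank lines, and reports recordings whose XML file is missing instead of passing a non-existent path to the action. The function name `for_each_xml` and its argument order are hypothetical. As in the loop above, the list is assumed to hold one recording ID per line, without the `.xml` extension.

```sh
# Hypothetical helper: run a command on the XML file of every recording
# named in a list file (one recording ID per line, without extension).
for_each_xml() {
    list=$1; xml_dir=$2; shift 2
    while IFS= read -r id || [ -n "$id" ]; do
        [ -n "$id" ] || continue            # skip blank lines
        xml="$xml_dir/$id.xml"
        if [ -f "$xml" ]; then
            "$@" "$xml" < /dev/null         # the action, with the file appended
        else
            echo "missing: $xml" >&2        # report gaps instead of aborting
        fi
    done < "$list"
}
```

For example, `for_each_xml dev.short <your_local_directory> wc -c` would print the size of each XML file in the list. The action's standard input is redirected from `/dev/null` so that it cannot consume the remaining lines of the list.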
xml sel -t -m "//segments[@annotation_id='transcript_orig']" -m "segment" -n -v "concat(//speaker[@id=current()/@who]/@name,'|',text())" 20080522_190000_bbctwo_jonathan_meades_magnetic_north.xml
xml sel -t -m "//segments[@annotation_id='transcript_align']" -m "segment" -n -v "concat(//speaker[@id=current()/@who]/@name,'|',@starttime,'|',@endtime,'|',@PMER,'|',@WMER,'|',@AWD,'|')" -m "element" -v "concat(text(),' ')" 20080505_183000_bbcfour_the_sky_at_night.xml
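The pipe-delimited output of the command above is easy to post-process with standard tools. As an illustration only, the sketch below sums the aligned time per speaker. It assumes that `starttime` and `endtime` are expressed in seconds, and the function name `speaker_durations` is hypothetical.

```sh
# Hypothetical helper: total aligned speech time per speaker.
# Reads lines of the form name|starttime|endtime|PMER|WMER|AWD|words
# on stdin; assumes starttime and endtime are in seconds.
speaker_durations() {
    awk -F'|' '
        NF >= 3 { total[$1] += $3 - $2 }   # lines without fields (e.g. blank) are skipped
        END     { for (name in total) printf "%s|%.2f\n", name, total[name] }
    ' | sort
}
```

Piping the output of the `transcript_align` command into `speaker_durations` gives one `name|seconds` line per speaker, sorted by name.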
These guidelines were used for manual transcription of the development and evaluation sets.