You will not be able to download any data without first having registered and received instructions by email. Any use of this data is bound by the MGB-Challenge data license which must already have been signed and returned to the MGB-challenge team. See the registration page.
Unzip the download script and use it like this:
sh download.sh <your_username> <your_password> train.full|train.short|dev.full|dev.short|dev.longitudinal <your_local directory> wav
Audio format is
RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, mono 16000 Hz (~108 MB per hour)
Audio downloads should include checksum files. If yours do not, here is a set of checksum files for the different data subsets.
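A rough sketch of how a downloaded subset might be verified. It assumes the checksum files use the common `md5sum` format (one hash and file name per line) and that GNU `md5sum` is installed. The checksum type is not stated on this page, so substitute the matching tool if it differs. The function name `verify_subset` and its arguments are hypothetical.

```sh
# verify_subset DIR CHECKSUM_FILE
# Hypothetical helper: checks the files listed in CHECKSUM_FILE against the
# copies in DIR. ASSUMPTION: md5sum-style checksum lines and GNU md5sum.
# CHECKSUM_FILE should be an absolute path (or relative to DIR).
verify_subset() {
    dir=$1
    sums=$2
    # md5sum -c resolves file names relative to the current directory,
    # so run it from inside the download directory, in a subshell so the
    # caller's working directory is left unchanged.
    ( cd "$dir" && md5sum -c "$sums" )
}
```

The function returns a non-zero exit status if any listed file is missing or does not match, so it can be used directly in an `if` test before starting a long experiment.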
By downloading this Combilex British English Lexicon, you are agreeing to the terms of a limited research license for the MGB Challenge, which may be viewed here.
You can download the metadata in the same way as audio by replacing wav with xml in the command above. Below is a set of zipped archives that will be faster to download.
To extract the speaker colours and captions for alignment in Task 2, you can use this XMLStarlet example command:
xml sel -t -m "//segments[@annotation_id='transcript_orig']" -m "segment" -n -v "concat(//speaker[@id=current()/@who]/@name,'|',text())" 20080520_200000_bbcone_holby_city.xml
Note that this is very slightly different to the command used to extract the equivalent data from the development set.
You can download the original BBC subtitle files (XML format) by replacing wav with bbc.xml in the command above. Note that transcript_orig in the XML metadata files is a normalized version of the BBC subtitles.
MGB_metadata_description.pdf describes the metadata provided for the transcription task of the MGB challenge.
MGB_showinfo.xlsx has stats on each of the training and dev set programmes. Below are summary stats for data sets; .short data sets are strict subsets of .full sets:
|Data set||Programmes||Total duration(h)||Aligned speech(h)||Aligned segments||Words|
|eval.task3||19||14||Tasks 3 and 4 have identical eval sets|
All XML files contain at least three versions of the transcript:
XML files conform to this schema and can be used with any XML-aware software. The examples here use XMLStarlet. These examples will work on a single XML file but you can automate for all files in a particular data set using a command like:
for file in `cat dev.short` ; do <some action on $file.xml> ; done
All the data subsets are part of the download script zip.
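Before running a command over a whole subset, it can help to confirm that every metadata file is actually present. The following is a minimal sketch that assumes each subset list (such as `dev.short`) holds one programme ID per line and that the XML files sit in a single directory as `<id>.xml`, mirroring the `$file.xml` pattern in the loop above. The function name `missing_xml` is hypothetical.

```sh
# missing_xml LIST_FILE XML_DIR
# Hypothetical helper: prints each programme ID from LIST_FILE that has no
# corresponding XML_DIR/<id>.xml file.
# ASSUMPTION: LIST_FILE contains one programme ID per line.
missing_xml() {
    list_file=$1
    xml_dir=$2
    while IFS= read -r id; do
        # Skip blank lines in the list.
        [ -n "$id" ] || continue
        # Report the ID only when its XML file is absent.
        [ -f "$xml_dir/$id.xml" ] || printf '%s\n' "$id"
    done < "$list_file"
}
```

For example, `missing_xml dev.short ./xml` prints nothing when the download of that subset's metadata is complete.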
xml sel -t -m "//segments[@annotation_id='transcript_align']" -m "segment" -n -v "concat(//speaker[@id=current()/@who]/@name,'|',@starttime,'|',@endtime,'|',@PMER,'|',@WMER,'|',@AWD,'|')" -m "element" -v "concat(text(),' ')" 20080505_180000_bbcfour_the_book_quiz.xml
xml sel -t -m "//segments[@annotation_id='segmentation_only']" -m "segment" -n -v "concat(//speaker[@id=current()/@who]/@name,'|',@starttime,'|',@endtime)" 20080505_180000_bbcfour_the_book_quiz.xml
xml sel -t -m "//segments[@annotation_id='transcript_human']" -m "segment" -n -v "concat(@who,'|',@starttime,'|',@endtime,'|')" -m "element" -v "concat(text(),' ')" 20080505_180000_bbcfour_the_book_quiz.xml
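The pipe-delimited output of the commands above is easy to post-process with standard tools. As an illustration, this sketch sums segment durations from lines whose second and third fields are the start and end times, which is the field order produced by the three commands above. It assumes `@starttime` and `@endtime` are numeric values in seconds and that speaker names contain no `|` character. The function name `total_duration` is hypothetical.

```sh
# total_duration
# Hypothetical helper: reads lines of the form name|start|end[|...] on stdin
# and prints the summed (end - start), to two decimal places.
# ASSUMPTION: start and end are numeric times in seconds.
total_duration() {
    # NF >= 3 skips the blank leading line that `-n` emits before the
    # first segment, and any other line without both time fields.
    awk -F'|' 'NF >= 3 { total += $3 - $2 } END { printf "%.2f\n", total }'
}
```

Piping the output of the `segmentation_only` command into `total_duration` would then give the total segmented time for that programme, under the assumptions stated.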
These guidelines were used for manual transcription of the development and evaluation sets.
A zip of all scoring references is here. Alternatively, you can download the individual files you need below.
10/12/2015 - 10 hours of data from the train.short set have been manually annotated with accurate speaker IDs. Speaker IDs are unique across shows. Where more than two voices are heard simultaneously and cannot be distinguished, the associated speaker element will have an attribute multiple equal to true. Where two speakers overlap, the individual speakers will always be distinguished. Extract as RTTM using: