Kaldi recipe

A Kaldi recipe, recipe.zip, for building a baseline English ASR system on the official training data is available to all registered participants. The recipe is currently in beta: it trains a basic speaker-adapted GMM system and an n-gram LM on the official training data. It will be extended in due course to provide a competitive baseline system using sequence-trained hybrid DNNs.

For Arabic, the following GitHub repository has a Kaldi recipe for a sequence-trained DNN system. The recipe uses 250 hours of audio for acoustic model training and the corresponding text for LM training. Baseline error rates were reported on the 10-hour verbatim-transcribed development set: 34% on non-overlapped speech (8.5 hours) and 73% on overlapped speech (1.5 hours).

Both recipes require Kaldi, SRILM and xmlstarlet. The IRSTLM toolkit is also needed, but can be downloaded as part of the Kaldi build process.
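Before starting, it can save time to confirm that the external tools are actually on your PATH. The sketch below checks for xmlstarlet and SRILM's ngram-count (the tool names are assumptions based on the dependencies listed above; the recipe scripts may call others as well), and notes how IRSTLM can be built from the Kaldi tools directory.

```shell
# Sketch: check that the external tools the recipes call are visible on PATH.
# ngram-count is SRILM's LM estimation tool; xmlstarlet parses the XML metadata.
missing=""
for tool in xmlstarlet ngram-count; do
  command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
done
if [ -n "$missing" ]; then
  echo "Missing tools:$missing -- install them before running the recipe."
else
  echo "All required tools found."
fi
# IRSTLM can be built as part of the Kaldi tools setup, e.g.:
#   cd $KALDI_ROOT/tools && extras/install_irstlm.sh
```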

Familiarity with other Kaldi recipes is required. In its current form, the recipe does not include the steps needed to perform decoding with the trained models (this will be added when the scoring scripts are released). In making the recipe available now without the decoding step, our primary aim is to simplify the task of building a Kaldi setup for the MGB Challenge with correctly formatted segmentations, transcriptions, text data, lexicon, etc.; we assume that users will already know how to run decoding with Kaldi models.

Instructions

The recipes assume that you have downloaded the audio files, the corresponding lexicon, XML metadata and language model training text. See the download page for details.

Unpack the recipe.zip file. In the root directory containing run.sh, add links to the Kaldi utils/ and steps/ directories and create copies of your usual path.sh and cmd.sh scripts. Also copy the train.full data list to this directory.
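The setup step above can be sketched as follows, assuming Kaldi's standard egs layout and that KALDI_ROOT points at your Kaldi checkout (both are assumptions; the /path/to/... placeholders must be replaced with your own locations):

```shell
# Run from the unpacked recipe's root directory (the one containing run.sh).
KALDI_ROOT=${KALDI_ROOT:-$HOME/kaldi}   # adjust to your Kaldi checkout

# Link the shared Kaldi script directories from an existing egs setup.
ln -sfn "$KALDI_ROOT/egs/wsj/s5/utils" utils
ln -sfn "$KALDI_ROOT/egs/wsj/s5/steps" steps

# Copy your usual path.sh / cmd.sh and the train.full data list alongside run.sh.
for f in /path/to/your/path.sh /path/to/your/cmd.sh /path/to/train.full; do
  if [ -f "$f" ]; then cp "$f" .; fi
done
```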

Edit the run.sh script to set paths to the various tools and to specify the locations of the directories containing the WAV files, XML files, language model training text and Combilex lexicon. Running this script should create a GMM system, tri4, trained with LDA+MLLT+SAT on a subset of the full data selected according to lowest Matching Error Rate (MER).
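The configuration edit above might look like the following. Note that the variable names here are hypothetical illustrations, not taken from the actual run.sh; adjust them to match whatever names the script you unpacked uses:

```shell
# Hypothetical configuration section of run.sh -- variable names are assumed,
# and the /path/to/... placeholders must be replaced with your own locations.
wav_dir=/path/to/wav            # downloaded audio files
xml_dir=/path/to/xml            # per-programme XML metadata
lm_text=/path/to/lm_train.txt   # language model training text
lexicon=/path/to/combilex       # Combilex pronunciation lexicon

# Then launch the recipe; on success the tri4 (LDA+MLLT+SAT) system
# should appear under exp/:
#   ./run.sh
```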

The recipe comes with no guarantees, but feel free to email with any comments or suggestions.