The Arabic track for the 2017 multi-dialect, multi-genre evaluation (MGB-3) is an extension of the 2016 evaluation (MGB-2).
In addition to the 1,200 hours of Aljazeera TV programs used in 2016, this year's evaluation explores multi-genre data: comedy, cooking, cultural, environment, family-kids, fashion, movies-drama, sports and science talks (TEDx).
MGB-3: This year we are using 16 hours of multi-genre data collected from different YouTube channels. The 16 hours have been manually transcribed. The chosen Arabic dialect for this year is Egyptian. Given that dialectal Arabic has no standard orthographic rules, each program has been transcribed by four different transcribers following these transcription guidelines. The MGB-3 data is split into three groups: adaptation, development and evaluation data, the last of which will be shared at evaluation time as shown in the dates section.
MGB-2: The 1,200 hours of Aljazeera TV programs have been manually captioned, with no timing information. The QCRI Arabic ASR system was used to recognise all programs, and the ASR output was then used to align the manual captions and produce speech segments for training speech recognition systems. More than 20 hours from 2015 programs have been transcribed verbatim and manually segmented. This data is split into a development set of 10 hours and a similar evaluation set of 10 hours; both were released in the 2016 MGB challenge. The same evaluation set will be used this year.
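The caption-alignment step described above can be sketched as follows. This is a hypothetical illustration, not the actual MGB pipeline: it assumes the ASR pass yields time-stamped words, finds runs of words shared between the ASR output and the untimed caption, and transfers the ASR timings onto those caption spans to cut out training segments. The function name and the `min_match` threshold are invented for the example.

```python
# Hypothetical sketch of aligning an untimed manual caption to
# time-stamped ASR output, to derive timed training segments.
from difflib import SequenceMatcher

def align_caption(asr_words, caption_words, min_match=3):
    """asr_words: list of (word, start_sec, end_sec) from the recogniser.
    caption_words: list of caption words (no timing).
    Returns (start, end, words) tuples for matched caption spans."""
    matcher = SequenceMatcher(a=[w for w, _, _ in asr_words],
                              b=caption_words)
    segments = []
    for block in matcher.get_matching_blocks():
        if block.size >= min_match:  # keep only confident anchor runs
            start = asr_words[block.a][1]
            end = asr_words[block.a + block.size - 1][2]
            segments.append(
                (start, end, caption_words[block.b:block.b + block.size]))
    return segments
```

In practice the real pipeline must also handle caption/ASR mismatches, punctuation and normalisation; the sketch only shows the anchoring idea.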
Metadata for each program includes the title, genre tag, and date/time of transmission. The original set of data for this period contained about 1,500 hours of audio, obtained from all shows; we have removed programs with damaged aligned transcriptions. The aligned, segmented transcription will be shared, as well as the original raw transcription (which has no time information).
For each program, we will share the following:
Egyptian broadcast data collected from YouTube. This year, we collected about 80 programs from different YouTube channels. The first 12 minutes of each program have been transcribed and released. This sums to roughly 16 hours in total, divided as follows:
All programs have been transcribed by four different annotators to explore the non-orthographic nature of dialectal Arabic.
For each program, we will share the following:
Participants can enter either of two tasks:
Scoring tools for all tasks will be available in a GitHub repository. We will release the multi-reference word error rate (MR-WER) code used to evaluate MGB-3 systems against multiple transcriptions.
This is a standard speech transcription task operating on a collection of whole TV shows drawn from diverse genres. Scoring will require ASR output with word-level timings. Segments with overlapping speech will be ignored for scoring purposes (where overlap is defined to minimise the regions removed - at segment level where possible). Speaker labels are not required in the hypothesis for scoring. Usual NIST-style mappings will be used to normalise reference and hypothesis. In the MGB-3 competition, scoring will use the multiple-reference word error rate (MR-WER) to account for the non-orthographic nature of dialectal Arabic.
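The MR-WER idea can be sketched briefly. The exact MGB-3 formulation is in the released scoring code; the sketch below assumes one common variant in which each hypothesis segment is scored against every available reference transcription, the best-matching reference is kept, and errors and reference lengths are pooled over segments. Function names here are illustrative, not the official tool.

```python
# Sketch of a multi-reference WER (MR-WER) computation.
# Assumption: per segment, score the hypothesis against each of the
# (orthographically different) references and keep the closest one.

def edit_distance(hyp, ref):
    """Word-level Levenshtein distance between two token lists."""
    d = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        prev, d[0] = d[0], i
        for j, r in enumerate(ref, 1):
            prev, d[j] = d[j], min(d[j] + 1,         # deletion
                                   d[j - 1] + 1,     # insertion
                                   prev + (h != r))  # substitution
    return d[-1]

def mr_wer(segments):
    """segments: list of (hypothesis, [ref1, ref2, ...]) pairs,
    each transcription a whitespace-separated string."""
    total_errs = total_words = 0
    for hyp, refs in segments:
        hyp_toks = hyp.split()
        # pick the reference the hypothesis matches best
        errs, ref_len = min(
            (edit_distance(hyp_toks, r.split()), len(r.split()))
            for r in refs)
        total_errs += errs
        total_words += ref_len
    return total_errs / total_words
```

With four transcribers per program, this lets a system avoid being penalised for picking any one of several legitimate spellings of the same dialectal word.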
MGB-2: For the evaluation data, show titles and genre labels will be supplied. Some titles will have appeared in the training data, and some will be new. All genre labels will have been seen in the training data. The supplied title and genre information can be used as much as desired. Other metadata present in the development data will not be supplied for the evaluation data, but this does not preclude, for example, the usage of metadata for the development set to infer properties of shows with the same title in the evaluation data.
There will be shared speakers across training and evaluation data. It is possible for participants to automatically identify these themselves and make use of the information. However, each program in the evaluation set should be processed independently.
MGB-3: This year, we are releasing 5 hours for adaptation and 5 hours for development, to explore using them to get better results on dialectal data such as Egyptian comedy. We assume the MGB-3 data is not enough by itself to build a robust Arabic speech recognition system, but it could be quite useful for adaptation and for hyper-parameter tuning of models built using the MGB-2 data.
In this task, participants will be supplied with more than 50 hours of labelled data, divided across the five major Arabic dialects: Egyptian (EGY), Levantine (LAV), Gulf (GLF), North African (NOR), and Modern Standard Arabic (MSA). Participants are encouraged to use the 10 hours per dialect to label more data from both the MGB-2 and MGB-3 data. Dialectal data and baseline code will be shared in the QCRI dialect ID GitHub repository. Overall accuracy across the five dialects will be the evaluation criterion. The test data will be shared at evaluation time, as shown in the dates section. Participants should specify one dialect for each audio file.
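The evaluation criterion for this task is simple enough to state as code. The sketch below is illustrative (the function name and data layout are assumptions, not the released scoring tool): each audio file receives exactly one of the five dialect codes, and systems are ranked by overall accuracy.

```python
# Minimal sketch of the dialect ID evaluation criterion:
# one dialect label per audio file, scored by overall accuracy.
DIALECTS = {"EGY", "LAV", "GLF", "NOR", "MSA"}

def overall_accuracy(predictions, references):
    """predictions, references: dicts mapping audio-file id -> dialect code."""
    assert all(p in DIALECTS for p in predictions.values())
    correct = sum(predictions[f] == ref for f, ref in references.items())
    return correct / len(references)
```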