This section is based largely on the official Kaldi documentation on data preparation.
These steps are carried out by the script
local/tidigits_data_prep.sh, which takes one parameter: the path to the dataset.
Looking at this section (and the next) makes clear just how valuable AWK and Bash (or equivalents) are for this task.
Locate the Dataset
On the SIP network, the TIDIGITs dataset can be found at
/user/data14/res/speech_data/TIDIGITs/. Symlink it into a convenient location.
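For example (the link name tidigits is an arbitrary choice, not mandated by anything):

```shell
# Symlink the shared dataset into the working directory; "tidigits" is an
# arbitrary link name.
ln -sf /user/data14/res/speech_data/TIDIGITs tidigits
```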
Split the Dataset into test and training
TIDIGITS is already split into test and training datasets. If it were not, you would need to do the split yourself. It could be done at any point during the data preparation step, depending on when the other useful information (from the annotations) becomes available.
Parse its annotations
Annotations of the correct labels for each utterance need to be generated for both the training and test sets.
.scp: Basically just a list of Utterances to Filenames
A Kaldi script (.scp) file is just a mapping from recording IDs to extended filenames.
The recording ID is the first part of each line in a .scp file.
If a speaker ID is available (which it is for TIDIGITs), it should form the first part of the recording ID.
Kaldi requires this not for speaker identification, but for sorting purposes during training (utt2spk handles speaker identification).
The remainder of the recording ID is arbitrary, so long as the whole ID is unique.
For convenience in generating a unique ID, the example script for TIDIGITS uses the recording's base filename.
As there is only one Utterance per recording in TIDIGITS, the Recording ID is the Utterance ID. (See below)
The second part of the line is the extended filename.
Extended filename is the term Kaldi uses for a string that is either the path to a wav-format file, or a shell command that writes wav-format data to standard output, followed by a pipe symbol (|).
As the TIDIGITS data is in the SPHERE audio format, it needs to be converted to wav.
The sample scripts in Kaldi therefore use
sph2pipe to convert it, so the lines of the .scp files will look like this (assuming
sph2pipe is on your PATH; otherwise the full path to the executable is needed):
ad_16a sph2pipe -f wav ../TIDIGITs/test/girl/ad/16a.wav |
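How such lines might be generated is sketched below. This is a hypothetical sketch, not the actual prep script: the directory layout (paths ending in <speaker>/<utterance>.wav) is an assumption based on the example path above, and sph2pipe is assumed to be on the PATH.

```shell
# Hypothetical sketch: build .scp lines from a list of SPHERE files whose
# paths end in <speaker>/<utterance>.wav, as in the TIDIGITs layout above.
make_scp() {
  while read -r path; do
    spkr=$(basename "$(dirname "$path")")   # e.g. "ad"
    utt=$(basename "$path" .wav)            # e.g. "16a"
    echo "${spkr}_${utt} sph2pipe -f wav $path |"
  done
}

echo "../TIDIGITs/test/girl/ad/16a.wav" | make_scp
# -> ad_16a sph2pipe -f wav ../TIDIGITs/test/girl/ad/16a.wav |
```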
If there were multiple utterances per recording, a segments file would also be needed, mapping recording IDs and start/end times to utterance IDs
(see the official Kaldi documentation on this).
As there is not, no
segments file is created, and Kaldi defaults to utterance ID == recording ID.
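For reference, if a segments file were needed, each line maps an utterance to a time span within a recording: utterance ID, recording ID, then start and end times in seconds. A purely hypothetical line (these IDs are made up, not from TIDIGITS):

```
utt_001 rec_01 0.00 2.35
```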
Text Transcription file
The text transcription must be stored in a file, which the example calls text.
Each line is an utterance-id followed by a transcription of what is said.
ad_1oa 1 o
ad_1z1za 1 z 1 z
ad_1z6a 1 z 6
ad_23461a 2 3 4 6 1
Notice the utterance-ID format described above. Notice also, for later, that the transcription here is in word space, not phoneme space.
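Since the TIDIGITS file names encode the digit string, the transcription can be recovered mechanically from the utterance ID. A hedged sketch, assuming IDs of the form <speaker>_<digits><a|b> (with o and z among the "digits"), as in the examples above:

```shell
# Hypothetical sketch: turn an utterance ID into its word-level transcription
# by stripping the speaker prefix and the trailing production letter (a/b),
# then spacing out the digit characters.
id_to_text() {
  echo "$1" | sed -e 's/^[^_]*_//' -e 's/[ab]$//' -e 's/./& /g' -e 's/ $//'
}

id_to_text ad_1z6a   # -> 1 z 6
```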
Utterance to Speaker Mappings
This file maps each utterance id to a speaker id.
Each line has the form
<utterance id> <speaker-id>.
spk2utt is the opposite mapping, and can be generated with the script utils/utt2spk_to_spk2utt.pl.
Each line starts with a speaker ID, followed by every utterance ID that speaker spoke.
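The utt2spk mapping can be derived directly from the utterance IDs. A hedged sketch in AWK; the <speaker>_<rest> ID shape is an assumption based on the examples in this section:

```shell
# Hypothetical sketch: derive utt2spk lines from utterance IDs shaped like
# <speaker>_<rest> (e.g. ad_1z6a -> speaker ad).
make_utt2spk() {
  awk '{spk = $1; sub(/_.*/, "", spk); print $1, spk}'
}

printf 'ad_1z6a\nad_16a\n' | make_utt2spk
# -> ad_1z6a ad
#    ad_16a ad
```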
The feature extraction is carried out by the
run.sh script, rather than by the
local/tidigits_data_prep.sh script.
Extracting the MFCC Features
Mel-frequency cepstral coefficient (MFCC) features are extracted using the script steps/make_mfcc.sh.
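MFCC extraction reads its parameters from a small config file (conventionally conf/mfcc.conf). A hypothetical example; the values shown (energy off, 20 kHz sample rate to match the TIDIGITS audio) are assumptions, not taken from this document:

```
--use-energy=false
--sample-frequency=20000
```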
Compute Cepstral Mean and Variance Normalization statistics
Done using the script steps/compute_cmvn_stats.sh.
The data needs to be divided up so that we can run many jobs in parallel.
The data splitting is also carried out by the
steps/decode.sh script if it has not already been done, rather than by the
local/tidigits_data_prep.sh script. It can, however, be carried out at any time after the training and test directories are created and features extracted.
It can be done with the script
utils/split_data.sh <data-dir> <num-splits>
<data-dir> is the directory where the data is. In this case it would be each of the training and test data directories.
<num-splits> is the number of divisions of data needed; it should be the number of parallel jobs.
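Conceptually, the split just deals the utterance list out into <num-splits> pieces. A purely illustrative round-robin sketch (the real utils/split_data.sh is more careful, e.g. it keeps each speaker's utterances together, which this version does not):

```shell
# Hypothetical illustration only: deal input lines round-robin into n files
# named split_0.txt .. split_<n-1>.txt.
split_round_robin() {
  n=$1
  awk -v n="$n" '{ print $0 > ("split_" (NR % n) ".txt") }'
}

printf 'u1\nu2\nu3\nu4\n' | split_round_robin 2
# split_1.txt gets u1 and u3; split_0.txt gets u2 and u4.
```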