Some notes on Kaldi

Data Preparation

See the official Kaldi documentation on data preparation, which is the basis of much of this section.

These steps are carried out by the script local/tidigits_data_prep.sh. It takes one parameter – the path to the dataset.

Working through this section (and the next) should make clear just how valuable AWK and Bash (or equivalents) are for this task.

Locate the Dataset

On the SIP network, the TIDIGITS dataset can be found at /user/data14/res/speech_data/TIDIGITs/. Symlink it into a convenient location.

Split the Dataset into test and training

TIDIGITS is already split into test and training datasets. If it were not, you would need to do the split yourself. This could be done at any point during data preparation, depending on when other useful information (from the annotations) becomes available.

Parse its annotations

Annotations of the correct labels for each utterance need to be generated for the test and training directories.

Kaldi Script: .scp: Basically just a list of Utterances to Filenames

A Kaldi script file is just a mapping from recording IDs to extended filenames.

Line Format:

<recording_id> <extended_filename>

Recording ID

The recording ID is the first part of each line in a .scp file. If a speaker ID is available (which it is for TIDIGITS), it should form the first part of the recording ID. Kaldi requires this not for speaker identification (utt2spk is for that), but for the purposes of sorting during training.

The remainder of the recording ID is arbitrary, so long as it is unique. For convenience in generating a unique ID, the example script for TIDIGITS uses <speaker-id>_<transcription><sessionid>.

As there is only one Utterance per recording in TIDIGITS, the Recording ID is the Utterance ID. (See below)

Extended Filename

The second part of the line is the extended filename. Extended filename is the term Kaldi uses for a string that is either the path to a wav-format file, or a bash command that writes wav-format data to standard output, followed by a pipe symbol (|).

As the TIDIGITS data is in the SPHERE audio format, it needs to be converted to wav. The sample scripts in Kaldi use sph2pipe for this, so the lines of the .scp file look like the following (assuming sph2pipe is on your PATH; otherwise the path to the executable must be used):

ad_16a sph2pipe -f wav ../TIDIGITs/test/girl/ad/16a.wav |
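Lines of this form can be generated with find and AWK. The following is a minimal sketch, not the recipe's actual script: the function name make_wav_scp is hypothetical, and it assumes the directory layout .../<speaker-id>/<transcription><sessionid>.wav seen in the example above, paths without spaces, and sph2pipe on your PATH. The output is sorted under LC_ALL=C, since Kaldi expects its data files to be sorted that way.

```shell
#!/usr/bin/env bash
# Sketch (hypothetical helper): print one .scp line per SPHERE file under $1.
# Assumes paths of the form .../<speaker-id>/<transcription><sessionid>.wav
make_wav_scp() {
  local root="$1"
  find "$root" -type f -name '*.wav' |
    awk -F/ '{
      spk = $(NF-1)              # parent directory name is the speaker ID
      base = $NF                 # file name, e.g. 16a.wav
      sub(/\.wav$/, "", base)    # strip the extension
      # <recording_id> <extended_filename>, ending in a pipe symbol
      printf "%s_%s sph2pipe -f wav %s |\n", spk, base, $0
    }' |
    LC_ALL=C sort                # sort by recording ID in the C locale
}

# Example usage (path is illustrative):
#   make_wav_scp ../TIDIGITs/test > data/test/wav.scp
```

Because the speaker ID comes first in each recording ID, sorting the lines also groups them by speaker.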

Segmentation File segments

If there were multiple utterances per recording, there would need to be a segmentation file as well, mapping recording IDs and start and end times to utterance IDs (see the official Kaldi documentation on data preparation). As there are not, no segments file is created, and Kaldi defaults to utterance ID == recording ID.

Text Transcription file text

The text transcription must be stored in a file, which the example calls text. Each line is an utterance-id followed by a transcription of what is said. E.g.:

ad_1oa 1 o
ad_1z1za 1 z 1 z
ad_1z6a 1 z 6
ad_23461a 2 3 4 6 1

Notice the Utterance-ID format as described above. Notice also, for later, that the transcription here is in word space, not phoneme space.
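Because the utterance ID already encodes the transcription, the text file can be derived from the IDs alone. A minimal sketch follows, assuming (as in the examples above) that the ID is <speaker-id>_<transcription><sessionid>, that the speaker ID contains no underscore, and that the session ID is a single trailing character. The function name make_text is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch (hypothetical helper): read lines whose first field is an utterance ID
# (for example the lines of the .scp file) and print "<utterance-id> <words...>".
make_text() {
  awk '{
    utt = $1
    digits = utt
    sub(/^[^_]*_/, "", digits)                      # drop "<speaker-id>_"
    digits = substr(digits, 1, length(digits) - 1)  # drop the session character
    line = utt
    for (i = 1; i <= length(digits); i++)           # one word per character
      line = line " " substr(digits, i, 1)
    print line
  }'
}

# Example usage (paths are illustrative):
#   make_text < data/test/wav.scp > data/test/text
```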

Utterance to Speaker Mappings utt2spk

This file maps each utterance id to a speaker id. Each line has the form <utterance id> <speaker-id>.

spk2utt is the opposite mapping, and can be generated with the script utils/utt2spk_to_spk2utt.pl. Each line starts with a speaker ID, followed by every utterance ID that speaker spoke.
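Both mappings are easy to illustrate with AWK. The sketch below assumes the speaker ID is everything before the first underscore of the utterance ID. The function names are hypothetical, and the second function only mimics the output format described above; in a real recipe, use utils/utt2spk_to_spk2utt.pl.

```shell
#!/usr/bin/env bash
# Sketch (hypothetical helper): print "<utterance-id> <speaker-id>" for each
# input line, taking the utterance ID from the first field.
make_utt2spk() {
  awk '{
    utt = $1
    spk = utt
    sub(/_.*$/, "", spk)    # keep only the part before the first underscore
    print utt, spk
  }'
}

# Sketch (hypothetical helper): invert utt2spk into
# "<speaker-id> <utt1> <utt2> ...", keeping speakers in first-seen order.
utt2spk_to_spk2utt_sketch() {
  awk '{
    if (!($2 in utts)) order[++n] = $2   # remember each new speaker
    utts[$2] = utts[$2] " " $1           # append this utterance to its speaker
  } END {
    for (i = 1; i <= n; i++) print order[i] utts[order[i]]
  }'
}

# Example usage (paths are illustrative):
#   make_utt2spk < data/test/wav.scp > data/test/utt2spk
#   utt2spk_to_spk2utt_sketch < data/test/utt2spk > data/test/spk2utt
```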

Feature extraction

The feature extraction is carried out by the run.sh script, rather than by the local/tidigits_data_prep.sh script.

Extracting the MFCC Features

See the relevant section of the Kaldi tutorial.

Mel-frequency cepstral coefficient (MFCC) features are extracted using the script steps/make_mfcc.sh.

Compute Cepstral Mean and Variance Normalization statistics

This is done using the script steps/compute_cmvn_stats.sh.

Data Splitting

The data needs to be divided up so that many jobs can run in parallel. The steps/train_mono.sh and steps/decode.sh scripts carry out the splitting themselves if it has not already been done; the local/tidigits_data_prep.sh script does not do it. It can, however, be carried out at any time after the training and test directories are created and the features extracted.

It can be done with the script utils/split_data.sh. Usage:

utils/split_data.sh <data-dir> <num-splits>
  • <data-dir> is the directory where the data is. In this case the script would be run once for each of data/test and data/train.
  • <num-splits> is the number of divisions of the data needed. It should equal the number of parallel jobs.
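As a rough sketch, splitting both directories for the same number of jobs could be wrapped as below. The function name split_all and the SPLIT_CMD override are hypothetical conveniences (the override only exists so the loop can be tried without Kaldi); the underlying call follows the usage line above and assumes it is run from the recipe directory where utils/ is available.

```shell
#!/usr/bin/env bash
# Sketch (hypothetical helper): run the split script on several data
# directories with the same number of splits.
#   $1      number of splits (number of parallel jobs)
#   $2...   data directories, e.g. data/test data/train
split_all() {
  local nj="$1"; shift
  # SPLIT_CMD is an assumed override for dry runs, not a Kaldi variable.
  local split_cmd="${SPLIT_CMD:-utils/split_data.sh}"
  local dir
  for dir in "$@"; do
    "$split_cmd" "$dir" "$nj" || return 1   # stop at the first failure
  done
}

# Example usage:
#   split_all 4 data/test data/train
# Dry run that only prints the arguments of each call:
#   SPLIT_CMD=echo split_all 4 data/test data/train
```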