Evaluation – using the model to recognise speech.
Decoding
We already created a decoding graph in the training step.
Using this graph to decode the utterances is done using steps/decode.sh.
This script works only for certain feature types – conveniently, all the feature types we use in TIDIGITS are among them. (Similar decoding scripts also exist in steps for other feature types.)
Usage for steps/decode.sh
Usage:
steps/decode.sh [options] <graph-dir> <test-data-dir> <decode-dir>
- graph-dir is the path to the directory containing the graphs generated in the previous step.
- test-data-dir is the path to the test data directory prepared earlier.
- decode-dir is a path to store all of the outputs – including the results of the evaluations. It will be created if it does not exist.
Configuration / Options
The decode.sh script takes many configuration options; these should be familiar from the train_mono.sh script options above.
They can be set by passing them as flags to the script, like so: --<option-name> <value>.
Or by putting them all into a config bash script and adding the flag --config <path>.
They could also be set by editing the defaults in steps/decode.sh, but there is no good reason to do this.
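As a concrete sketch of the config-file style: the options are just bash variable assignments sourced by the script. The file name and the values below are illustrative only, not taken from the example.

```shell
# Hypothetical decode config: the option names are real decode.sh options,
# but the file name and the values here are only illustrative.
mkdir -p conf
cat > conf/decode_demo.conf <<'EOF'
beam=11.0
lattice_beam=5.0
nj=2
EOF
# It would then be used like this (not run here; requires a Kaldi setup):
# steps/decode.sh --config conf/decode_demo.conf exp/mono0a/graph data/test exp/mono0a/decode
```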
- nj: number of Jobs to run in parallel (default: 4).
- cmd: Job dispatcher script (default: run.pl).
- iter: iteration of the model to test. The training step above actually stores a copy of the model for each iteration; this option can be used to go back and test one of those (default: the final trained model). Overridden by the model option.
- model: which model to use, given by path. If given, this overrides iter (default: determined by the value of iter).
- transform-dir: directory path in which to find fMLLR transforms (not useful for TIDIGITS; only used if fMLLR transforms were applied to the features) (default: none).
- scoring-opts: options passed to local/score.sh. Can be used to set the minimum and maximum Language Model Weight for the rescoring to be done (default: "").
- num-threads: number of threads to use (default: 1).
- parallel-opts <opts>: option string to be supplied to the parallel executor (in our running-locally case, utils/run.pl) – e.g. '-pe smp 4' if you supply --num-threads 4.
- stage: used to allow you to skip some steps, as above. However, decode only has 2 stages: if stage is greater than 0, it will skip decoding and just do scoring (default: 0).
Options passed on to kaldi-trunk/src/gmmbin/gmm-latgen-faster:
- acwt: acoustic scale applied to the acoustic likelihoods in lattice generation (default: 0.083333). It affects the pruning of the lattice (paths with low enough likelihood will be pruned).
- max_active: maximum number of active states during decoding (default: 7000).
- beam: decoding beam (default: 13.0).
- lattice_beam: lattice generation beam (default: 6.0).
What is the parallelism of Jobs in the Decoding step
During decoding, the test set can be (and in the example is) split up (the actual splitting was done in the data preparation step), and each separate process (Job) decodes a different subset of the utterances into lattices (see below). When scoring happens (see below), all the different lattices are evaluated to get the transcriptions.
Lattices
From the Kaldi documentation: "A lattice is a representation of the alternative word-sequences that are 'sufficiently likely' for a particular utterance."
This blog post introduces lattices in Kaldi quite well, relating them to the other FSTs.
Kaldi creates and uses these lattices during the decoding step.
However, interpreting them can be hard, because all the command line programs for working with them use Kaldi's special table I/O; describing how this works in detail is beyond the scope of this introduction.
The command line programs in question can be found in /kaldi-trunk/src/latbin.
The lattices are output during decoding into <decode-dir>, as numbered gzipped files, e.g. lat.10.gz. The number corresponds to the Job number (because the data has been distributed to multiple processes). Each gzip contains a single binary file.
Each of these archives contains many lattices – one for each utterance.
Commands to work with them take the general form of:
<lattice-command> [options] "ark:gunzip -c <path to lat.N.gz>|" ark,t:<outputpath>
Each of the lattice commands takes the --help option, which will make it list its other options.
Example: Converting a Lattice to an FST Diagram
Consider the lattice gzipped at exp/mono0a/decode/lat.1.gz.
Running:
lattice-to-fst "ark:gunzip -c exp/mono0a/decode/lat.1.gz|" ark,t:1.fsts
utils/int2sym.pl -f 3 data/lang/words.txt 1.fsts > 1.txt.fsts
(Assuming that /kaldi-trunk/src/latbin
is in your path)
This will fill 1.fsts with a collection of text-form FSTs, one for each utterance.
Those with multiple terminal states have multiple different "reasonably likely" phrases possible.
The labels on the transitions are words (which we restored to word symbols using int2sym.pl).
The weights are the negative log likelihood of that transition (or of that final state),
as shown below:
ad_16a
0 1 1 3 14.7788
1 2 6 8 5.0416
2 2.61209
ad_174o2o8a
0 1 1 3 12.5118
0 11 o 2 9.44585
1 2 7 9 9.34774
1 16 o 2 6.57278
2 3 4 6 2.08985
3 4 o 2 10.2191
4 5 2 4 4.91992
4 9 o 2 3.20784
5 6 o 2 3.84306
6 7 o 2 3.90951
6 13 8 10 7.07031
7 8 8 10 6.74935
7 14 o 2 3.79537
8 2.61209
9 10 2 4 5.3914
10 6 o 2 3.84306
11 12 1 3 4.75861
12 2 7 9 9.34774
13 2.61209
14 15 8 10 6.63099
15 2.61209
16 17 7 9 6.38392
17 3 4 6 2.08985
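The text format above has two kinds of lines: five-field arc lines (source state, destination state, input label, output label, weight) and two-field final-state lines (state, final weight). A minimal, self-contained sketch that pretty-prints a fragment of the listing above (the file name demo.fst is made up):

```shell
# Write a fragment of the text-form FST above to a scratch file.
cat > demo.fst <<'EOF'
0 1 1 3 14.7788
1 2 6 8 5.0416
2 2.61209
EOF
# Five fields: an arc (src dst ilabel olabel weight).
# Two fields: a final state and its weight.
awk 'NF==5 { printf "%s -> %s : in=%s (cost %s)\n", $1, $2, $3, $5 }
     NF==2 { printf "final state %s (cost %s)\n", $1, $2 }' demo.fst
```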
Then we grab one particular FST out of the collection (in this case just using awk to grab some lines – more sophisticated approaches exist), compile it, project it along the input labels only (because they are the words it will guess at), minimise the number of states to get a simpler but equivalent model (easier to read), and finally draw it as an FSA.
cat 1.txt.fsts | awk "6<NR && NR<30" |\
fstcompile --isymbols=data/lang/words.txt --keep_isymbols| \
fstproject | fstminimize| \
fstdraw --portrait --acceptor | dot -Tsvg > 1.2.svg
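The awk "6<NR && NR<30" filter in the first line simply selects a window of lines by line number (NR) – here, lines 7 to 29, which happen to hold the FST for the utterance we want. A tiny demonstration of the idiom:

```shell
# Select lines 3..5 of a ten-line input by line number (NR).
seq 1 10 | awk '2<NR && NR<6'
```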
The result of this is an FSA that will accept (/generate) the likely matches for the utterance ad_174o2o8a.
The utterance actually said “174o2o8”, which is accepted by the path through states “0,2,4,5,6,8,9,12”
Note that when the confidence in a path being correct is very high, no weight is shown.
Notice that the lattice has a lot of paths allowing ‘o’ to be followed by another ‘o’.
Drawing Phone Lattices
Much like we can draw lattices at the word level, we can go down to draw them at the phone level.
lattice-to-phone-lattice exp/mono0a/final.mdl "ark:gunzip -c exp/mono0a/decode/lat.1.gz|" ark,t:1.ph.lats
lattice-copy --write-compact=false ark:1.ph.lats ark,t:1.ph.fsts
utils/int2sym.pl -f 4 data/lang/phones.txt 1.ph.fsts > 1.ph.txt.wfsts
cat 1.ph.txt.wfsts | awk 'BEGIN{FS = " "}{ if (NF>=4) {print $1," ", $2," ",$3," ",$4;} else {print $1;};}' > 1.ph.txt.fsts
Notice that in the first step the model (final.mdl) was also used.
The output of the first step is in the compact lattice form, which is not amenable to being worked with by scripts like int2sym.pl.
The second step expands it, making it an FST.
The third step substitutes the phone symbols into the output. It is perhaps worth looking at 1.ph.txt.wfsts; notice that the weights appear only on the word-start phones. It is, however, hard to read, as it has hundreds of epsilon (empty string) states.
The fourth step strips the weights out. With that done, we now have something that looks like a collection of text FSTs, though it is still very much filled with epsilon states.
Now to draw it up, capturing the utterance ad_174o2o8a again:
cat 1.ph.txt.fsts| awk "199<NR && NR<1246" | \
fstcompile --osymbols=data/lang/phones.txt --keep_osymbols | \
fstproject --project_output | \
fstrmepsilon | fstdeterminize | fstminimize | \
fstdraw --portrait --acceptor | dot -Tsvg > 1.ph.svg
So the steps are, again: grabbing the lines we want, compiling, projecting (this time onto the output space), removing epsilons (matchers for empty strings), determinising, and minimising to make it more readable, then drawing it.
Scoring
Viewing Results
As the final step of steps/decode.sh, the results are recorded.
They can be found in <decode-dir>, under filenames of the form wer_<N>, where N is the Language Model Scale.
Example:
compute-wer --text --mode=present ark:data/test/text ark,p:-
%WER 1.63 [ 670 / 41220, 420 ins, 111 del, 139 sub ]
%SER 4.70 [ 590 / 12547 ]
Scored 12547 sentences, 0 not present in hyp.
The Wikipedia entry on Word Error Rate (WER) is a reasonable introduction if you are not familiar with it.
The Sentence Error Rate (SER) is actually the utterance error rate: of all the utterances in the test set, it is the proportion that contained at least one error. Both error rates only consider the most likely hypothesis in the lattice.
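The numbers in the report above fit together: WER is (insertions + deletions + substitutions) divided by the number of words in the reference transcript. Re-deriving the 1.63% figure from the counts shown:

```shell
# %WER = 100 * (ins + del + sub) / reference word count.
# Using the counts from the report above: 420 + 111 + 139 = 670 errors over 41220 words.
awk 'BEGIN { printf "%.2f\n", 100 * (420 + 111 + 139) / 41220 }'
```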
utils/best_wer.pl will take as input any number of the wer_<N> files and will output the best WER from amongst them.
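The same selection can be approximated with only standard tools. A sketch using made-up files (the demo_decode/ directory and the wer_N contents below are fabricated for illustration):

```shell
# Fabricated wer_N files standing in for real decode output.
mkdir -p demo_decode
echo '%WER 2.10 [ 865 / 41220, ... ]' > demo_decode/wer_10
echo '%WER 1.63 [ 670 / 41220, ... ]' > demo_decode/wer_12
# Sort the %WER lines numerically on the percentage and keep the lowest.
grep -H '%WER' demo_decode/wer_* | sort -t' ' -k2,2n | head -n 1
```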
How scoring is done
Scoring is done by local/score.sh.
This script takes --min-lmwt and --max-lmwt options, for the minimum and maximum language model weight.
It outputs the wer_<N> files, one for each of these different weights.
The language model weight is the trade-off (vs. the acoustic model weight) as to which is more important: matching the language model, or matching the sounds.
The scoring script works by opening all the lattice files and getting them to output a transcription of the best guess at the words in all of the utterances they contain. The best guess is made with <kaldi-trunk>/src/latbin/lattice-best-path; the language model weight is passed to it as --lm-scale.
The WER is calculated using <kaldi-trunk>/src/bin/compute-wer,
which takes two transcription files – the best-guess output from the previous step, and the correct labels – and outputs the error rates.