Preface
The utils and steps folders contain shared scripts used by the common pipeline; everything specific to this recipe lives under local/.
Dataset overview
AISHELL-2 is by far the largest free speech corpus available for Mandarin ASR research.
1. DATA
Training data
- 1000 hours of speech data (around 1 million utterances)
- 1991 speakers (845 male and 1146 female)
- clean recording environment (studio or quiet living room)
- read speech
- reading prompts covering various domains: entertainment, finance, technology, sports, control commands, points of interest, etc.
- near field recording via 3 parallel channels (iOS, Android, Microphone).
- iOS data is free for non-commercial research and education use (e.g. universities and non-commercial institutes)
Evaluation data:
Currently we release AISHELL2-2018A-EVAL, containing:
- dev: 2500 utterances from 5 speakers
- test: 5000 utterances from 10 speakers
Both sets are available across the three channel conditions.
2. RECIPE
Based on Kaldi standard system, AISHELL-2 provides a self-contained Mandarin ASR recipe, with:
- a word segmentation module, which is a must-have component for Chinese ASR systems
- an open-sourced Mandarin lexicon (DaCiDian, open-sourced on GitHub)
- a simplified GMM training & alignment-generation recipe (stopping at the speaker-independent stage)
- an LF-MMI TDNN training and decoding recipe
Script outline
- run.sh
  - local/prepare_all.sh
  - local/run_gmm.sh
  - local/chain/run_tdnn.sh
run.sh walkthrough
1. Data preparation
local/prepare_all.sh ${trn_set} ${dev_set} ${tst_set} || exit 1;
input: trn_set / dev_set / tst_set
The three arguments are the paths of the training, development, and test sets; this step runs data preparation for each of them.
2. GMM-HMM training
local/run_gmm.sh --nj $nj --stage $gmm_stage
3. Chain model training
local/chain/run_tdnn.sh --nj $nj
4. Show decoding results
local/show_results.sh
The AISHELL-2 run.sh is styled differently from the WSJ recipe: it exposes only the three top-level stages:
- data preparation (lexicon, language model, data directories, etc.)
- training basic GMM-HMM models to obtain initial alignments
- training an LF-MMI TDNN model on top of the GMM-HMM alignments
In my opinion this layout is clearer and easier to follow; the scripts are analyzed one by one below.
prepare_all.sh walkthrough
1. Lexicon preparation
Download the raw DaCiDian data from GitHub and convert it into Kaldi's lexicon format.
local/prepare_dict.sh data/local/dict || exit 1
- input
  - download_dir = data/local/DaCiDian (downloaded from GitHub)
- output
  - dir = data/local/dict
    - dir/lexicon.txt
    - dir/nonsilence_phones.txt
    - dir/optional_silence.txt
    - dir/extra_questions.txt
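For orientation, each line of the generated lexicon.txt maps a word to a phone sequence derived from DaCiDian's pinyin syllables. The entries below are illustrative only (hypothetical words and phone symbols, not taken from the actual output):
head -n 3 data/local/dict/lexicon.txt
# hypothetical output: <word> <phone1> <phone2> ...
# 你好 n i3 h ao3
# 北京 b ei3 j ing1
# <UNK> spn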
2. Generate wav.scp, text (word-segmented), utt2spk, spk2utt
local/prepare_data.sh ${trn_set} data/local/dict data/local/train data/train || exit 1;
local/prepare_data.sh ${dev_set} data/local/dict data/local/dev data/dev || exit 1;
local/prepare_data.sh ${tst_set} data/local/dict data/local/test data/test || exit 1;
(The training set is used as the example below.)
- input
  - trn_set (path of the training set)
  - data/local/dict (Kaldi-format lexicon directory generated in step 1)
- output
  - data/local/train (tmp-dir)
  - data/train (output-dir)
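As a reminder, the four generated files follow the standard Kaldi data-directory conventions; the utterance and speaker IDs below are made up for illustration:
data/train/wav.scp   # <utt-id> <wav path or pipe command>,  e.g.  IS001_C0001_W0001 /path/to/IS001_C0001_W0001.wav
data/train/text      # <utt-id> <word-segmented transcript>, e.g.  IS001_C0001_W0001 今天 天气 不错
data/train/utt2spk   # <utt-id> <spk-id>,                    e.g.  IS001_C0001_W0001 IS001_C0001
data/train/spk2utt   # <spk-id> <utt-id1> <utt-id2> ...,     e.g.  IS001_C0001 IS001_C0001_W0001 IS001_C0001_W0002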
3. Generate the lexicon FST (L.fst)
utils/prepare_lang.sh --position-dependent-phones false \
data/local/dict "<UNK>" data/local/lang data/lang || exit 1;
- input
  - data/local/dict (Kaldi-format lexicon directory generated in step 1)
  - "<UNK>" (the out-of-vocabulary word)
- output
  - data/local/lang (temporary files)
  - data/lang (target directory: L.fst, phones.txt, words.txt, topo, phones/, ...)
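A quick sanity check of the generated lang directory can be done with the standard Kaldi utility:
utils/validate_lang.pl data/lang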
4. Language model preparation
local/train_lms.sh \
data/local/dict/lexicon.txt data/local/train/text data/local/lm || exit 1;
- input
  - data/local/dict/lexicon.txt (lexicon generated in step 1)
  - data/local/train/text (transcripts generated in step 2)
- output
  - data/local/lm (the trained LM)
  - data/local/lm/3gram-mincount/lm_unpruned.gz
- extras
  - an additional LM built with SRILM; output: data/local/lm/srilm
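If SRILM is available, the resulting ARPA LM can be sanity-checked by measuring perplexity on the dev transcripts. The ID-stripping step below is my own addition for illustration, not part of the recipe:
# drop utterance IDs, keep only the word-segmented transcripts
cut -d' ' -f2- data/local/dev/text > dev_text_noids
ngram -order 3 -unk -lm data/local/lm/3gram-mincount/lm_unpruned.gz -ppl dev_text_noids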
5. Generate the grammar FST (G.fst)
The main goal of this step is to turn the language model into G.fst, so that it can later be composed with L.fst and the FST machinery can be exploited.
utils/format_lm.sh data/lang data/local/lm/3gram-mincount/lm_unpruned.gz \
data/local/dict/lexicon.txt data/lang_test || exit 1;
- input
  - data/lang (L.fst generated in step 3)
  - data/local/lm/3gram-mincount/lm_unpruned.gz (LM generated in step 4)
  - data/local/dict/lexicon.txt (lexicon generated in step 1)
- output
  - data/lang_test (directory containing G.fst)
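Internally, format_lm.sh converts the ARPA LM with arpa2fst and copies the rest of data/lang into data/lang_test. The resulting G.fst can be inspected with the OpenFst tools:
fstinfo data/lang_test/G.fst | head
# print a few arcs with word labels on both tapes
fstprint --isymbols=data/lang_test/words.txt --osymbols=data/lang_test/words.txt data/lang_test/G.fst | head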
local/run_gmm.sh walkthrough
1. Feature extraction: MFCC & CMVN
mfccdir should be some place with a largish disk where you want to store MFCC features.
1.1 MFCC (+pitch) feature extraction
Combine MFCC and pitch features together
Note: This file is based on make_mfcc.sh and make_pitch_kaldi.sh
steps/make_mfcc_pitch.sh --pitch-config conf/pitch.conf --cmd "$train_cmd" --nj $nj \
data/$x exp/make_mfcc/$x mfcc || exit 1;
- input
  - data/train (dev, test) (data directories generated in data-preparation step 2)
- output
  - exp/make_mfcc/train (dev, test) (log files)
  - mfcc (the mfcc_pitch_dir given on the command line; if omitted, features go to data/train (dev, test)/data)
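The feature dimension can be verified with feat-to-dim; with the default configs MFCC+pitch is typically 16-dimensional (13 MFCCs + 3 pitch features), but the exact value depends on conf/mfcc.conf and conf/pitch.conf:
feat-to-dim scp:data/train/feats.scp -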
1.2 CMVN statistics
steps/compute_cmvn_stats.sh data/$x exp/make_mfcc/$x mfcc || exit 1;
- input
  - data/train (dev, test) (data directories generated in data-preparation step 2)
- output
  - exp/make_mfcc/train (dev, test) (log files)
  - mfcc (the same mfcc_pitch_dir as above)
Question: are these stored in the same folder as the MFCCs? Yes — compute_cmvn_stats.sh writes per-speaker cmvn_*.ark/scp files next to the MFCC archives and adds cmvn.scp to the data directory.
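One way to confirm where the statistics live and that they are stored per speaker (copy-matrix and the scp rspecifier syntax are standard Kaldi; the head trick just grabs the first speaker):
head -n 2 data/train/cmvn.scp                       # <spk-id> <archive:offset>
copy-matrix "scp:head -n 1 data/train/cmvn.scp |" ark,t:-   # dump that speaker's stats as text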
1.3 Data checking
This script makes sure that only the segments present in all of feats.scp, wav.scp [if present], segments [if present], text, and utt2spk are kept in any of them. It puts the original contents of data-dir into data-dir/.backup.
utils/fix_data_dir.sh data/$x
- input
  - data/train (dev, test) (data directories generated in data-preparation step 2)
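After the fix, the directory can be re-checked with the standard validator (it verifies feats.scp, cmvn.scp, utt2spk/spk2utt consistency, sorting, etc.):
utils/validate_data_dir.sh data/$x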
2. Subset the training data
subset the training data for fast startup
utils/subset_data_dir.sh data/train ${x}000 data/train_${x}k
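In run_gmm.sh this call sits inside a loop over subset sizes; the sketch below only shows the shape of that loop (the two sizes are the ones referenced later; the actual recipe may create more):
for x in 100 300; do
  utils/subset_data_dir.sh data/train ${x}000 data/train_${x}k
done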
3. Monophone training
3.1 Training
The main outputs are final.mdl and tree. The core training loop iterates: align, accumulate GMM/HMM statistics, update parameters.
steps/train_mono.sh --cmd "$train_cmd" --nj $nj \
data/train_100k data/lang exp/mono || exit 1;
- input
  - data/train_100k (subset of the training data created in step 2)
  - data/lang (L.fst, from data-preparation step 3)
- output
  - exp/mono
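Conceptually, one training iteration inside steps/train_mono.sh looks roughly like the following single-job sketch (heavily simplified; the real script also handles graph compilation, parallel jobs, and realignment schedules; $iter, $numgauss and $feats stand for the script's loop variables and its feature rspecifier built with apply-cmvn and add-deltas):
# 1) align the features against the current model (training graphs come from compile-train-graphs)
gmm-align-compiled exp/mono/$iter.mdl "ark:gunzip -c exp/mono/fsts.1.gz|" "$feats" "ark:|gzip -c > exp/mono/ali.1.gz"
# 2) accumulate GMM/HMM statistics from those alignments
gmm-acc-stats-ali exp/mono/$iter.mdl "$feats" "ark:gunzip -c exp/mono/ali.1.gz|" exp/mono/$iter.1.acc
# 3) re-estimate the model, gradually increasing the number of Gaussians
gmm-est --mix-up=$numgauss exp/mono/$iter.mdl "gmm-sum-accs - exp/mono/$iter.*.acc|" exp/mono/$[$iter+1].mdl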
3.2 Decoding
Use the freshly trained model to decode the dev/test sets and compute error rates and related statistics.
3.2.1 Build the decoding graph
Build a fully expanded decoding graph (HCLG.fst) that encodes the language model, lexicon, context dependency, and HMM structure.
The original comment in the script reads:
This script creates a fully expanded decoding graph (HCLG) that represents all the language-model, pronunciation dictionary (lexicon), context-dependency, and HMM structure in our model. The output is a Finite State Transducer that has word-ids on the output, and pdf-ids on the input (these are indexes that resolve to Gaussian Mixture Models).
See
http://kaldi-asr.org/doc/graph_recipe_test.html
(this is compiled from this repository using Doxygen,
the source for this part is in src/doc/graph_recipe_test.dox)
utils/mkgraph.sh data/lang_test exp/mono exp/mono/graph || exit 1;
- input
  - data/lang_test (lang-dir)
  - exp/mono (model-dir: tree & final.mdl)
- output
  - exp/mono/graph (graph-dir)
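The graph directory ends up containing HCLG.fst plus words.txt and a few auxiliary files; conceptually HCLG = min(det(H ∘ C ∘ L ∘ G)). Its size can be checked with:
fstinfo exp/mono/graph/HCLG.fst | grep -E "of states|of arcs"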
3.2.2 Decode the dev and test sets
Decoding calls gmm-latgen-faster (or gmm-latgen-faster-parallel) and produces the lattices lat.JOB.gz.
# for dev
steps/decode.sh --cmd "$decode_cmd" --config conf/decode.conf --nj ${dev_nj} \
  exp/mono/graph data/dev exp/mono/decode_dev
# for test
steps/decode.sh --cmd "$decode_cmd" --config conf/decode.conf --nj ${test_nj} \
  exp/mono/graph data/test exp/mono/decode_test
Taking the test set as an example:
- input
  - exp/mono/graph (graph-dir)
  - data/test (data-dir)
- output
  - exp/mono/decode_test (decode-dir)
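Once scoring has run, the best operating point over language-model weights can be picked out of the per-LMWT result files; whether they are named wer_* or cer_* depends on the scoring script this recipe uses:
grep WER exp/mono/decode_test/wer_* 2>/dev/null | utils/best_wer.sh
# Mandarin recipes usually also score at the character level:
grep WER exp/mono/decode_test/cer_* 2>/dev/null | utils/best_wer.sh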
3.3 Alignment
Use the trained model to force-align the data so the alignments can be reused by later stages.
steps/align_si.sh --cmd "$train_cmd" --nj $nj \
data/train_300k data/lang exp/mono exp/mono_ali || exit 1;
- input
  - data/train_300k (data-dir)
  - data/lang (lang-dir)
  - exp/mono (src-dir)
- output
  - exp/mono_ali (ali-dir)
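The ali.*.gz files store per-frame transition-ids; they can be made human-readable with standard Kaldi tools, for example:
show-alignments data/lang/phones.txt exp/mono_ali/final.mdl "ark:gunzip -c exp/mono_ali/ali.1.gz|" | head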
4. Triphone training
4.1 Training
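The triphone stages follow the same train → decode → align pattern as the monophone stage, only using steps/train_deltas.sh (and, further on, LDA+MLLT variants) instead of steps/train_mono.sh. A typical call looks like the sketch below; the leaf/Gaussian counts and directory names are illustrative, not necessarily the ones used in the AISHELL-2 recipe:
steps/train_deltas.sh --cmd "$train_cmd" 2500 20000 \
  data/train_300k data/lang exp/mono_ali exp/tri1 || exit 1;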
4.2 Decoding
4.3 Alignment
local/chain/run_tdnn.sh walkthrough
I have not yet fully worked through Kaldi's chain models; this section will be filled in later.