This post is called an "attempt" because I had never used a Linux system before, let alone shell scripts, so the write-up is rather scattered; I will reorganize it into a clearer post later. For obtaining the TIMIT data, see the resource posts on CSDN. The goal here is to run the aishell recipe on the TIMIT dataset.
References
1. kaldi基础介绍(一)在说话人识别中的数据准备 - monsieurliaxiamen的博客 - CSDN博客
2. kaldi中改写sre10/v1用timit dataset做说话人识别总结 - zjm750617105的专栏 - CSDN博客
3. kaldi下清华语音数据集的说话人测试脚本编写 - 破晓的专栏 - CSDN博客
4. Voiceprint recognition in kaldi - Programmer Sought
5. kaldi中的声纹识别 - yutouwd的博客 - CSDN博客
6. 【数据预处理】TIMIT语料库WAV文件转换 - JJJanepp - 博客园
7. 利用kaldi提取mfcc特征 - 长虹剑的专栏 - CSDN博客
8. 对TIMIT数据进行格式转换(SPHERE2WAV(RIFF)) - MengWang - 博客园
Kaldi directory structure
egs: the example recipes, all written as scripts and named after the corpus they use. In the next level of directories, names starting with "s" are speech recognition recipes and names starting with "v" are speaker (voiceprint) recognition recipes; v1 is usually the i-vector-based speaker recognition recipe.
src: Kaldi's C++ source code.
tools: the libraries Kaldi depends on plus some useful scripts.
windows: tools and configuration files needed to build under Windows.
TIMIT format conversion
TIMIT's .wav files are not real WAV files (they are NIST SPHERE files) and need to be converted with Kaldi's sph2pipe tool.
For example, take the file test/dr1/mdab0_si1039.wav from TIMIT and run
fname=test_dr1_mdab0_si1039.wav # the file has been renamed
sph2pipe -f wav $fname >file.wav # convert the format first
This gives you a genuine WAV file that will actually play in a music player. (If your files are already real WAV files, skip this step.)
How do you do this in batch? Reference 8 (对TIMIT数据进行格式转换(SPHERE2WAV(RIFF)) - MengWang - 博客园) is particularly useful!
First, change into the directory containing the sph2pipe tool (the SPHERE audio conversion tool provided by the LDC):
cd '/home/dream/Research/kaldi-master/tools/sph2pipe_v2.5'
Then test the conversion of a single audio file on the command line:
./sph2pipe -f wav ./wav_test/SA1.WAV ./wav_test/SA1_tr.WAV
Note that the sph2pipe executable is not on the PATH, so it has to be invoked with an explicit path, i.e. ./sph2pipe rather than sph2pipe.
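For batch conversion, a minimal sketch in the spirit of reference 8 (the TIMIT root, the output directory and the sph2pipe path below are assumptions; adjust them to your layout):

```bash
#!/bin/bash
# Walk the TIMIT tree and write real RIFF wav files with the same relative
# paths under OUT_DIR. All three paths are assumptions.
TIMIT_ROOT=/home/dream/Research/timit
OUT_DIR=/home/dream/Research/timit_riff
SPH2PIPE=/home/dream/Research/kaldi-master/tools/sph2pipe_v2.5/sph2pipe

find "$TIMIT_ROOT" -iname '*.wav' | while read -r sph; do
  rel=${sph#$TIMIT_ROOT/}                    # path relative to the TIMIT root
  mkdir -p "$OUT_DIR/$(dirname "$rel")"
  "$SPH2PIPE" -f wav "$sph" "$OUT_DIR/$rel"  # SPHERE -> RIFF wav
done
```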
Understanding the dataset splits
Training set (train): used to train the model.
Development set (dev): used to tune the model's hyperparameters and to compare algorithms to see which works better.
Test set (test): used to give a fair evaluation of the final classifier's performance.
Clearly TIMIT does not ship with a dev split. You can change that yourself; it only takes some file-list manipulation, and the idea is always the same: gather all the wav files together and re-split them according to your needs.
Splitting the data
- If you only have on the order of 100, 1,000 or 10,000 samples, a 70% training / 30% test split is common; 60% training, 20% dev and 20% test is also reasonable.
- If the data is large, on the order of millions, the dev and test sets should be well below 20% and 10% of the total.
- For research you will normally use standard datasets, so the split is not something you need to worry about.
- For a first attempt there is no need to split the data particularly strictly.
Data preparation in Kaldi
In the Kaldi speaker recognition examples (egs/sre10, egs/sre16) there are two broad kinds of data: the training data and the evaluation data. The evaluation data in turn splits into an enrollment set and a test set. So the prepared wav folder needs three subfolders: train, dev and test.
Preparing the training set: spk2utt, utt2spk and wav.scp
All of these can be generated by existing scripts; you do not have to write your own.
- spk2utt maps a speaker id (spkid) to the names of that speaker's utterances (uttid). A speaker usually has many utterances, so the format is <spkid> <uttid1> <uttid2> ..., with exactly one speaker id per line. The uttids within a line, and the spkids across lines, must be ordered the way the sort command orders them; otherwise the Kaldi scripts fail in validate_data_dir.sh. (Example contents follow this list.)
- utt2spk maps each single utterance name (uttid) to its speaker, so every line is a one-to-one pair. utt2spk can be produced from spk2utt by a Kaldi script, or by a script of your own.
spk2utt can be obtained from utt2spk with Kaldi's own command utils/utt2spk_to_spk2utt.pl utt2spk >spk2utt; apparently the file may need to be put under the utils directory (?).
- wav.scp maps each uttid to the full path of its audio, again one recording per line. Depending on the audio format of the corpus, a conversion command may have to be added: if the original audio is already WAV, each line is simply "uttid path";
if the format is sph or flac, a format-conversion command line has to be included (see the example after this list).
If you want to train gender-dependent models you also need a spk2gender text file (not used here).
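For reference, the three files look roughly like this (the speaker/utterance ids and paths are made-up TIMIT-style examples; the trailing-pipe sph2pipe form is the usual Kaldi convention for SPHERE audio):

```
# spk2utt: one speaker per line, followed by all of its utterance ids (sorted)
FADG0 FADG0_SA1 FADG0_SA2 FADG0_SI649

# utt2spk: one utterance per line, followed by its speaker id
FADG0_SA1 FADG0
FADG0_SA2 FADG0

# wav.scp when the audio is already RIFF wav: <uttid> <path>
FADG0_SA1 /path/to/timit/test/dr4/fadg0/sa1.wav

# wav.scp when the audio is SPHERE: convert on the fly through sph2pipe
FADG0_SA1 /path/to/sph2pipe -f wav -p -c 1 /path/to/timit/test/dr4/fadg0/sa1.wav |
```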
Enrollment set
To evaluate speaker recognition we first have to enroll a set of speakers; since we are enrolling voiceprints, each speaker needs at least one utterance for enrollment. For the test set we additionally need, for each trial, an enrolled speaker id, an utterance id and a label indicating whether they belong to the same person.
The enrollment set, like the training set, consists of spk2utt, utt2spk and wav.scp. The contents of these text files follow the same conventions as for the training set, so they are not repeated here.
Test set
The test set consists of four files: spk2utt, utt2spk, wav.scp and trials.
Each line of the trials file contains an utterance id, a speaker id and a label (target/nontarget), for example:
FADG0_SI649.WAV FADG0 target
FADG0_SI649.WAV FAKS0 nontarget
FADG0_SI649.WAV FASW0 nontarget
That is the data preparation format for Kaldi speaker recognition. When writing the preparation scripts by hand, there are two problems you will run into. One is sorting, mainly in spk2utt and utt2spk: generate these files in the proper order from the start, and rename the uttids where necessary, so that they pass the validate-data-dir check (there is also an automatic sort-and-fix script). The other is format conversion: the sph and flac formats encountered so far need different tools (sph2pipe and sox respectively).
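If the sorting check complains, the usual remedy is the fix-up script; a hedged sketch, run from the recipe's top directory so that utils/ is on the path:

```bash
# Sort and repair the data directory in place, then re-check it.
# --no-text and --no-feats skip the transcript and feature checks, which do not
# exist yet at this stage of speaker-recognition data preparation.
utils/fix_data_dir.sh data/train
utils/validate_data_dir.sh --no-text --no-feats data/train
```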
You only need to prepare three data directories: train, enroll and test. In the SRE recipes a separate SRE dataset is used to train the PLDA model because most of the training data is out of domain. In your case you would train the PLDA model directly on the training data, since it is in domain. The train set should contain a group of speakers that does not overlap with the enroll and test data, and the utterances of the remaining speakers should be divided between enroll and test. Although enroll and test share speakers, they should not contain utterances from the same recordings.
What the official forum says about TIMIT
Following Kaldi's official GitHub I found the corresponding Google group, kaldi-help; relevant excerpts:
I'm not familiar with using TIMIT for speaker recognition, so I'm not sure how the evaluation is set up. It sounds like you might have only evaluation data and nothing to train your models with. Hopefully someone who has used TIMIT for this purpose can comment more. If you don't have any training data, you could try using the Librispeech corpus (look at the recipe in egs/ for more info).
You need at least the following datasets:
Training data. This is used to train the UBM, i-vector extractor and PLDA model. It should be non-overlapping with the other datasets. In the sre10 recipe, it corresponds to the "train" and "sre" data. The "sre" data is just a subset of "train" used to train the PLDA model, but it doesn't have to be that way in general.
Enrollment data. This is a subset of the evaluation data in which you know the identity of the speaker in the recording. Using the models created in the previous step, i-vectors are generated from this data. If you have multiple enrollment recordings per speaker, you might average their i-vectors to get speaker-level representations. In the sre10 recipe, this dataset is called "sre10_train."
Test data. This is also part of the evaluation data, and consists of recordings for which you don't know the identity of the speaker. These are compared (using the PLDA model or cosine distance) with the i-vectors created from the enrollment data. This dataset is called "sre10_test" in the recipe. The set of comparisons is defined by the "trials" file.
The aishell pipeline
The run.sh in egs/aishell/v1 contains the entire voiceprint recognition pipeline. It is best to copy the commands from run.sh into another script and run them one at a time, so that errors can be found and fixed promptly.
1) Data preparation.
2) Extract MFCC features, run endpoint detection (VAD), and check and sort out files that do not meet the requirements (a hedged command sketch for steps 2-4 follows step 6).
3) Train the UBM and the i-vector extractor. Note that the script that trains the i-vector extractor runs jobs in parallel by default, which takes a lot of memory and can cause it to run out: it effectively runs nj * num_threads * num_processes jobs at once, and these need to be reduced in train_ivector_extractor.sh. With 16 GB of memory I had to set all three parameters to 2 before it would run. There are also two hyperparameters worth changing, the UBM size and the i-vector dimension. The UBM size is changed directly in run.sh: the parameter after data/train in the call to train_diag_ubm.sh is the number of Gaussians in the UBM, 1024 by default. To change the i-vector dimension, modify ivector_dim in train_ivector_extractor.sh, 400 by default.
4) Extract i-vectors for the training set, and train the PLDA model used for scoring on those i-vectors.
5) After that the test set is split into an enrollment set and an evaluation set. This is mainly done by the script local/split_data_enroll_eval.py, which first stores each spk and its corresponding utts in dictutt, then randomly shuffles each speaker's utterance order and redistributes the utterances into enroll (the enrollment set) and eval (the evaluation set). In the penultimate line of the program you can see if(i<3): the utt is written to enroll, otherwise to eval, so the relative sizes of the enrollment and evaluation sets can be changed by changing this value.
6) After utt2spk has been re-created, the trials have to be generated, which is done by local/product_trials.py. The trials are the list of enrolled speakers and the different utterances that need to be scored against them; the format is the same as the trials example shown earlier.
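For orientation, here is a hedged sketch of what steps 2-4 look like as commands. The script names exist in the recipe, but the exact option values, directory names and $train_cmd (defined in cmd.sh) should be checked against your own egs/aishell/v1/run.sh:

```bash
# 2) MFCC features + energy-based VAD, then sort/repair the data directory
steps/make_mfcc.sh --cmd "$train_cmd" --nj 4 data/train exp/make_mfcc/train mfcc
sid/compute_vad_decision.sh --cmd "$train_cmd" --nj 4 data/train exp/make_vad/train vad
utils/fix_data_dir.sh data/train

# 3) UBM and i-vector extractor; the positional "1024" is the UBM size, and
#    nj * num_threads * num_processes parallel jobs are what eat the memory
sid/train_diag_ubm.sh --cmd "$train_cmd" --nj 4 data/train 1024 exp/diag_ubm_1024
sid/train_full_ubm.sh --cmd "$train_cmd" --nj 4 data/train exp/diag_ubm_1024 exp/full_ubm_1024
sid/train_ivector_extractor.sh --cmd "$train_cmd" \
  --nj 2 --num-threads 2 --num-processes 2 --ivector-dim 400 \
  exp/full_ubm_1024/final.ubm data/train exp/extractor_1024

# 4) i-vectors for the training set, then a PLDA model trained on them
sid/extract_ivectors.sh --cmd "$train_cmd" --nj 4 exp/extractor_1024 data/train exp/ivectors_train
ivector-compute-plda ark:data/train/spk2utt \
  "ark:ivector-normalize-length scp:exp/ivectors_train/ivector.scp ark:- |" \
  exp/ivectors_train/plda
```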
TIMIT+Aishell
Let us first look at how the AISHELL and TIMIT corpora are partitioned. AISHELL has 400 speakers in total, split by default into train, dev and test: 340 speakers in train, 40 in dev and 20 in test. The recipe uses train as the training set and test as the test set; dev is not used. Each AISHELL speaker has about 300 utterances, each a single sentence of roughly 2-6 s. TIMIT has 630 speakers, split into train and test: 462 speakers in the training set and 168 in the test set, each speaker with 10 utterances of roughly 2-4 s each. Here TIMIT's original split is used directly: 462 speakers for training and 168 for testing.
Once we understand the differences between the two corpora and the whole voiceprint recognition pipeline, we can start rewriting our recipe. In fact not much needs to change; it is mainly the data preparation stage and the generation of the trials that have to be modified. For data preparation we can write our own TIMIT preparation script (say, timit_data_prep.sh) modelled on the aishell_data_prep.sh script. This stage generates the three files utt2spk, spk2utt and wav.scp, whose formats were shown earlier; a sketch of such a script follows.
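A minimal sketch of what the preparation script can do (the TIMIT root, output directory and sph2pipe path are assumptions; the uttid is built as <SPKID>_<SENTENCE> so the speaker prefix keeps the sorting consistent):

```bash
#!/bin/bash
# Hedged sketch of a TIMIT data-prep script; adjust all paths to your layout.
timit=/path/to/timit/train
dir=data/train
sph2pipe=$KALDI_ROOT/tools/sph2pipe_v2.5/sph2pipe

mkdir -p $dir
rm -f $dir/wav.scp $dir/utt2spk

find "$timit" -iname '*.wav' | sort | while read -r f; do
  spk=$(basename "$(dirname "$f")" | tr '[:lower:]' '[:upper:]')  # speaker id = parent dir, e.g. FADG0
  utt=${spk}_$(basename "${f%.*}" | tr '[:lower:]' '[:upper:]')   # uttid, e.g. FADG0_SA1
  echo "$utt $sph2pipe -f wav -p -c 1 $f |" >> $dir/wav.scp       # convert SPHERE on the fly
  echo "$utt $spk" >> $dir/utt2spk
done

utils/utt2spk_to_spk2utt.pl $dir/utt2spk > $dir/spk2utt
utils/fix_data_dir.sh $dir   # sorts everything and checks consistency
```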
Next, aishell_data_prep.sh checks whether the wav files it finds add up to 141924, and then generates wav.scp, utt2spk, spk2utt and transcripts.txt; transcripts.txt is only needed for speech recognition, so find the parts of the script related to it and delete them.
After the data preparation stage is done, we can run voiceprint recognition following the pipeline above. One thing to note concerns the trials: if a speaker has only two or three utterances, you need to change the proportion split between the enroll and eval sets; since every speaker in TIMIT has 10 utterances, it is fine to leave it as is. Here 3 utterances per speaker are used for enrollment and the remaining 7 for verification.
The final error rate was about 4.5%. That is an acceptable result, but still a lot worse than the 0.18% error rate on AISHELL. Possible reasons: first, there is less training speech; although there are 462 training speakers, each has only 10 utterances, far fewer than the roughly 300 utterances of each of AISHELL's 340 training speakers. Likewise the TIMIT test set has 168 speakers, many more than AISHELL's 40. Moreover, the default UBM size and i-vector dimension in the AISHELL recipe are quite high for this amount of data, which may also drive the error rate up. To reduce it further you can try shrinking the UBM and i-vector dimensions; after I reduced both, the error rate eventually reached 1.53%.
Other notes
- If you run commands with sudo, the newly generated files end up locked (owned by root); the way to unlock a file is
sudo chmod 777 file
- Step-by-step debugging:
sh -x script.sh
After modifying the script, if you do not want to redo the whole debug run, comment out the already-finished lines with a multi-line BLOCK comment (see the sketch at the end of this post).
- The thread settings are in train_ivector_extractor.sh; even with them set to 2 as described above I could not get it to run. Only after lowering the UBM size to 600 and the i-vector dimension to 400 did it run reasonably fast, and it still took about an hour. I suspect they could be made even smaller, because the utterances are very short.
- If you do not need the text file, you can switch it off at the top of one of the sub-scripts instead of hunting for the exact lines to comment out; I forget which script it was, but you can track it down when an error occurs.
- The only thing that has to be prepared in advance is the dataset. TIMIT's data format and its directory layout are both a headache, so it is best to write a script to handle them.
- In aishell_data_prep.sh, comment out everything related to dev and transcripts.txt.
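The multi-line "BLOCK" comment mentioned above is just the shell here-document trick; a minimal sketch (the commands inside are placeholders):

```bash
# Everything between the two BLOCK markers is fed to the no-op command ':'
# instead of being executed, so an already-finished stage can be skipped
# without deleting it from the script.
: <<'BLOCK'
steps/make_mfcc.sh --cmd "$train_cmd" --nj 4 data/train exp/make_mfcc/train mfcc
sid/compute_vad_decision.sh --cmd "$train_cmd" --nj 4 data/train exp/make_vad/train vad
BLOCK
```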