I am working with the HTK toolkit on a word-spotting task and have a classic mismatch between training and testing data. The training data consists only of "clean" speech recorded over a microphone. The data was converted to MFCC_E_D_A parameters, which were then modelled by phone-level HMMs. My test data was recorded over landline and mobile phone channels, which introduces channel distortion. Using the MFCC_E_D_A parameters with HVite results in incorrect output. I want to use cepstral mean normalization via MFCC_E_D_A_Z parameters, but on its own that would not be of much use, since the HMMs were not trained on such data. My questions are as follows:

  1. Is there any way to convert MFCC_E_D_A_Z into MFCC_E_D_A? The pipeline would then be: input -> MFCC_E_D_A_Z -> MFCC_E_D_A -> HMM log-likelihood computation.
  2. Is there any way to convert the existing HMMs, which model MFCC_E_D_A parameters, so that they model MFCC_E_D_A_Z instead?

If (1) is possible, what would the HCopy config file look like? I wrote the following config file for the conversion:
SOURCEFORMAT = MFCC_E_D_A_Z
TARGETKIND = MFCC_E_D_A
TARGETRATE = 100000.0
SAVECOMPRESSED = T
SAVEWITHCRC = T
WINDOWSIZE = 250000.0
USEHAMMING = T
PREEMCOEF = 0.97
NUMCHANS = 26
CEPLIFTER = 22
NUMCEPS = 12
ENORMALISE = T

This does not work. How can I improve this?
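One thing I am unsure about: as far as I can tell from the HTK book, SOURCEFORMAT names the source file format (e.g. HTK) while SOURCEKIND names the input parameter kind, so perhaps the conversion should be declared more like the following sketch (my guess, untested):

# Sketch only: declare the input parameter kind with SOURCEKIND.
# The waveform front-end settings (window size, preemphasis, filterbank)
# should not apply when the source is already parameterized.
SOURCEFORMAT = HTK
SOURCEKIND = MFCC_E_D_A_Z
TARGETKIND = MFCC_E_D_A
SAVECOMPRESSED = T
SAVEWITHCRC = T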

Sriram

1 Answer

You need to understand that telephone recordings cover a different range of frequencies because the channel is band-limited: usually only frequencies from about 200 to 3500 Hz are present. A wideband acoustic model is trained on the range from 100 to 6800 Hz, and it will not decode telephone speech reliably because telephone speech is missing the required frequencies from 3500 to 6800 Hz. This is not a matter of feature type, mean normalization, or distortion; you simply cannot decode it that way.

You need to retrain your original model on audio downsampled to 8 kHz, or at least modify the filterbank parameters to match the telephone range of frequencies.
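For the filterbank route, the LOFREQ and HIFREQ configuration variables set the lower and upper cutoffs of the analysis filterbank. A minimal sketch, with typical telephone-band values that you would need to verify against your own channel:

# Restrict the analysis filterbank to the telephone band
# (cutoffs below are typical values, not tuned for your data)
LOFREQ = 300.0
HIFREQ = 3400.0

For the 8 kHz route, any resampler will do; for example, assuming a reasonably recent sox is installed, sox clean.wav -r 8000 clean_8k.wav (the filenames here are placeholders). Either way, the HMMs must then be retrained on features produced with the matching settings.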

Nikolay Shmyrev
  • I implemented CMN (Cepstral Mean Normalization) for a similar (not the same) problem. The technique is very useful when there is a mismatch between test and training data in speech recognition, and there is a considerable amount of literature on this mismatch issue. Here I have the same problem, but in a different form: the HMMs exist in one parameterization while I would like to have them in another. – Sriram Jul 30 '11 at 05:19
  • Continuing from above, I am not entirely convinced that converting to 8 kHz is the way to go on this one; I don't have that much data/time. Modifying the filterbank is an option, but how do I go about it? – Sriram Jul 30 '11 at 05:31
  • > for a similar (not same) problem. There is nothing similar. CMN actually reduces information by dropping the mean vector from the features (that's why it's called normalization). Yes, it helps to deal with mismatched channel conditions. But you are trying to do the opposite: you are trying to restore the mean of the E_D_A_Z feature vectors (which have had their mean removed) to recover the E_D_A values that were used during model training. There is no way to do that (see the equations after these comments). The same goes for your idea of repairing the means in the acoustic model. – Nikolay Shmyrev Jul 30 '11 at 07:11
  • OK. I understand now. It was very stupid of me to expect any tool to reconstruct the mean once it has been removed. So, the solution to this problem lies in re-training the entire set of HMMs? – Sriram Jul 30 '11 at 09:44
  • Yes, you need to retrain them with a proper feature set and proper filterbank frequencies. – Nikolay Shmyrev Jul 30 '11 at 15:43
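To spell out the point about the mean: with c_t denoting the cepstral vector of frame t and T the number of frames in an utterance, CMN computes and subtracts the per-utterance mean,

    \bar{c} = \frac{1}{T} \sum_{t=1}^{T} c_t, \qquad \hat{c}_t = c_t - \bar{c}

The normalized frames \hat{c}_t carry no trace of \bar{c}; since \bar{c} is utterance-specific and discarded, the original c_t cannot be recovered from \hat{c}_t alone.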