How can I find the best matching sound file(s) to a sample sound file?

Question

I'm working on a very simplistic speech recognition project. I currently have 2 sets of wav files. Each set has 1-second long recordings of a set of words spoken by the same person at 2 different instances. For example, one set has the words "one", "two", and "three", and the other set has the same exact words obtained through a separate recording. Many of the words rhyme and use somewhat different sounds.

I've tried several things, but the most practical result so far is a spectrogram for each sound file (all constructed the same way using the same script).

This has all been done through MATLAB and I may only use MATLAB.

I will refer to one set of recordings/spectrograms as the "sample set", and that will be the set from which I will provide the sample sound. I will refer to the other set of recordings/spectrograms as the "test set", and that will be the set from which I will try to find the best match to the provided sample recording/spectrogram.

What I would like is, when provided with a sample sound/spectrogram, MATLAB will return the best match or matches from the test set. Ideally, it will return the same word, but realistically I will be very happy with just some of the samples returning similar results (e.g. words that rhyme or have similar vowels/consonants).

What are some approaches I could try? Again, it is fine if this fails as long as the process is reasonable. I understand I have a very small sample size of sounds. I also understand it would be best to compare the sounds in the frequency domain, but all I have as of right now are spectrograms.

score 1 · Accepted Answer · answered Jan 21 '14 at 18:14

Dynamic time warping (DTW) gives a measure of the distance between two utterances, even when they are spoken at different speeds. You can find a MATLAB implementation on MATLAB Central.
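
As a rough illustration of the idea (not the MATLAB Central implementation), here is a minimal DTW sketch in base MATLAB. It assumes each recording has already been reduced to a matrix with one feature vector per time frame, for example the columns of a spectrogram magnitude; the variable names and the toy data are made up for the example.

```matlab
% Toy feature sequences: one column per time frame (assumed layout).
% B is A with two frames repeated, i.e. the same "word" spoken more slowly.
A = [1 2 3; 0 1 0];
B = [1 1 2 3 3; 0 0 1 0 0];

n = size(A, 2);
m = size(B, 2);

% D(i+1, j+1) = cost of the cheapest alignment of A(:, 1:i) with B(:, 1:j).
D = inf(n + 1, m + 1);
D(1, 1) = 0;
for i = 1:n
    for j = 1:m
        localCost = norm(A(:, i) - B(:, j));   % Euclidean distance between frames
        D(i + 1, j + 1) = localCost + min([D(i, j), D(i, j + 1), D(i + 1, j)]);
    end
end
dtwDist = D(n + 1, m + 1);   % total cost of the best alignment
```

To find the best match, one would compute `dtwDist` between the sample and every recording in the test set and keep the recording(s) with the smallest value. Because the frames are allowed to stretch in time, the toy pair above aligns with zero cost even though the sequences have different lengths.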

lCapp

Could you provide more detail or an example? – Steve Westbrook Jan 21 '14 at 18:32

Sorry, I never used it myself; I just know DTW is a possibility. Have a look here: http://csl.anthropomatik.kit.edu/downloads/vorlesungsinhalte/MMMK-PP05-DynamicProgramming-SS2012.pdf – lCapp Jan 23 '14 at 16:59

Adiel · Answer 2 · 2014-01-20T22:10:23.190

The spectrogram is a great start. You can extract formants from it; look here for how to do it.

Basically, formants are features of each separate syllable; e.g. for the word "three" there are different formants for 'th', 'r', and 'i'. So you should separate the syllables first, then extract the formants for each syllable, and finally compare the "sample" files to the "test" files.

Anyway, if each file contains only a single word, I think extracting formants for the whole word can also be a suitable approach, especially if you can tolerate some error.

EDIT:

So, I still think that extracting formants is the right way, but if you want to compare spectrograms, you can rely on the fact that the words have one vocal syllable. You can see in the spectrograms that the vocal part has peaks at high frequencies (for example, the spectrogram in the link above shows the word "matlab", and has red lines at higher frequencies at the two vocal 'a' sounds).

Divide the spectrogram along the time dimension into segments of roughly 50 ms, and pick those whose peaks are at high frequencies (according to some threshold that you need to choose; it will be easy after looking at the spectrograms). For each word, save the location in time and the frequencies of the 3-4 highest peaks for each time period that you choose. Then, according to your specific data, you need to experiment to determine exactly how much tolerance to allow in time/frequency when deciding that two words are similar.
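
The segmentation step could be sketched roughly as follows in base MATLAB. This is only one reading of the procedure: it uses a synthetic tone in place of a recording, keeps the strongest FFT bins of each segment rather than doing true local-peak picking, and the threshold `fThresh` and all variable names are hypothetical values chosen for the example.

```matlab
% Synthetic stand-in for one recording: 1 s of a 440 Hz tone at 8 kHz.
% With real data you would load a file instead, e.g.
%   [x, fs] = audioread('one.wav');   % hypothetical file name
fs = 8000;
t  = (0:fs - 1)' / fs;
x  = sin(2 * pi * 440 * t);

segLen  = round(0.05 * fs);            % 50 ms segments
nSeg    = floor(numel(x) / segLen);
nPeaks  = 3;                           % strongest bins kept per segment
fThresh = 300;                         % Hz, hypothetical; choose by inspecting your spectrograms

nBins = floor(segLen / 2) + 1;         % non-negative frequencies only
freqs = (0:nBins - 1)' * fs / segLen;  % centre frequency of each bin (Hz)
win   = 0.5 - 0.5 * cos(2 * pi * (0:segLen - 1)' / (segLen - 1));  % Hann window

segTime  = ((0:nSeg - 1) + 0.5) * segLen / fs;   % segment centre times (s)
peakFreq = zeros(nPeaks, nSeg);
for k = 1:nSeg
    seg = x((k - 1) * segLen + (1:segLen)) .* win;
    mag = abs(fft(seg));
    [~, order] = sort(mag(1:nBins), 'descend');  % strongest bins first
    peakFreq(:, k) = freqs(order(1:nPeaks));
end

% Keep only the segments whose strongest component lies above the threshold.
keep     = peakFreq(1, :) > fThresh;
features = [segTime(keep); peakFreq(:, keep)];   % row 1: time, rows 2-4: frequencies
```

Two words could then be compared by checking whether the columns of their `features` matrices agree within the chosen time and frequency tolerances.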

Do I have to extract the sounds in order to get any results? I understand that is the best method, but I was really hoping for a way to just find the best matching spectrograms of each word. All of the words are very short and most of them are monosyllabic so there is just one prominent vowel. — Michi, Jan 20 '14 at 20:37
The words in the question are not monosyllabic; look at my example about "three". But try to do it on the whole word, as if it were monosyllabic. I think it should work. — Adiel, Jan 20 '14 at 20:48
So, extraction aside, what tool can I use to compare spectrograms to find their level of similarity? — Michi, Jan 20 '14 at 20:52