Waveform Comparison

Question

I am working on a personal research project.

My objective is to recognize a sound and identify whether it belongs to the IPA by comparing its waveform to a waveform in my database. I have some skill with Mathematica, SciPy, and PyBrain.

For the first phase, I'm only using the English (US) phonetic alphabet. I have a simple test bank of English phonetic alphabet sound files I found online. The trick here is:

I want to separate a sound file into waveforms that correspond to different syllables; this will take a learning algorithm. So 'I like apples' would be cut up into the syllable waveforms that make up the sentence.

Each waveform is then compared against the waveforms of the English phonetic alphabet. I'm not certain how to do this part. I was thinking of using Praat to detect the waveforms, capture an image of each waveform, and compare it to the one stored in the database with image analysis (which is kind of fun to do).

The problem here is that I don't know how to make Praat generate a waveform file automatically and then cut it between syllables into waveform chunks. Logically, I would just prepare test cases for a learning algorithm and teach the computer to do it.

Instead of needing a waveform image, could I do this with a fast Fourier transform: compare two FFTs and, if they match within an x% margin of error, consider the sound to be syllable y?

score 2 · Answer 1 · edited Jun 29 '19 at 01:12


Frankly, I don't know much about Praat, but I find your project super cool and interesting. I have experience with fault detection in car motors based on their sound, which might be connected to your project. I used neural networks and SVMs for the classification because multiple research papers had shown that they work, so I had no doubts about the approach I chose. My advice is to research and read some papers on the topic. It really helps when you have questions like these (Will it work? Can I use it instead? Am I using the optimal solution?). And good luck, that's an awesome project :)

answered Oct 30 '16 at 01:32 by BilguunCH · edited Jun 29 '19 at 01:12 by Alexis Wilke

Detecting engine faults via sound: that is too cool. Lukasz gave some great input. And thank you! I did just that and searched through several papers on Google Scholar and found lots of information, like the link I shared with Lukasz. It shows that a modified transform (the Discrete Tchebichef Transform) is viable and yields solid results, but it looks like the training process could be very long-winded and messy with the massive vectors I could end up with. You were very right: researching the question and its elements is essential. – Nikki Oct 31 '16 at 08:56

score 2 · Accepted Answer · edited Jun 29 '19 at 01:11


You could try Praat scripting.

Using just the FFT will give you rather terrible results: a very long feature vector that is difficult to segment and to train on. That's thousands of points for a single syllable. Some deep neural networks can cope with it, but only if you design them properly and provide a huge training set. The advantage of neural networks is that they can build features for you from the "raw data" (and I would consider the FFT "raw" too). However, when you work with sound, that isn't badly needed: you can engineer features manually. For sound, science knows very well what sort of features sounds have.
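To make the contrast concrete, here is a minimal sketch in plain Python (standard library only). It is an illustration under assumptions, not code from any particular library: the helper names `magnitude_spectrum` and `frame_features`, the 8 kHz sample rate, the 256-sample frame, and the synthetic 1 kHz tone are all made up for the example, and the three features (energy, zero-crossing rate, spectral centroid) are just common textbook choices. The DFT is the naive O(N^2) form to keep the block self-contained; in practice you would use an FFT routine from NumPy or SciPy and much longer recordings.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..N/2 (O(N^2), fine for one short frame)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def frame_features(frame, sample_rate):
    """Reduce one frame of audio to three hand-engineered features."""
    n = len(frame)
    # Mean squared amplitude of the frame.
    energy = sum(x * x for x in frame) / n
    # Fraction of adjacent sample pairs whose sign differs.
    zcr = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0)) / (n - 1)
    # Magnitude-weighted mean frequency of the spectrum, in Hz.
    mags = magnitude_spectrum(frame)
    total = sum(mags)
    centroid = (
        sum(k * sample_rate / n * m for k, m in enumerate(mags)) / total
        if total
        else 0.0
    )
    return {
        "energy": energy,
        "zero_crossing_rate": zcr,
        "spectral_centroid_hz": centroid,
    }


# Hypothetical input: a synthetic 1 kHz tone standing in for a recorded frame.
SAMPLE_RATE = 8000
FRAME_SIZE = 256
frame = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(FRAME_SIZE)]

raw = magnitude_spectrum(frame)
features = frame_features(frame, SAMPLE_RATE)
print(len(raw), "raw spectrum bins vs", len(features), "engineered features")
print(features)
```

Even for this tiny frame the raw spectrum has far more values than the feature dictionary, and the gap grows with the length of the recording. A real feature set would be richer than these three numbers, but the idea is the same: each frame is summarized by a short, fixed-length vector that a classifier can be trained on.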

You can calculate these features with libraries like Yaafe. I recommend checking it out even if you are not working in C++ or Python: the link I provided also gives the formulas for calculating them. I used some of them in my kiwi classifier.

Another good approach comes from scikit-talkbox, which provides exactly the tooling you might need.

answered Oct 30 '16 at 07:39 by Lukasz Tracewski · edited Jun 29 '19 at 01:11 by Alexis Wilke

I really appreciate the input! I tried using the FFT earlier today in Matlab and you're right: that was a really ugly vector. I'm actually more comfortable with Python, so I'll look into the links. This is a great start. I looked up some papers in the field, and using the FFT for sound analysis is... a feat. This is where I ended up while I was down the rabbit hole: https://core.ac.uk/download/pdf/35379497.pdf I'm just now getting to Talkbox (thanks to you) and it looks very promising so far. – Nikki Oct 31 '16 at 08:50

You can spend the next month just reviewing the literature :). I know it's not what you are after, but you could try going the other way around: speech-to-text, and then simply extract the syllables. You can patch it together in just a few hours with existing Python libraries. This way you take advantage of all the work that has been put into speech recognition and then apply a regex (yeah, a rather lengthy one) to get what you need. You should get very good results. – Lukasz Tracewski Oct 31 '16 at 16:26
