
For a Chinese learning app, we let users record a syllable and use speech recognition to assess whether the pronunciation is correct.

Every Chinese syllable can be pronounced with different tones (pitch contours) that carry different meanings. We found that neither Google Translate nor the Swift Speech framework is accurate enough to determine whether the pronounced tone was correct. Therefore, we use Beethoven to detect the pitch in the audio and assess the tone outside of the speech recognition API.

The challenge is that in Chinese the tone is carried only by the vowels of a syllable. Beethoven therefore works well if the user pronounces only a vowel, e.g. "a", but in a syllable such as "san" the results are clouded by the consonants "s" and "n".

So I'm looking for a way to trim the syllable recording down to the vowel, so we can run Beethoven on the vowel alone and detect the Chinese tone correctly. I'm also happy to learn if anyone has a better idea of how to tackle this challenge.
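To make the trimming idea concrete, here is a rough sketch of the direction I'm considering (pure Python for illustration, not our Swift code; `vowel_region` and every threshold in it are my guesses, not validated values): unvoiced consonants such as "s" have a much higher zero-crossing rate than vowels, so keeping the longest low-ZCR, sufficiently loud run of frames might isolate the vowel.

```python
import math

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs that change sign (high for noisy consonants)."""
    return sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0) / len(frame)

def rms(frame):
    """Root-mean-square level of a frame (rough loudness)."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def vowel_region(samples, size=256, hop=128, zcr_max=0.2, rms_min=0.05):
    """Return (start, end) sample indices of the longest run of voiced-looking
    frames: low zero-crossing rate and enough energy. None if no frame qualifies.
    Frame size and thresholds are guesses and would need tuning on real audio."""
    voiced = []
    for i in range(0, len(samples) - size + 1, hop):
        frame = samples[i:i + size]
        if zero_crossing_rate(frame) < zcr_max and rms(frame) > rms_min:
            voiced.append(i)
    if not voiced:
        return None
    # keep the longest contiguous run of voiced frame starts (spacing == hop)
    best = cur = [voiced[0]]
    for prev, nxt in zip(voiced, voiced[1:]):
        cur = cur + [nxt] if nxt - prev == hop else [nxt]
        if len(cur) > len(best):
            best = cur
    return best[0], best[-1] + size
```

I realize this would only handle unvoiced consonants like "s"; a voiced nasal like "n" may look similar to a vowel under this measure.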

Best, Paul

1 Answer

One fact about vowels and consonants that might be helpful: vowels are generally harmonic, with energy concentrated in formant regions (the first two formants being the most important, and the second typically below 3 kHz), while many consonants (fricatives, sibilants) have noisy energy at or above 4 kHz. Here is a good diagram from a lecture on the acoustics of fricatives where this can be seen.

[Image: ASA sonogram]

You might need a more sophisticated FFT-based analysis tool than Beethoven to detect when the sibilants' or fricatives' frequency content is present. I've not used Beethoven and don't know what its capabilities are.
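Not from my own experience with Beethoven, but to sketch how that spectral split could be measured: the fraction of a frame's energy above 4 kHz can serve as a crude fricative detector. The snippet below is a minimal pure-Python illustration (a naive direct DFT, assuming mono float samples at a known sample rate; `fricative_score` and the 4 kHz cutoff are my assumptions to tune, and a real implementation would use a proper FFT, e.g. Accelerate on iOS):

```python
import cmath, math

def band_energy(frame, sr, f_lo, f_hi):
    """Spectral energy in [f_lo, f_hi) Hz via a direct DFT (slow but dependency-free)."""
    n = len(frame)
    total = 0.0
    for k in range(int(f_lo * n / sr), int(f_hi * n / sr)):
        s = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame))
        total += abs(s) ** 2
    return total

def fricative_score(frame, sr, cutoff=4000):
    """Fraction of spectral energy above `cutoff` Hz -- high for sibilants/fricatives
    like /s/, low for the harmonic, formant-dominated spectrum of a vowel."""
    hi = band_energy(frame, sr, cutoff, sr / 2)
    lo = band_energy(frame, sr, 0, cutoff)
    return hi / (hi + lo + 1e-12)
```

Frames scoring high could then be dropped from the recording before the remainder is handed to the pitch detector.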

I don't know much about the nasals, though. The same lecture series, in a different chapter ("Plosives and Nasals"), gives this info:

The nasalisation of vowels is cued by the presence of a low-frequency resonance and an increase in formant damping.

It seems to me that it would be challenging to distinguish nasals from vowels by their spectrum alone.

Phil Freihofner
  • Occurs to me this question might be better posted at https://dsp.stackexchange.com/ since this level of timbre analysis involves DSP tools and concepts. My answer comes not so much from direct experience with speech recognition tools as with a more general knowledge about acoustics, formants and phonetics. – Phil Freihofner Sep 30 '21 at 05:53
  • Thanks Phil for your support! That distinction between the spectral nature of vowels and consonants seems like the way to go. I didn't know about the DSP Stack Exchange; I'll post there as well! Thanks again. – Paul Sep 30 '21 at 14:01
  • There are charts that show the different formant constructs for different vowels. Might be "project creep" or an interesting proposal: if you can distinguish the formant peaks in your DSP analysis, to create sounds in response that show what the speaker did and what they should aim for, to highlight the difference, along with relevant diagrams of the front/back or open/close positions. I have, for example, written a tool that can generate and play, in real time, a pitch that has a dynamic Hz, volume, and two dynamic formants. – Phil Freihofner Oct 01 '21 at 17:12