Converting real-time audio to phonemes

Question

I am using a microphone as the input for real-time audio. How do I extract the phoneme currently being spoken from the audio? I need it for lip-syncing 2D characters.

Basically, my approach would be to:

1. Fetch the real-time audio using a microphone.
2. Detect the phoneme currently being pronounced in the audio.

I have tried looking everywhere for an example or library that could solve this type of problem. Most libraries don't seem to output phonemes from audio.

There is a paper that describes how machine learning was used to solve this, but it includes no code or tutorial on how to do it: https://www.arxiv-vanity.com/papers/1910.08685/

There is also a speech recognition tool called Pocketsphinx, but I have not yet found an example of using it for phoneme recognition.

score 1 · Accepted Answer · answered May 18 '23 at 22:45

The way I would approach this is to get the words from the audio using Whisper or a similar STT service (the Python SpeechRecognition library is the go-to at the moment), then use the CMU Dict library (CMUdict, the CMU Pronouncing Dictionary) to look up the phonemes for each word.
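As a rough illustration of the lookup step, here is a minimal Python sketch. It assumes the transcript has already come back from the STT step, and it uses a tiny hand-copied excerpt in CMUdict format rather than the full dictionary (which you would normally load, for example through the cmudict package). The names `CMU_EXCERPT` and `words_to_phonemes` are made up for this example.

```python
# Tiny hand-copied excerpt in CMUdict format: word -> list of pronunciations.
# Assumption: a real project would load the full CMU Pronouncing Dictionary
# instead of this four-word table.
CMU_EXCERPT = {
    "hello": [["HH", "AH0", "L", "OW1"]],
    "this": [["DH", "IH1", "S"]],
    "that": [["DH", "AE1", "T"]],
    "world": [["W", "ER1", "L", "D"]],
}


def words_to_phonemes(transcript, pronunciations=CMU_EXCERPT):
    """Return (word, phonemes) pairs; words missing from the table map to None."""
    result = []
    for raw in transcript.lower().split():
        # Drop punctuation that an STT engine may attach to a word.
        word = raw.strip(".,!?;:\"'")
        if not word:
            continue
        entries = pronunciations.get(word)
        # Use the first listed pronunciation when the word is known.
        phonemes = list(entries[0]) if entries else None
        result.append((word, phonemes))
    return result


if __name__ == "__main__":
    for word, phonemes in words_to_phonemes("Hello, world!"):
        print(word, phonemes)
```

Words that are not in the dictionary come back as None here, so the caller has to decide what to do with them (skip them, or fall back to some grapheme-to-phoneme tool).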

The phonemes are given in the CMU dictionary's own notation (ARPAbet), not in IPA. For example, DH stands for the ð phoneme, the voiced "th" sound in "this" and "that" (the voiceless θ, as in "think", is written TH). If you need the phonemes in IPA, you will need another conversion layer; consider the IPA2 library.
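If you would rather not pull in another dependency, the conversion layer can also be a plain lookup table. The sketch below is only an illustration: `ARPABET_TO_IPA` and `arpabet_to_ipa` are hypothetical names, the IPA symbols follow one common convention for General American English, and stress is simply discarded (apart from treating unstressed AH0 as a schwa, which is also a convention and not something CMUdict itself prescribes).

```python
# One common ARPAbet -> IPA convention for the 39 CMUdict phonemes.
# Assumption: other references pick slightly different symbols for some vowels.
ARPABET_TO_IPA = {
    "AA": "ɑ", "AE": "æ", "AH": "ʌ", "AO": "ɔ", "AW": "aʊ", "AY": "aɪ",
    "B": "b", "CH": "tʃ", "D": "d", "DH": "ð", "EH": "ɛ", "ER": "ɝ",
    "EY": "eɪ", "F": "f", "G": "ɡ", "HH": "h", "IH": "ɪ", "IY": "i",
    "JH": "dʒ", "K": "k", "L": "l", "M": "m", "N": "n", "NG": "ŋ",
    "OW": "oʊ", "OY": "ɔɪ", "P": "p", "R": "ɹ", "S": "s", "SH": "ʃ",
    "T": "t", "TH": "θ", "UH": "ʊ", "UW": "u", "V": "v", "W": "w",
    "Y": "j", "Z": "z", "ZH": "ʒ",
}


def arpabet_to_ipa(phonemes):
    """Convert a list of CMUdict symbols (e.g. ["DH", "IH1", "S"]) to an IPA string."""
    ipa = []
    for symbol in phonemes:
        if symbol == "AH0":
            # Unstressed AH is conventionally written as a schwa.
            ipa.append("ə")
            continue
        # CMUdict marks vowel stress with a trailing 0, 1 or 2; drop it.
        base = symbol.rstrip("012")
        ipa.append(ARPABET_TO_IPA[base])
    return "".join(ipa)


if __name__ == "__main__":
    print(arpabet_to_ipa(["DH", "IH1", "S"]))        # ðɪs
    print(arpabet_to_ipa(["TH", "IH1", "NG", "K"]))  # θɪŋk
```

For lip-syncing you may not need IPA at all, since the ARPAbet symbols are already a small, fixed set that can be mapped directly to mouth shapes.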

Kathy Reid

Thanks for the answer! Managed to achieve (somewhat) real-time Speech-To-Phonemes by transcribing streaming audio, whilst splitting the words into their individual phonemes. While it isn't (entirely) real-time, a 300ms delay should suffice :) – NectoJ Jul 05 '23 at 17:16

You're welcome! Interesting project, and good luck with it. – Kathy Reid Jul 06 '23 at 05:32
