
I'm able to analyze audio data using the AudioContext API in JavaScript and draw the waveform to a canvas.
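For reference, here is roughly the setup I'm using to get the per-frame data (a minimal sketch; the `<audio>` element and the `onFrame` loop are just my own scaffolding):

```javascript
const audioCtx = new AudioContext();
const analyser = audioCtx.createAnalyser();
analyser.fftSize = 1024; // one 1024-sample window per frame

// Any source node works; here, an <audio> element on the page
const source = audioCtx.createMediaElementSource(document.querySelector('audio'));
source.connect(analyser);
analyser.connect(audioCtx.destination);

const data = new Uint8Array(analyser.fftSize); // values 0–255, 128 = zero level

function onFrame() {
  analyser.getByteTimeDomainData(data); // current waveform for this frame
  // ...draw `data` to the canvas, then (the open question) classify the sound...
  requestAnimationFrame(onFrame);
}
requestAnimationFrame(onFrame);
```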

The question is: after loading the audio data, I have a Uint8Array of about 1024 data points per frame representing the waveform. How do I guess which sound it is making, from the choice of phonetics described here, namely:

Ⓐ

Closed mouth for the “P”, “B”, and “M” sounds. This is almost identical to the Ⓧ shape, but there is ever-so-slight pressure between the lips.

Ⓑ

Slightly open mouth with clenched teeth. This mouth shape is used for most consonants (“K”, “S”, “T”, etc.). It’s also used for some vowels such as the “EE” sound in bee.

Ⓒ

Open mouth. This mouth shape is used for vowels like “EH” as in men and “AE” as in bat. It’s also used for some consonants, depending on context.

This shape is also used as an in-between when animating from Ⓐ or Ⓑ to Ⓓ. So make sure the animations ⒶⒸⒹ and ⒷⒸⒹ look smooth!

Ⓓ

Wide open mouth. This mouth shape is used for vowels like “AA” as in father.

Ⓔ

Slightly rounded mouth. This mouth shape is used for vowels like “AO” as in off and “ER” as in bird.

This shape is also used as an in-between when animating from Ⓒ or Ⓓ to Ⓕ. Make sure the mouth isn’t wider open than for Ⓒ. Both ⒸⒺⒻ and ⒹⒺⒻ should result in smooth animation.

Ⓕ

Puckered lips. This mouth shape is used for “UW” as in you, “OW” as in show, and “W” as in way.

Ⓖ

Upper teeth touching the lower lip for “F” as in for and “V” as in very.

This extended mouth shape is optional. If your art style is detailed enough, it greatly improves the overall look of the animation. If you decide not to use it, you can specify so using the extendedShapes option.

Ⓗ

This shape is used for long “L” sounds, with the tongue raised behind the upper teeth. The mouth should be at least as far open as in Ⓒ, but not quite as far as in Ⓓ.

This extended mouth shape is optional. Depending on your art style and the angle of the head, the tongue may not be visible at all. In this case, there is no point in drawing this extra shape. If you decide not to use it, you can specify so using the extendedShapes option.

Ⓧ

Idle position. This mouth shape is used for pauses in speech. This should be the same mouth drawing you use when your character is walking around without talking. It is almost identical to Ⓐ, but with slightly less pressure between the lips: For Ⓧ, the lips should be closed but relaxed.

This extended mouth shape is optional. Whether there should be any visible difference between the rest position Ⓧ and the closed talking mouth Ⓐ depends on your art style and personal taste. If you decide not to use it, you can specify so using the extendedShapes option.


I know there are many machine learning options like Meyda and TensorFlow and other machine learning methods, but I want an algorithm to detect the above phonetics in real time. It doesn't have to be 100% accurate, just slightly better than randomly picking mouth shapes... At this point, anything better than random would be fine.

I'm aware audio recognition can be done with PocketSphinx.js, and that this is what Rhubarb Lip Sync uses for its calculations, but all I'm looking for is a very simple algorithm that, given a 1024-element array of waveform data per frame, guesses the phonetic. Again, it doesn't have to be 100% accurate, but it has to be realtime and better than random.

Basically, the problem with PocketSphinx is that its purpose is speech-to-word recognition, so it carries a lot of extra code meant to translate the sounds into the exact words compiled in its dictionaries. I don't need that: I only need to extract the sounds themselves, without mapping them to any dictionary, so theoretically there shouldn't be as much overhead.

I just want a simple algorithm that can take the already acquired data from the AudioContext and guess, roughly, which sound in the above-mentioned list is being made. Again, to be very clear:

I am not looking for a PocketSphinx solution, nor any other "ready to go" library. All I want is a mathematical formula for each one of the unique sounds mentioned above that can be adapted to any programming language.
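To illustrate the kind of formula I mean, here's a sketch using two cheap per-frame features, RMS amplitude and zero-crossing rate. The thresholds and the shape assignments are placeholders I made up, not values I've validated:

```javascript
// data: Uint8Array of 1024 byte time-domain samples (0–255, 128 = zero)
function guessShape(data) {
  let sumSq = 0;
  let crossings = 0;
  for (let i = 0; i < data.length; i++) {
    const s = (data[i] - 128) / 128; // normalize to -1..1
    sumSq += s * s;
    if (i > 0 && (data[i] - 128) * (data[i - 1] - 128) < 0) crossings++;
  }
  const rms = Math.sqrt(sumSq / data.length); // overall loudness
  const zcr = crossings / data.length;        // noisiness (high for S, F, ...)

  // Made-up thresholds, purely illustrative:
  if (rms < 0.02) return 'X'; // near-silence: rest position
  if (zcr > 0.25) return 'B'; // noisy frame: fricative-like consonant
  if (rms > 0.20) return 'D'; // very loud voiced frame: wide open
  if (rms > 0.10) return 'C'; // medium voiced frame: open
  return 'A';                 // quiet voiced frame: nearly closed
}
```

Something along these lines, but with a justified feature and threshold per shape, is exactly what I'm after.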

1 Answer


I'm not sure why this is tagged tensorflow if you don't want a TensorFlow answer. If all you want is something better than random, you are almost certainly better off using a package like PocketSphinx and breaking the returned words down into their phonetics. What you're asking for is quite difficult: see threads discussing why here and here.
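For what it's worth, once PocketSphinx has given you words (or raw phonemes), the mapping to your shapes is just a lookup. This partial table follows the shape descriptions in your own question, using ARPAbet symbols; anything not listed would need a sensible default:

```javascript
// Partial ARPAbet phoneme → mouth-shape lookup, grouped per the
// shape descriptions in the question. SIL marks a pause.
const PHONEME_TO_SHAPE = {
  P: 'A', B: 'A', M: 'A',          // closed lips
  K: 'B', S: 'B', T: 'B', IY: 'B', // clenched teeth; "EE" as in bee
  EH: 'C', AE: 'C',                // open mouth
  AA: 'D',                         // wide open; "AA" as in father
  AO: 'E', ER: 'E',                // slightly rounded
  UW: 'F', OW: 'F', W: 'F',        // puckered lips
  F: 'G', V: 'G',                  // upper teeth on lower lip
  L: 'H',                          // tongue behind upper teeth
  SIL: 'X',                        // pause: rest position
};

const shapeFor = (phoneme) => PHONEME_TO_SHAPE[phoneme] ?? 'B';
```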

However, if you are absolutely attached to finding an algorithm for this...

Searching around, most items I came across used machine learning, except for a few: this paper from 2008, this one from 1993 (which was expanded into a full Ph.D. dissertation), and this MIT research paper from 1997. Here is a sample of the algorithm the authors used in the last one, just for the /R/ sound:

[Figure: the authors' detection algorithm for the /R/ sound, reproduced from the paper]

The paper says they implemented their algorithm in C++, but unfortunately no code is included.

Bottom line, I would recommend sticking with PocketSphinx unless this is part of your own Ph.D. research!

UPDATE:

Adding more detail here upon request. PocketSphinx explains, if you scroll all the way down to section 8 of their readme, that it uses a machine learning platform called Sphinxtrain, which is also available in French and Chinese. But at the top of the Sphinxtrain page, there is a link to their "new library" called Vosk.

Vosk supports 9 languages and is small enough to fit on a Raspberry Pi, so it may be closer to what you're looking for. It, in turn, uses an open source C++ speech recognition toolkit called Kaldi, which also uses machine learning.

Arduinos are significantly more limited than Raspberry Pis, as I'm sure you know, so you may seriously want to reach out to the authors of the MIT paper if you are going in that direction. The authors used a 200 MHz Pentium Pro processor with 32 MB of RAM, and that's about the power level of the best Arduinos: the Arduino Yun 2 includes a 400 MHz Linux microprocessor with 64 MB of RAM.

Hopefully that gives you enough to chew on. Good luck!

jdaz
  • OK, thanks for the research papers. Any idea what PocketSphinx is made from? Do they use machine learning or algorithms? And BTW, it may be for a Ph.D. paper theoretically, but I DO want an algorithm, because I might want to implement this on Arduino and other places where PocketSphinx is too much overhead. Also, in the browser, PocketSphinx is about 10 MB because it has to include the entire dictionaries, and it only works for English words, *and* a primary function of it is speech-to-*text*, which I don't care about, so I want to remove that overhead – B''H Bi'ezras -- Boruch Hashem Jun 02 '20 at 23:43
  • Also, how were you able to get that formula from the last link? I wasn't able to find any paper from 1997, and the ones I did find require a login – B''H Bi'ezras -- Boruch Hashem Jun 02 '20 at 23:47
  • There's a small link with "PDF" next to it. [Here](http://publications.csail.mit.edu/lcs/pubs/pdf/MIT-LCS-TR-736.pdf) is the direct link. – jdaz Jun 03 '20 at 00:15
  • I've updated my answer with more detail based on your questions. – jdaz Jun 03 '20 at 00:42
  • Hi, thanks, although really all I'm looking for is *only algorithms*, no machine learning. That first picture for the R sound was really good and exactly what I'm looking for, but I looked in that paper and couldn't find anything for the other sounds. If you can provide some kind of reference to where I can find those, then this would be the accepted answer – B''H Bi'ezras -- Boruch Hashem Jun 10 '20 at 21:00
  • Arduino is one thing, but I'm also mainly trying to implement it in the browser, and it needs to be extremely lightweight. pocketsphinx.js is over 10 MB, and even the most minimalistic version is way too overkill and causes a ton of delay; it needs to be realtime even on older devices. Vosk doesn't accomplish what I need because I need to just use the AudioContext API with JavaScript on the client side of the browser, while Vosk is a binary dependency (and way too big (50 MB) anyway), and it's meant for language detection, which is **not** what I'm looking for – B''H Bi'ezras -- Boruch Hashem Jun 10 '20 at 21:02