-1

I need to analyse a sentence/phrase and output the time it takes to utter each word. For example, for the sentence

How can mirrors be real if our eyes aren't?

I need this

  Word      Time   
 --------- ------- 
  How       101ms  
  can       95ms   
  mirrors   180ms  
  be        70ms   
  real      120ms  
  if        80ms   
  our       99ms   
  eyes      101ms  
  aren't?   180ms  

(I made this one up. These are not the actual utterance times.)

One way of doing this is to assume that utterance time is proportional to word length, but this isn't always true ('Queue' and 'Q' have the same utterance time although they differ in word length).

Also, the presence of punctuation marks has to be factored in.

Bonus: Recognizing Emotions :)

Can anyone point me to algorithms/papers that do this? Is there any way to hack this up from existing text-to-speech code? Java code suggestions are appreciated!

Atul Vinayak
  • Hmm... do you want that result on the Java platform? I have a sample like that, but it only runs on Android. If you don't care about the platform, I will write an answer with my sample code. – Kae10 Feb 05 '16 at 06:46
  • Android is fine :) as long as it's not done by using an utterance listener and breaking sentences down into words – Atul Vinayak Feb 05 '16 at 11:21
  • Oh, I used exactly the approach you want to exclude. Why not use it? Is there any special reason? – Kae10 Feb 05 '16 at 17:55
  • @Kae10 Tried this method. It sounds choppy. It sounds like "Mary..Had..A..Little..Lamb" instead of "Mary had a Little Lamb" – Atul Vinayak Feb 16 '16 at 11:34

2 Answers

2

Yes, this is the sort of problem that can be solved by a machine learning algorithm. As you point out, spelling alone does not determine utterance time. I would suggest using a machine learning algorithm, specifically a two-layer neural network, and training it on a large dataset. These algorithms are quite well known. The neural network can then give you an estimate of the time; it would learn, for example, how to estimate the time for 'Q' or 'Queue' depending on the context. Another advantage of using a machine learning algorithm is that if you decode live speech (i.e. a new input) to text, it will give you an estimate for this new input as well.

  • This has to run on a low power device, so ML doesn't cut it. This is a sub-problem of the general text-to-speech technology which has already been solved. Can you suggest something in this domain? – Atul Vinayak Jan 27 '16 at 07:05
1

I have an idea...

If you want a very precise result:

Keep a map that stores the measured utterance time for every possible word. This is exhaustive, but the implementation is self-explanatory and really easy.
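A minimal Java sketch of the map idea, in case it helps. The class name `UtteranceLookup` and the method `lookupMs` are hypothetical, and the three entries are just the made-up values from the question's table, not real measurements. Returning -1 for an unknown word is also only an assumption about how you might want to signal a miss.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: per-word lookup of measured utterance times.
class UtteranceLookup {
    // Placeholder values copied from the made-up table in the question.
    private static final Map<String, Integer> TIMES_MS = new HashMap<>();
    static {
        TIMES_MS.put("how", 101);
        TIMES_MS.put("can", 95);
        TIMES_MS.put("mirrors", 180);
    }

    // Returns the stored time in ms, or -1 when the word is not in the map.
    static int lookupMs(String word) {
        // Normalise case and drop punctuation (apostrophes are kept).
        String key = word.toLowerCase(Locale.ROOT).replaceAll("[^a-z']", "");
        return TIMES_MS.getOrDefault(key, -1);
    }

    public static void main(String[] args) {
        for (String w : "How can mirrors".split("\\s+")) {
            System.out.println(w + " " + lookupMs(w) + "ms");
        }
    }
}
```

A miss (-1) is the natural place to fall back to the approximation described next.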

If you want a good approximation to the result:

Get some initial data that tells you how long it takes to utter a syllable. A syllable can be short or long. Use that data to find out how long a short syllable takes (like a, the, queue) and how long a long syllable takes (like an, eyes, etc.). You can also record how much time each punctuation mark adds.

Sample:

short: 50ms
long: 100ms
comma: 20ms
full-stop: 35ms etc.

Now count the syllables of each kind and the punctuation marks, multiply each count by its time, and add everything up to get the result.

You can update the values if you find some exceptions, e.g. "screeched" is a single syllable but definitely takes much more than 100ms. You can also have more levels of time taken to utter a single syllable (the previous example had 2 levels: long/short). You could start with 4 levels (short/mid/long/very long).
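Here is a rough Java sketch of the count-and-multiply step with the 2-level sample values above. Everything in it is an assumption for illustration: the class name `SyllableTimer`, the rule that every run of vowels (including y) is one syllable, and the rule that a syllable is "long" when the vowel run is followed by a consonant and "short" otherwise. That rule happens to agree with the examples (a, the, queue are short; an, eyes are long), but it is crude and will miscount words such as "screeched", so treat it as a starting point.

```java
import java.util.Locale;

// Hypothetical estimator using the sample values from the answer.
class SyllableTimer {
    static final int SHORT_MS = 50;
    static final int LONG_MS = 100;
    static final int COMMA_MS = 20;
    static final int FULL_STOP_MS = 35;

    private static boolean isVowel(char c) {
        return "aeiouy".indexOf(c) >= 0; // assumption: y counts as a vowel
    }

    // Estimates the time for one token (a word plus attached punctuation).
    static int estimateMs(String token) {
        String t = token.toLowerCase(Locale.ROOT);
        int total = 0;
        int i = 0;
        while (i < t.length()) {
            char c = t.charAt(i);
            if (c == ',') {
                total += COMMA_MS;
                i++;
            } else if (c == '.') {
                total += FULL_STOP_MS;
                i++;
            } else if (isVowel(c)) {
                // Consume the whole vowel run as one syllable.
                while (i < t.length() && isVowel(t.charAt(i))) {
                    i++;
                }
                // Assumed rule: a following consonant makes the syllable long.
                boolean closed = i < t.length() && Character.isLetter(t.charAt(i));
                total += closed ? LONG_MS : SHORT_MS;
            } else {
                i++; // consonants and other characters add no time here
            }
        }
        return total;
    }

    public static void main(String[] args) {
        String sentence = "How can mirrors be real if our eyes aren't?";
        for (String w : sentence.split("\\s+")) {
            System.out.println(w + " " + estimateMs(w) + "ms");
        }
    }
}
```

Moving to 4 levels would mean replacing the long/short decision with a finer classification; how to classify is the part that needs real data.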

vish4071