It is possible to extract the default phonemes for a given word via SAPI by:

  1. Speak the word with text-to-speech and save the output to a .wav file
  2. Use the .wav file as input for speech recognition
  3. When the word is recognized, extract the phonemes from the recognized phrase elements
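
The three steps above can be sketched roughly as follows. This is an untested outline rather than the exact code in use: it assumes a project that references the Microsoft Speech Object Library (SpeechLib) and System.Windows.Forms, the class name and the `word.wav` scratch file are made up for illustration, and dictation-based recognition of a single synthesized word is not guaranteed to succeed.

```csharp
using System;
using System.Windows.Forms;
using SpeechLib; // COM reference: Microsoft Speech Object Library

class PhonemeRoundTrip // hypothetical name
{
    [STAThread]
    static void Main()
    {
        const string word = "hello";
        const string wavPath = "word.wav"; // hypothetical scratch file

        // Step 1: speak the word into a .wav file instead of the speakers.
        var outStream = new SpFileStream();
        outStream.Open(wavPath, SpeechStreamFileMode.SSFMCreateForWrite, false);
        var voice = new SpVoice();
        voice.AudioOutputStream = outStream;
        voice.Speak(word, SpeechVoiceSpeakFlags.SVSFDefault); // synchronous
        outStream.Close();

        // Step 2: use the .wav file as the recognizer's audio input.
        var inStream = new SpFileStream();
        inStream.Open(wavPath, SpeechStreamFileMode.SSFMOpenForRead, false);
        var recognizer = new SpInprocRecognizer();
        recognizer.AudioInputStream = inStream;
        var context = (SpInProcRecoContext)recognizer.CreateRecoContext();

        var converter = new SpPhoneConverter();
        converter.LanguageId = 1033; // US English

        // Step 3: on recognition, print the phonemes of each phrase element.
        context.Recognition += (streamNumber, streamPosition, type, result) =>
        {
            foreach (ISpeechPhraseElement element in result.PhraseInfo.Elements)
            {
                if (element.Pronunciation == null) continue; // no phone IDs supplied
                Console.WriteLine(element.DisplayText + ": " +
                                  converter.IdToPhone(element.Pronunciation));
            }
        };

        // Stop the message loop once the whole file has been processed.
        context.EndStream += (streamNumber, streamPosition, released) =>
            Application.ExitThread();

        ISpeechRecoGrammar grammar = context.CreateGrammar(0);
        grammar.DictationLoad("", SpeechLoadOption.SLOStatic);
        grammar.DictationSetState(SpeechRuleState.SGDSActive);

        Application.Run(); // SAPI automation events need a message pump
        inStream.Close();
    }
}
```

The phoneme strings obtained this way are the ones that come back without the "1" and "2" markers.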

However, I have not been able to capture the emphasis markers, if they are available at all ("1" and "2" per the American English Phoneme Table). Is there a way to do this?

EDIT: Here is what I've attempted so far (not pretty, but functional). Sadly, the SpeechVisemeFeature passed to the Phoneme event handler is always "SVF_None", even when I manually add emphasis to a word by modifying the SAPI Speech Dictionary. Does anyone know why this is?

using System;
using System.Threading;
using SpeechLib;
using System.Windows.Forms;

namespace PhoneEmphasis
{
    class Program
    {
        static string myWord = "hello";
        static SpPhoneConverter c = new SpPhoneConverter();
        static Thread t = null;

        static void Main(string[] args)
        {
            c.LanguageId = 1033;
            t = new Thread(test);
            t.Start();
            t.Join();
            Console.WriteLine("done");
            Console.ReadLine();
        }

        private static void test()
        {
            SpVoice v = new SpVoice();
            // Optional: v.EventInterests = SpeechVoiceEvents.SVEAllEvents;
            v.Phoneme += new _ISpeechVoiceEvents_PhonemeEventHandler(Phoneme_Handler);
            v.EndStream += new _ISpeechVoiceEvents_EndStreamEventHandler(EndStream_Handler);
            v.Speak(myWord, SpeechVoiceSpeakFlags.SVSFlagsAsync);
            Application.Run();
        }

        private static void Phoneme_Handler(int StreamNumber, object StreamPosition, int Duration, short NextPhoneId, SpeechVisemeFeature Feature, short CurrentPhoneId)
        {
            Console.WriteLine("Phoneme = " + c.IdToPhone(CurrentPhoneId).ToString() + " , VisemeFeature = " + Feature.ToString());
        }

        private static void EndStream_Handler(int StreamNumber, object StreamPosition)
        {
            Console.WriteLine("end stream!");
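            // End this thread's message loop so test() returns and t.Join() unblocks,
            // instead of aborting the thread from inside its own event handler.
            Application.ExitThread();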
        }
    }
}
Exergist
  • It's unlikely that emphasis markers will be available in the phonemes, as the SR engine intentionally ignores emphasis. Have you considered using the SPEI_PHONEME event from the TTS engine? – Eric Brown May 22 '19 at 23:03
  • Thanks for the suggestion! This led me to find the SpVoice Phoneme event. But I'm really struggling to get it to fire. I'll update my original question with the code I'm trying. @EricBrown maybe you have some suggestions? – Exergist May 23 '19 at 20:28
  • Updated again with the events working – Exergist May 23 '19 at 21:29
  • This may not be the underlying problem, but it's possible that the phoneme sets are being translated from SAPI to UPS, and the code that does the phoneme conversion looks like it strips the SpeechVisemeFeature as part of the conversion. You could *try* to call [`ISpPhoneticAlphabetSelection::SetAlphabetToUPS`](https://learn.microsoft.com/en-us/previous-versions/windows/desktop/ee450895), which is implemented by ISpVoice, and see if changing the phoneme alphabet works better for you. Unfortunately, you're going to have to use C++ for that. – Eric Brown May 25 '19 at 00:07
  • Incidentally, if that doesn't work, I can't help much further. The TTS engine is remarkably opaque code, and almost entirely data-driven. – Eric Brown May 25 '19 at 00:09
  • I'll give that a try. I also tried this example with no success https://learn.microsoft.com/en-us/dotnet/api/system.speech.synthesis.phonemereachedeventargs?view=netframework-4.8. – Exergist May 25 '19 at 17:52
  • If I were to lift the restriction of using SAPI, can the emphasis for a given word be found by other means? Maybe as UPS? – Exergist May 25 '19 at 17:58
  • Let’s step back for a second. Why do you need the emphasis markers? – Eric Brown May 27 '19 at 19:46
  • I'm developing an application to supercharge the speech dictionary interaction experience. I can already extract default pronunciations from a given word, but they don't include the emphasis markers. The word "hello" should actually be pronounced "h eh - l ow 1," but I currently can't get the "1" even though TTS clearly voices it. – Exergist May 27 '19 at 23:17
  • Well, I *suppose* you could use [ISpEnginePronunciation](https://learn.microsoft.com/en-us/previous-versions/windows/desktop/ms717841(v=vs.85)), but it's a pain to use, and isn't guaranteed to give you back a single pronunciation, even when you supply context. You'll also have to do this in C++, as there's no automation interface for this. – Eric Brown May 29 '19 at 04:22
  • Damn. And coding in C++ steps well outside my knowledge base. I wish there was a way to see what the output would be from ISpEnginePronunciation and if it includes the emphasis markers. @EricBrown are there any examples that could be referenced for how to do this in C++? – Exergist May 30 '19 at 18:13
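
For reference, the marker notation discussed in the comments ("h eh - l ow 1") can be read mechanically. The sketch below is plain C# with no SAPI dependency. It assumes that "-" separates syllables and that a "1" or "2" applies to the syllable it appears in, as in the "hello" example. The class and method names are made up for illustration.

```csharp
using System;
using System.Collections.Generic;

static class StressMarkers // hypothetical helper, not part of SAPI
{
    // Splits a phoneme string into syllables and reports each syllable's
    // stress level: 0 = unmarked, 1 = primary, 2 = secondary.
    public static List<(string Phones, int Stress)> Parse(string pronunciation)
    {
        var syllables = new List<(string Phones, int Stress)>();
        var phones = new List<string>();
        int stress = 0;

        foreach (string token in pronunciation.Split(
                     new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
        {
            if (token == "-")
            {
                // Syllable boundary: emit what has been collected so far.
                if (phones.Count > 0)
                    syllables.Add((string.Join(" ", phones), stress));
                phones.Clear();
                stress = 0;
            }
            else if (token == "1" || token == "2")
            {
                stress = int.Parse(token); // stress marker for this syllable
            }
            else
            {
                phones.Add(token); // ordinary phoneme symbol
            }
        }

        if (phones.Count > 0)
            syllables.Add((string.Join(" ", phones), stress));
        return syllables;
    }

    static void Main()
    {
        foreach (var (phones, stress) in Parse("h eh - l ow 1"))
            Console.WriteLine(phones + " -> stress " + stress);
        // prints:
        // h eh -> stress 0
        // l ow -> stress 1
    }
}
```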
