Synthesize phoneme pairs on OSX

Question

I need to create wave-files of 144 phoneme-pairs, such as "Da Di Du, Beh Bi Burr, ..."

Specifically I need each one to maintain a constant pitch, so that I can pitch-shift them to make musical notes (If I could input pitch values that would be even better!).

I don't really want to record 144 .WAV files of me trying to sing them.

Can I do this using OSX's inbuilt speech synthesis API?

If not, is there any other way I can do it?

EDIT: I don't require any particular quality grade. The important thing is that each utterance is distinguishable and at the correct pitch.

EDIT: I will put my attempts at solving this below; if I reach something I'm happy with, I will break it out into an answer.

Apple's Speech Synthesis Programming Guide seems to have everything: it covers controlling the pitch using contours and typing phonetic input.

However, it would be a lot of work to figure out the whole API and write an OS X project to do it. So I'm interested in command-line options or in using existing synthesisers.

CRGreen's answer uses parameters to 'say' that I can't find documented in the man page.

Just found an example here: http://hints.macworld.com/article.php?story=20120204172337402

EDIT: Phonemes https://apple.stackexchange.com/questions/53858/in-terminal-how-to-get-say-to-say-things-right-ie-using-custom-phonetics

CRGreen · Accepted Answer · 2014-05-22T05:48:01.103

In AppleScript script editor:

set diphones to {"Dah", "Di", "Du", "Beh", "Bi", "Burr"} --etc.

set targetFolder to ((choose folder) as text)

repeat with p in diphones
    say p using "Vicki" pitch 55 modulation 0 saving to (targetFolder & p & ".aif")
end repeat

Then convert the files to WAV.
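For the conversion step, one option on macOS is the bundled afconvert tool. The following is a minimal shell sketch, not part of the original answer: the function names wav_name and convert_dir are hypothetical, and it assumes all the .aif files sit in a single folder and that 16-bit PCM output is acceptable.

```shell
#!/bin/sh
# Hypothetical helper: map "name.aif" to "name.wav" (plain string handling).
wav_name() {
    printf '%s\n' "${1%.aif}.wav"
}

# Hypothetical helper: convert every .aif file in a folder to WAV.
# Requires afconvert, which ships with macOS.
convert_dir() {
    dir="${1:-.}"
    for src in "$dir"/*.aif; do
        [ -e "$src" ] || continue   # skip if the glob matched nothing
        # -f WAVE selects the WAV container; -d LEI16 is little-endian 16-bit PCM
        afconvert -f WAVE -d LEI16 "$src" "$(wav_name "$src")"
    done
}

# Example (run on macOS): convert_dir "/path/to/target/folder"
```

The conversion is kept in a function rather than run on load, so the file can be sourced and tried on one folder at a time.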

There are a few other options available in the "say" command dictionary.

I don't think it is as simple as that, however. How the speech synth treats these diphones can be weird, and can even differ according to which voice you use. You may have to manipulate quite a few sounds to get them the way you want. For example, Vicki says "Di" like "DEE" and "Bi" like "BYE". It is really hard to get those voices to intone a short "i" (as in "big") as just the diphone. It may even be necessary to have it say "big" (for example), then edit the sound in Audacity, cutting off the end and putting a fade-out at the end of the edited version, then exporting that. I just did this and it works, but yeah, you'll need to do some special-case adjustments. If you have the Developer Tools, there is also an app called "Repeat After Me" which allows you to "tune" spoken text, but (surprisingly) for the situation I just described, it doesn't help. (It is pretty powerful for larger chunks, though.)

[edit] so, yes, the phonetic input version of the above could be like this:

set diphones to {"dAO", "dIH", "dAX", "bEH", "bIH", "brr"} --etc., changed to be phonetic based on Apple's system

set targetFolder to ((choose folder) as text)

repeat with p in diphones
    say ("[[inpt PHON]]" & p & "[[inpt TEXT]]") using "Vicki" pitch 52 modulation 0 saving to (targetFolder & p & ".aif")
end repeat

[ADDENDUM]

Years ago Apple's voices would all act the same, and you could tune any voice to perfectly sing a song (I did the "Star Spangled Banner" one night). Then, for some reason, the developers not only changed the voices, but took away the consistency so that some voices behave completely differently compared to others. I wasn't happy about this. Consider the following:

Using the default voice ("Alex"), the following utterance is (you'll be encouraged to find) as even as can be:

say "[[inpt TUNE]] d {D 114; P 95.0:100} UW {D 227; P 95.0:100} 1IY {D 382; P 95.0:100} . {D 30} [[inpt TEXT]]" using "Alex"

But if you use "Cellos" or "Pipe Organ", you get that bizarre lift at the end, even if you use this TUNE mode. Don't ask me why. So how did I get this to work, at least for "Alex"? I used the aforementioned "Repeat After Me" app and simplified the "tuned" output. I think you can probably get what you want using some variation of TUNE and PHON. But you'll probably have to stay away from "Cellos" and "Pipe Organ" because they are problematic for making monotonous intonations (although they may be fine for certain diphones/triphones). And maybe you'll have to use both, which is, I know, annoying. I feel your pain.
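The TUNE string used with "Alex" follows a regular pattern: each phoneme is followed by a duration (D) and a pitch point (P, written value:percentage). A minimal shell sketch for building such a flat-pitch string is given here. The helper names flat_tune and midi_to_hz are hypothetical, and the sketch assumes that D is in milliseconds, that P is a frequency in hertz, and that MIDI note 69 (A4) is tuned to 440 Hz.

```shell
#!/bin/sh
# Hypothetical helper: convert a MIDI note number to a frequency in Hz,
# assuming equal temperament with A4 (note 69) = 440 Hz.
midi_to_hz() {
    awk -v n="$1" 'BEGIN { printf "%.1f\n", 440 * 2 ^ ((n - 69) / 12) }'
}

# Hypothetical helper: build a TUNE string in which every phoneme gets the
# same pitch. Usage: flat_tune PITCH PHONEME:DURATION ...
# Assumes DURATION is in milliseconds and PITCH is in Hz.
flat_tune() {
    pitch="$1"
    shift
    out='[[inpt TUNE]]'
    for pair in "$@"; do
        phon="${pair%%:*}"   # text before the colon
        dur="${pair##*:}"    # text after the colon
        out="$out $phon {D $dur; P $pitch:100}"
    done
    # Short trailing pause, then switch back to plain text input
    printf '%s\n' "$out . {D 30} [[inpt TEXT]]"
}

# Example (run on macOS):
#   say -v Alex "$(flat_tune 95.0 d:114 UW:227 1IY:382)"
#   say -v Alex "$(flat_tune "$(midi_to_hz 45)" d:114 UW:227 1IY:382)"
```

Called with 95.0 and the three phoneme:duration pairs in the example comment, flat_tune reproduces the string passed to "Alex" above, so the same helper could be reused with other pitches.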

One more variation. Notice the way the following "rate" tag forces a longer utterance:

say "[[rate - 66]] [[inpt TUNE]] d {D 114; P 95.0:100} UW {D 227; P 95.0:100} 1IY {D 382; P 95.0:100} . {D 30} [[inpt TEXT]]" using "Alex"

[ADDENDUM II]

Ah, but check this out. This evens out the "Pipe Organ": it gets rid of the end lift by forcing a baseline pitch change ("pbas") before the last phoneme:

say "[[rate - 66]] [[inpt TUNE]] d {D 114; P 95.0:100} UW {D 227; P 95.0:100} [[pbas - 5]] 1IY {D 382; P 95.0:100} . {D 30} [[inpt TEXT]]" using "Pipe Organ"

They're making us work way too hard here :-)

Here's a simplified version, going back to your original but sticking that pbas in there:

say "[[inpt TUNE]] d UW [[pbas - 5]] 1IY [[inpt TEXT]]" using "Pipe Organ"

Unfortunately this is no good for my purpose, as the pitch changes within the utterance, just as human speech would. If only I could find something more primitive... — P i, May 19 '14 at 21:13
Using the `modulation 0` parameter helps to remove pitch changes. Not sure what your acceptable tolerances are. See edited code. How good of a voice do you need? espeak is an open source command line program that has similar capabilities (and outputs WAVs), but the voice is pretty, let's say, robotic. — CRGreen, May 20 '14 at 01:45
Where do you find this? Man-page for 'say' doesn't mention it. Is it possible to do a single test from the command line without AppleScript? I've edited the question. — P i, May 20 '14 at 09:52
@Pi, the AppleScript version of say has more options. see http://applescript.wikia.com/wiki/Say (you can access the dictionary via the script editor) — CRGreen, May 20 '14 at 17:05
Why you would have such an aversion to opening the script editor I don't know, but to test in the command line: `osascript -e "say \"dah\" using \"Vicki\" pitch 55 modulation 0"` (this is using applescript, of course) — CRGreen, May 20 '14 at 17:10
Ach, yes - [[inpt PHON]] is the way to go. You just have to customize your list of diphones. I wrote a whole phonemic lip sync thing using apple's phoneme system a while ago and kind of forgot about how well [[inpt PHON]] actually works -- especially in your case where you can use modulation 0 and get the short diphones to be very even and simple. — CRGreen, May 21 '14 at 05:25
I've had a play with it and knocked out a script which I've put as an answer. Do you have any idea why it mangles the last vowel when there is more than one? PS the script editor and I do not like one another. It has crashed twice and refuses to save edited files. — P i, May 21 '14 at 16:42
And if you're wondering why I keep putting the [[inpt TEXT]] in there, it's because it is generally considered bad practice to leave it in a non-default state. If a future user just wants to enter text to speak, they'll get bad results with it left in one of these other "inpt" modes. — CRGreen, May 22 '14 at 05:54

score 1 · Answer 2 · answered May 21 '14 at 16:41

I've managed to get it kind of working with the following script:

-- to run, '/usr/bin/osascript genPhonemes'

-- https://developer.apple.com/library/mac/documentation/UserExperience/Conceptual/SpeechSynthesisProgrammingGuide/Phonemes/Phonemes.html
-- http://stackoverflow.com/questions/23742648/synthesize-phoneme-pairs-on-osx
-- http://applescript.wikia.com/wiki/Say

-- Phoneme symbols follow Apple's phoneme table (first link above)
set Vowels to {"AA", "AY", "EH", "EY", "IY", "AO", "OY", "UW", "UWIY", "AX", "AXIY", "IH"}
set Consonants to {"d", "b", "r", "N", "m", "v", "S", "z", "h", "l", "k", "t"}
-- 13 offsets are listed, but the loop below only reads the first 12
set NoteOffsets to {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, -3, -2, -1}
set NoteNumbers to {"00", "01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11"}

set targetFolder to "OUT" -- ((choose folder) as text)

repeat with i from 1 to 12
    set C to (item i of Consonants)

    -- one pitch per consonant: 48 (60 - 12) plus that consonant's offset
    set midinote to 60 - 12 + (item i of NoteOffsets)

    repeat with j from 1 to 12
        set V to (item j of Vowels)

        -- the number in the file name is the vowel index, e.g. "OUTd_00.aif"
        set filename to targetFolder & C & "_" & (item j of NoteNumbers) & ".aif"

        set utterance to "[[inpt PHON]]" & C & V

        say utterance using "Pipe Organ" speaking rate 120 pitch midinote modulation 0 saving to filename
    end repeat
end repeat
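For anyone who prefers the command line, the same enumeration can be sketched in plain shell. This is an illustration under assumptions, not a tested replacement for the script above: the function name list_pairs is hypothetical, a single base note is used for every pair instead of one note per consonant, and it assumes that the embedded commands [[pbas N]] and [[pmod 0]] behave like the AppleScript pitch and modulation parameters.

```shell
#!/bin/sh
# Phoneme lists copied from the AppleScript above.
VOWELS="AA AY EH EY IY AO OY UW UWIY AX AXIY IH"
CONSONANTS="d b r N m v S z h l k t"

# Hypothetical helper: print one tab-separated line per consonant/vowel pair:
# the output file name, then the text to hand to say.
list_pairs() {
    base_note="${1:-48}"   # assumed default: 60 - 12, as in the script above
    for c in $CONSONANTS; do
        j=0
        for v in $VOWELS; do
            printf '%s_%02d.aiff\t[[pbas %s]] [[pmod 0]] [[inpt PHON]]%s%s[[inpt TEXT]]\n' \
                "$c" "$j" "$base_note" "$c" "$v"
            j=$((j + 1))
        done
    done
}

# Example (run on macOS, with an existing OUT folder):
#   list_pairs 48 | while IFS="$(printf '\t')" read -r file text; do
#       say -v "Pipe Organ" -r 120 -o "OUT/$file" "$text"
#   done
```

Printing the pairs first makes it possible to inspect all 144 utterances before any audio is generated. The unquoted loops rely on word splitting, so the file should be run with sh or bash, not sourced into zsh.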

For some reason vowel-pairs are coming out wrong. The second vowel is getting lifted in pitch. Using Pipe Organ, the last vowel is a perfect fourth higher.

So e.g. dUWIY, which sounds like "doo-ee", the ee at the end is a perfect fourth higher.

The only other suitable voice is Cellos, which also mangles it, though by a smaller interval, maybe a semitone.

Is there any way to fix this?

The best way to get rid of the "lift" at the end is to use a "pbas" baseline pitch change right before that last phoneme. See the last two examples in the answer. — CRGreen, May 22 '14 at 21:22
