Playing ogg voices as smoothly as possible in Java - TTS application

Question

I am developing a text-to-speech (TTS) system for my own language in Java (it's a final project, and no TTS has been developed for this language before, so I cannot use built-in classes).

I can recognize the diphones for input text.

For playback, I place the diphones in an array once the input text analysis is complete. I then play the audio files (which are in ogg format) that correspond to the diphones in the array, one by one.

What do you think of this method for playing separate diphones? Right now there are (big) gaps between the audio clips, and I am trying to smooth them out. Any ideas?

score 1 · Answer 1 · answered Feb 25 '12 at 13:00

In diphone synthesis it is common to split the diphones at the middle of a phone, where it is most stable, and stitch them together that way. So, for example, to synthesize the word "meeting" I would start with an m-iy diphone (in ARPAbet symbols), then cut it off in the middle of the iy and splice into an iy-dx diphone in which both phones were split in half, and so on, ending with an ix-ng diphone where the ng is complete.

In order to do this you need to know the time index in each .ogg that corresponds to the middle of a continuous phone or the gap between closure and release of a stop.
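As a rough sketch of that splice (not taken from any textbook or library): assume each .ogg has already been decoded to 16-bit mono PCM by whatever Ogg decoder you use (Java's standard library does not decode Ogg Vorbis on its own), and assume you have measured the cut points yourself. `DiphoneSplicer`, `Diphone`, `start` and `end` are hypothetical names used only for illustration.

```java
import java.util.List;

/** Hypothetical helper: joins decoded diphones at their mid-phone cut points. */
final class DiphoneSplicer {

    /** One decoded diphone: mono PCM samples plus the range [start, end) to keep. */
    static final class Diphone {
        final short[] pcm;
        final int start; // first sample to keep (assumed: middle of the first phone)
        final int end;   // one past the last sample to keep (assumed: middle of the second phone)

        Diphone(short[] pcm, int start, int end) {
            this.pcm = pcm;
            this.start = start;
            this.end = end;
        }
    }

    /** Converts a time index in seconds to a sample index. */
    static int toSampleIndex(double seconds, float sampleRate) {
        return (int) Math.round(seconds * sampleRate);
    }

    /** Copies the kept range of every diphone into one contiguous buffer. */
    static short[] splice(List<Diphone> diphones) {
        int total = 0;
        for (Diphone d : diphones) {
            total += d.end - d.start;
        }
        short[] out = new short[total];
        int pos = 0;
        for (Diphone d : diphones) {
            int len = d.end - d.start;
            System.arraycopy(d.pcm, d.start, out, pos, len);
            pos += len;
        }
        return out;
    }
}
```

Writing the joined buffer to a single `javax.sound.sampled.SourceDataLine` in one go, rather than starting a new clip per file, should also help with the pauses between clips.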

answered Feb 25 '12 at 13:00

Russell Zahniser


my problem is with playing these diphones, how I can play them in a way that it's smooth and without gaps between them?? – Nawras Feb 26 '12 at 17:50
So, my suggestion would be to play the first .ogg just up to an index in the middle of the second phone, then immediately start the second .ogg from halfway through the same phone. (You could smooth the transition a little by fading in and out, but the basic idea is to splice mid-phone) – Russell Zahniser Feb 26 '12 at 19:19
nice idea, but could you show me a simple example for how to do it? or post a link that talks about this. – Nawras Mar 02 '12 at 11:44
Unfortunately the idea comes from a textbook (Jurafsky & Martin) that doesn't seem to be searchable online. If it were me, I would start by trying to stitch phones together in an audio editing program where I could carefully align the waveforms; once that sounded good I would try to figure out how to reproduce the same thing in code. – Russell Zahniser Mar 02 '12 at 13:43
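
For the fading in and out mentioned in the comments, one simple option is a linear crossfade over a short overlap at the splice point. This is only a sketch under assumptions: both clips are already decoded to 16-bit mono PCM at the same sample rate, and `Crossfade`, `join` and `overlap` are made-up names, not part of any library. A few milliseconds of overlap is a reasonable place to start experimenting.

```java
/** Hypothetical helper: joins two PCM clips with a linear crossfade. */
final class Crossfade {

    /**
     * Overlaps the last `overlap` samples of a with the first `overlap`
     * samples of b, fading a out while fading b in.
     */
    static short[] join(short[] a, short[] b, int overlap) {
        if (overlap < 0 || overlap > a.length || overlap > b.length) {
            throw new IllegalArgumentException("overlap out of range");
        }
        short[] out = new short[a.length + b.length - overlap];
        int head = a.length - overlap;

        // Part of a before the overlap is copied unchanged.
        System.arraycopy(a, 0, out, 0, head);

        // Inside the overlap, the weight of b rises linearly towards 1.
        for (int i = 0; i < overlap; i++) {
            double w = (i + 1) / (double) (overlap + 1);
            double mixed = (1.0 - w) * a[head + i] + w * b[i];
            out[head + i] = (short) Math.round(mixed);
        }

        // Part of b after the overlap is copied unchanged.
        System.arraycopy(b, overlap, out, a.length, b.length - overlap);
        return out;
    }
}
```

With `overlap` set to 0 this reduces to plain concatenation, so it is easy to compare the two by ear.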
