I am developing a text-to-speech (TTS) system for my own language in Java. It's a final project, and nothing like it has been developed before, so I cannot use built-in classes.
I can already identify the diphones for a given input text.
For playback, I place the diphones in an array once the input text analysis is complete. I then play the corresponding audio files (which are in Ogg format) one by one, in array order.
What do you think of this method for playing separate diphones? Right now there are (big) gaps between consecutive audio clips, and I am trying to smooth them out. Any ideas?
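
To make the setup concrete, here is a minimal sketch of one arrangement that might avoid the pauses: join the diphone audio into a single buffer and write it to one `SourceDataLine` that stays open for the whole utterance, instead of opening a new clip per diphone. It is only an illustration under assumptions, not my actual code. It assumes every diphone has already been decoded from Ogg to raw PCM in the same format (as far as I know, the stock `javax.sound.sampled` does not decode Ogg without an extra decoder library), and the names `DiphoneSequencer`, `concatenate`, and `play` are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.List;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.LineUnavailableException;
import javax.sound.sampled.SourceDataLine;

// Hypothetical helper class; assumes all diphones share one PCM format.
final class DiphoneSequencer {

    // Joins the already-decoded PCM buffers, in array order, into one buffer.
    static byte[] concatenate(List<byte[]> diphonePcm) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] pcm : diphonePcm) {
            out.write(pcm, 0, pcm.length);
        }
        return out.toByteArray();
    }

    // Plays the joined buffer through a single line that is opened once.
    // The buffer length must be a whole number of frames for the given format.
    static void play(byte[] pcm, AudioFormat format) throws LineUnavailableException {
        SourceDataLine line = AudioSystem.getSourceDataLine(format);
        try {
            line.open(format);
            line.start();
            line.write(pcm, 0, pcm.length); // blocks until all data is queued
            line.drain();                   // wait until queued audio has played
        } finally {
            line.close();
        }
    }

    public static void main(String[] args) {
        // Stand-in data: two tiny "diphones" of 16-bit mono samples.
        byte[] first = {0, 0, 10, 0};
        byte[] second = {20, 0, 30, 0, 40, 0};
        byte[] joined = concatenate(Arrays.asList(first, second));
        System.out.println("Joined " + joined.length + " bytes");

        // With real decoded diphones, playback could then look like this
        // (assumed format: 16 kHz, 16-bit, mono, signed, little-endian):
        // play(joined, new AudioFormat(16000f, 16, 1, true, false));
    }
}
```

Plain concatenation like this only removes the pauses caused by starting each clip separately. Any silence stored at the start or end of the recordings themselves would still be audible and would need trimming or smoothing at the joins.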