I'm building a simple program that speaks phone numbers in a human voice.
To do that, I pre-recorded each digit (with several intonations); when I get a number, I concatenate the corresponding audio files, with a short silence between digits, and play the result.
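For concreteness, here is a minimal sketch of that joining step. Everything specific in it is an assumption for illustration: the 8 kHz sample rate, the 150 ms gap, the names `join_with_silence`, `digit_clips` and `speak`, and the synthetic tones that stand in for the real digit recordings. Clips are plain lists of float samples in [-1, 1].

```python
import math

SAMPLE_RATE = 8000  # assumed sample rate in Hz (placeholder)


def silence(ms, rate=SAMPLE_RATE):
    """Return `ms` milliseconds of silence as a list of zero samples."""
    return [0.0] * (rate * ms // 1000)


def join_with_silence(clips, gap_ms=150, rate=SAMPLE_RATE):
    """Concatenate clips, inserting a fixed silent gap between neighbours."""
    out = []
    for i, clip in enumerate(clips):
        if i > 0:
            out.extend(silence(gap_ms, rate))
        out.extend(clip)
    return out


def tone(freq, ms, rate=SAMPLE_RATE):
    """Synthetic stand-in for a recording: a sine tone of `ms` milliseconds."""
    n = rate * ms // 1000
    return [math.sin(2 * math.pi * freq * t / rate) for t in range(n)]


# Hypothetical clip table: one 200 ms tone per digit instead of real recordings.
digit_clips = {d: tone(300 + 40 * int(d), 200) for d in "0123456789"}


def speak(number):
    """Build the sample sequence for a string of digits."""
    return join_with_silence([digit_clips[d] for d in number])
```

Calling `speak("555")` returns three clips separated by two gaps. The hard cut from each clip into silence and back is the transition in question.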
However, this doesn't sound smooth or natural.
I tried normalizing the gain and tempo of the files, but it feels like I need to join them in some "smart" way so that the transitions sound natural.
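To illustrate the gain part only, a minimal sketch of RMS-based gain normalization follows. It assumes clips are lists of float samples in [-1, 1]; the target level of 0.1 and the function names are placeholders, and tempo normalization is not shown.

```python
import math


def rms(samples):
    """Root-mean-square level of a clip (0.0 for an empty clip)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def normalize_gain(samples, target_rms=0.1):
    """Scale a clip so its RMS level matches `target_rms` (assumed target)."""
    level = rms(samples)
    if level == 0.0:
        # A silent clip cannot be scaled to a target level; return it unchanged.
        return list(samples)
    scale = target_rms / level
    # Clamp to the valid range in case scaling pushes peaks past full scale.
    return [max(-1.0, min(1.0, s * scale)) for s in samples]
```

This only matches the overall loudness of the clips. It does not change how one clip ends and the next begins.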
I looked for algorithms that do this but didn't find anything.
Is there a known method for this?
Thanks.