Algorithm for concatenating speech audio to sound continuous?

Question

I'm building a simple program that speaks phone numbers in a human voice.

For that I pre-recorded each digit (with different intonations), and when I get a number I join the audio files and play them together with some silence added between the numbers.

However, this doesn't sound smooth or natural.

I tried to do gain and tempo normalization on the files but it feels like I need to join them in some "smart" way so that the transition will sound natural.

I looked for some algorithms to do that but didn't find anything.

Is there a known method for that?

Thanks.

It would be helpful if you could add a visualization of one resulting signal, including the spectrum. You could use [praat](http://www.praat.org) for that. It would make it easier to spot simple issues, e.g. an abrupt transition from background noise to absolute silence. – Lars Schillingmann, Nov 02 '17 at 22:17
If you are after a simple approach, you could look into "legato" (from music) and apply that to the voice: record "legatos" between the various numbers and use those for the transitions. – , Nov 03 '17 at 04:51

score 8 · Answer 1 · answered Oct 25 '17 at 08:43

The algorithm is called PSOLA (Pitch-Synchronous Overlap-Add). There are variations such as TD-PSOLA (the time-domain version).

Overall there are many things involved here: how to decide which items to join based on their acoustic properties, the source intonation, and the required target intonation. It is all pretty complex to implement, so it is better to use an existing open source TTS system or synthesizer that already covers all of this. You can check Festvox or OpenMary (MaryTTS).
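
To make the "overlap" part concrete, here is a minimal sketch of a plain crossfade between two recordings. Note that this is not PSOLA: PSOLA additionally aligns the overlap to pitch periods, which this sketch does not attempt. The function name `crossfade_concat` and the `overlap` parameter are made up for illustration, and samples are assumed to be plain sequences of floats at the same sample rate.

```python
import math

def crossfade_concat(a, b, overlap):
    """Join sample sequences a and b, blending the last `overlap` samples
    of a with the first `overlap` samples of b (raised-cosine ramp).
    Hypothetical helper: a simple crossfade, not pitch-synchronous."""
    # Never overlap more samples than either recording contains
    overlap = min(overlap, len(a), len(b))
    if overlap == 0:
        return list(a) + list(b)

    head = list(a[:len(a) - overlap])   # part of a left untouched
    tail = list(b[overlap:])            # part of b left untouched
    start = len(a) - overlap

    blended = []
    for i in range(overlap):
        # w rises smoothly from near 0 to near 1 across the overlap
        w = 0.5 - 0.5 * math.cos(math.pi * (i + 0.5) / overlap)
        blended.append((1.0 - w) * a[start + i] + w * b[i])

    return head + blended + tail
```

The result is `overlap` samples shorter than a plain concatenation. As an assumed starting point, at a 16 kHz sample rate a 10 ms overlap would be 160 samples; the right value depends on the recordings.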

Nikolay Shmyrev

Thanks. I think my problem is much simpler than a full TTS. I'm always joining words with spaces around them. – Ran Oct 25 '17 at 09:11
Silence between words always sounds unnatural; it is very rare in real speech. If you want to synthesize natural speech and you actually care about your users, you should join words continuously. – Nikolay Shmyrev Oct 29 '17 at 00:20
Thanks. What do you mean by joining continuously? Pairs of words? @nikolay – Ran Oct 29 '17 at 08:39
You can check the algorithm description - it overlaps recordings for smooth transitions between words. – Nikolay Shmyrev Nov 02 '17 at 23:46

A. STEFANI · Answer 2 · 2017-11-03T03:57:40.707

People say phone numbers in blocks of digits.

A block usually contains between 1 and 4 digits, and a single phone number sometimes mixes blocks of different sizes.

To generate something that reads out a phone number like a natural voice, you need to define at least two different silence variables:

dtNumber = silence applied between two numbers within a block
dtBlock = silence applied between two blocks of numbers

First split the phone number into a list of blocks:

01-12-13-14-15 => [01,12,13,14,15]

1-888-452-1505 => [1,888,452,1505]

Iterate over all blocks (waiting dtBlock seconds between two of them), and within each block iterate over its numbers (waiting dtNumber seconds between two of them).

If you choose something like dtBlock >= 2 x dtNumber, you will get a sound file that sounds natural.
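
The steps above can be sketched roughly as follows. This is only an illustration: the function names, the `-` separator, and the default durations (0.1 s and 0.25 s, chosen so that dtBlock >= 2 x dtNumber) are assumptions, and the output is a playback plan rather than actual audio.

```python
def split_blocks(phone_number, sep="-"):
    """Split '1-888-452-1505' into ['1', '888', '452', '1505']."""
    return [block for block in phone_number.split(sep) if block]

def build_sequence(phone_number, dt_number=0.1, dt_block=0.25):
    """Return a playback plan: ('digit', d) and ('silence', seconds) items.
    dt_number / dt_block correspond to dtNumber / dtBlock above;
    the default values are placeholders to be tuned by ear."""
    sequence = []
    for block_index, block in enumerate(split_blocks(phone_number)):
        if block_index > 0:
            # Longer pause between two blocks
            sequence.append(("silence", dt_block))
        for digit_index, digit in enumerate(block):
            if digit_index > 0:
                # Shorter pause between two numbers inside a block
                sequence.append(("silence", dt_number))
            sequence.append(("digit", digit))
    return sequence
```

Each `('digit', d)` item would then be mapped to the matching recording, and each `('silence', s)` item to `s` seconds of silence (or low-level room noise) before joining everything.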
