What characteristics should a .wav file produced by a TTS engine have to be heard in high quality?

Question

I'm trying to generate a high-quality voice-over using the Microsoft Speech API. What values should I pass to this constructor to guarantee high-quality audio?

The .wav file will later be fed to FFmpeg, so the audio will be re-encoded into a more compact form. My main goal is to keep the voice as clear as I can, but I really don't know which values give the best quality as perceived by humans.

Alexander · Accepted Answer · 2014-10-28T13:46:52.453

First of all, just to let you know, I haven't used this Speech API, so I'll give you an answer based on my audio processing work.

You can choose EncodingFormat.Pcm for Pulse Code Modulation.
samplesPerSecond is the sampling frequency. Because it is voice, 16000 Hz will certainly cover it. If you are a real perfectionist you can go with 22050 Hz, for example. The higher the value, the larger the audio file. If file size isn't a problem you can even go with 32000 or 44100 Hz, but there won't be much noticeable difference.
bitsPerSample: go with 16 if possible.
channelCount: 1 or 2 (mono or stereo). It won't affect the quality of the sound.
averageBytesPerSecond: this would be samplesPerSecond * bytesPerSample * numberOfChannels (for example, 22050 * 2 * 1 = 44100 for 16-bit mono).
blockAlign: this would be bytesPerSample * numberOfChannels (for example, if you have 16-bit PCM mono audio, 16 bits are 2 bytes and mono is 1 channel, so blockAlign is 2 * 1 = 2).
The last one, the byte array, doesn't say much for itself. I'm not sure what it is for; I believe the first six arguments are enough for the audio to be generated.
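The arithmetic for the two derived values can be sketched in a few lines. This is only an illustration in Python, not Speech API code: the function name `pcm_format_fields` and the dictionary keys are made up for this example, and it assumes uncompressed PCM with a whole number of bytes per sample.

```python
def pcm_format_fields(samples_per_second: int, bits_per_sample: int, channel_count: int) -> dict:
    """Derive blockAlign and averageBytesPerSecond for uncompressed PCM.

    Hypothetical helper for illustration; not part of any speech library.
    """
    if bits_per_sample % 8 != 0:
        raise ValueError("bits_per_sample must be a multiple of 8 for this sketch")

    bytes_per_sample = bits_per_sample // 8
    # One block (frame) holds one sample for every channel.
    block_align = bytes_per_sample * channel_count
    # Bytes consumed per second of audio: frames per second times frame size.
    average_bytes_per_second = samples_per_second * block_align

    return {
        "samples_per_second": samples_per_second,
        "bits_per_sample": bits_per_sample,
        "channel_count": channel_count,
        "block_align": block_align,
        "average_bytes_per_second": average_bytes_per_second,
    }


if __name__ == "__main__":
    # 22050 Hz, 16-bit, mono: the example values used in the answer.
    print(pcm_format_fields(22050, 16, 1))
```

For 22050 Hz, 16-bit mono this gives a blockAlign of 2 and 44100 bytes per second; switching to stereo doubles both.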

I hope this was helpful. Cheers!

Thanks a lot, the second point was very useful because it's the part where I have the least knowledge and the most doubts ;) — , Sep 28 '14 at 23:40
For the second point, you should know that the frequency range of the recorded sound is half of the sampling frequency, so a 16000 Hz sampling frequency means that only sounds from 0 to 8000 Hz are recorded. Because the human voice can in theory reach 8000 Hz at most, that's why I said you can go with 16000 Hz. Normally a voice is much lower than 8000 Hz, but don't try to go lower than 16 kHz, because that leaves too few samples per second to recreate the waveform precisely on playback. — Alexander, Sep 29 '14 at 00:05
