
I am trying to make a voice assistant that speaks with a Google Cloud TTS voice. I followed every step of a YouTube video, and I seem to be the only one with this problem: every time I run the code, the voice comes out squeaky or very high-pitched.

I have tried other modules such as PyAudio, sounddevice, and pydub, with no luck. I have also tried adjusting the pitch, frequency, and rate, but nothing got rid of the squeaky voice. I expected it to sound like the video, since the commenters there got it working without problems. Any help would be much appreciated.

The generated WAV file is 24,000 Hz and sounds correct on its own. The problem appears only when the audio is played through pygame.

import time

import google.cloud.texttospeech as tts
import pygame


def unique_languages_from_voices(voices):
    language_set = set()
    for voice in voices:
        for language_code in voice.language_codes:
            language_set.add(language_code)
    return language_set


def list_languages():
    client = tts.TextToSpeechClient()
    response = client.list_voices()
    languages = unique_languages_from_voices(response.voices)

    print(f" Languages: {len(languages)} ".center(60, "-"))
    for i, language in enumerate(sorted(languages)):
        print(f"{language:>10}", end="\n" if i % 5 == 4 else "")



def list_voices(language_code=None):
    client = tts.TextToSpeechClient()
    response = client.list_voices(language_code=language_code)
    voices = sorted(response.voices, key=lambda voice: voice.name)

    print(f" Voices: {len(voices)} ".center(60, "-"))
    for voice in voices:
        languages = ", ".join(voice.language_codes)
        name = voice.name
        gender = tts.SsmlVoiceGender(voice.ssml_gender).name
        rate = voice.natural_sample_rate_hertz
        print(f"{languages:<8} | {name:<24} | {gender:<8} | {rate:,} Hz")


def text_to_wav(voice_name: str, text: str):
    language_code = "-".join(voice_name.split("-")[:2])
    text_input = tts.SynthesisInput(text=text)
    voice_params = tts.VoiceSelectionParams(
        language_code=language_code, name=voice_name
    )
    audio_config = tts.AudioConfig(audio_encoding=tts.AudioEncoding.LINEAR16)

    client = tts.TextToSpeechClient()
    response = client.synthesize_speech(
        input=text_input, voice=voice_params, audio_config=audio_config
    )


    filename = f"{voice_name}.wav"
    with open(filename, "wb") as out:
        out.write(response.audio_content)
        print(f'Generated speech saved to "{filename}"')

    return response.audio_content

list_languages()
list_voices("en")

generated_speech = text_to_wav('en-US-News-K', 'Make yourself comfortable, Hacker. Stay a while.')
pygame.mixer.init(frequency=24000, buffer = 2048)
speech_sound = pygame.mixer.Sound(generated_speech)
speech_sound.play()
time.sleep(5)
pygame.mixer.quit()
Cluii
  • Did you try changing the frequency parameter in pygame.mixer? – Gonzalo Odiard Apr 04 '23 at 12:38
  • Your code sets a 12KHz sample rate in `pygame.mixer.init`. Unless your source audio is encoded with the same sample rate, this will result in the audio being played at the wrong speed and frequency. Usual sample rates are 44.1KHz (CD audio) and 48KHz (broadcast wave format). Most media players should be able to show you the sample rate of your input files. – l4mpi Apr 04 '23 at 12:58
  • Yes, I tried it with 24000 hz and others. Same result – Cluii Apr 04 '23 at 12:59
  • @Cluii what sample rate are the original files? Check that, then use the same value. – l4mpi Apr 04 '23 at 13:01
  • When I print the list of voices, it says it's 24,000 Hz. I have the mixer set to 24000 and it sounds the same: en-US | en-US-News-K | FEMALE | 24,000 Hz – Cluii Apr 04 '23 at 13:03
  • Maybe the generated audio is already crap, in which case it wouldn't matter much what pygame does afterwards. Does the generated wav file sound good when played in an external media player? – l4mpi Apr 04 '23 at 13:05
  • Yeah, that's what has me puzzled. I checked the wav file after it was generated. And it sounds perfect. But whenever processed back into pygame it doesn't sound good – Cluii Apr 04 '23 at 13:08
  • @Cluii I would double check in the external media player that the wav file really has a 24KHz sample rate, maybe it's reported incorrectly. Also check the bit depth, it's probably 16 bit which is the pygame default, but if it's not then add a `size` parameter to the mixer init call with the appropriate value. There is also an `allowedchanges` parameter which allows for on the fly format conversions, that might influence the sound as well - see [the docs](https://www.pygame.org/docs/ref/mixer.html#pygame.mixer.init) for details on the possible values. – l4mpi Apr 04 '23 at 13:29
  • I double-checked the sample rate and other parameters. Adding channels=1 and allowedchanges=AUDIO_ALLOW_FREQUENCY_CHANGE made the voice slower, but the pitch is the same. When I measure the pitch, it comes out at ~250 Hz. I think that is the problem, since it's abnormally high. I'm trying to adjust the pitch of the output, but it's throwing errors (I tried pydub and librosa). I'm not sure how to properly lower the pitch with my current code – Cluii Apr 04 '23 at 15:03
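
The format check suggested in the comments above (sample rate, bit depth, channel count) can also be done in code instead of in a media player. The following is only a minimal sketch using Python's standard `wave` module. `describe_wav` and `make_test_wav` are hypothetical helper names, and it assumes the bytes start with a WAV header, which Cloud Text-to-Speech documents for LINEAR16 output.

```python
import io
import wave


def describe_wav(wav_bytes: bytes) -> dict:
    """Read the WAV header from in-memory bytes and report its format."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "bits_per_sample": wav.getsampwidth() * 8,
            "sample_rate_hz": rate,
            "duration_s": frames / rate,
        }


def make_test_wav(rate: int = 24000, seconds: float = 1.0) -> bytes:
    """Build a silent mono 16-bit WAV in memory (a stand-in for real TTS output)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 2 bytes per sample = 16 bit
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(rate * seconds))
    return buffer.getvalue()


if __name__ == "__main__":
    print(describe_wav(make_test_wav()))
```

With the code from the question, this would be called as `describe_wav(generated_speech)`. If the reported channel count or sample rate differs from what the mixer was initialized with (note that `pygame.mixer.init` defaults to `channels=2`), that mismatch is a plausible source of wrong speed and pitch.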

1 Answer


After some serious digging, I found a fix for my issue. Credit goes to kasyc and IanHacker from 2011.

Below is a step-by-step guide, so it's easier for anyone else who has the same issue.

First, you need to install libsndfile for samplerate on your machine. I am on 64-bit Windows, so the link for that can be found here: win64

After installing it, install samplerate from your terminal:

pip install samplerate

Then import resample from samplerate:

from samplerate import resample

Once everything is installed and set up, change the end of the code to this:

generated_speech = text_to_wav('en-US-News-K', 'Make yourself comfortable, Hacker. Stay a while.')
# Mono mixer at the voice's 24,000 Hz rate; the audio device may change the frequency.
pygame.mixer.init(frequency=24000, buffer=512, allowedchanges=pygame.AUDIO_ALLOW_FREQUENCY_CHANGE, channels=1)
speech_sound = pygame.mixer.Sound(generated_speech)
# Copy the samples into a NumPy array so they can be resampled.
snd_array = pygame.sndarray.array(speech_sound)
# Resample with a ratio of 1.8, then cast back to the original integer dtype.
snd_resample = resample(snd_array, 1.8, "sinc_fastest").astype(snd_array.dtype)
snd_out = pygame.sndarray.make_sound(snd_resample)
snd_out.play()
time.sleep(5)
pygame.mixer.quit()

This resamples the sound created from generated_speech and converts the result back to a numpy.int16 array. Once the sound plays back, you will have to manually adjust the ratio passed to `resample` and the parameters passed to `pygame.mixer.init` to your liking. The values above are what made it sound perfect for me; yours may differ. Each change to the ratio or the parameters changes the speed and pitch of the voice.

Once you have adjusted the parameters to your liking, the sound output should be fixed!
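
To make the effect of the ratio concrete, here is a rough sketch of the arithmetic. It is not part of the fix itself. It assumes the resampled array is played at an unchanged mixer frequency, so an array that is `ratio` times longer takes `ratio` times as long to play and sounds `ratio` times lower. `playback_effect` is a hypothetical helper name.

```python
def playback_effect(n_samples: int, mixer_hz: int, ratio: float) -> dict:
    """Estimate how resampling by `ratio` changes playback at a fixed mixer rate."""
    out_samples = round(n_samples * ratio)  # resampling scales the sample count
    return {
        "out_samples": out_samples,
        "duration_s": out_samples / mixer_hz,  # a longer array plays for longer
        "pitch_factor": 1 / ratio,  # above 1 is higher, below 1 is lower
    }


if __name__ == "__main__":
    # Two seconds of audio at 24,000 Hz, stretched with the ratio used above.
    print(playback_effect(48000, 24000, 1.8))
```

A ratio close to 2 is roughly what you would expect to need if the audio is playing about twice as fast as it should, which is why the exact value depends on your own mixer settings.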

Cluii