
What are the recommended mel-spectrogram normalization techniques for training a neural network for singing voice synthesis? My configuration is n_fft = 2048, hop_length = 512, n_mels = 80.

I have implemented normalization using the code below (taken from the Whisper repo), but it is not giving satisfactory results.

    log_spec = torch.clamp(mel_spec, min=1e-10).log10()
    log_spec = torch.maximum(log_spec, log_spec.max() - 8.0)
    log_spec = (log_spec + 4.0) / 4.0

I expected the output to lie between 0 and 1, but it does not. Please suggest a suitable mel-spectrogram normalization technique.
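For reference, here is a minimal sketch of what I think is happening, plus one possible variant that does land in [0, 1]. After the `torch.maximum` step the log values span at most 8.0, so `(log_spec + 4.0) / 4.0` yields a range at most 2.0 wide, positioned according to the spectrogram's peak, not a fixed [0, 1]. The function names `whisper_style_norm` and `minmax_log_mel` and the `dynamic_range` parameter are my own hypothetical names, and the min-max rescaling is an assumption on my part, not something taken from the Whisper repo.

```python
import torch


def whisper_style_norm(mel_spec: torch.Tensor) -> torch.Tensor:
    """The snippet from the question, wrapped in a function for comparison."""
    log_spec = torch.clamp(mel_spec, min=1e-10).log10()
    # Floor everything more than 8 log10 units below the peak.
    log_spec = torch.maximum(log_spec, log_spec.max() - 8.0)
    # Output spans [(peak - 4) / 4, (peak + 4) / 4], i.e. a width of at most 2.
    return (log_spec + 4.0) / 4.0


def minmax_log_mel(mel_spec: torch.Tensor, dynamic_range: float = 8.0) -> torch.Tensor:
    """Hypothetical variant: rescale the clipped log-mel range to [0, 1]."""
    log_spec = torch.clamp(mel_spec, min=1e-10).log10()
    peak = log_spec.max()
    floor = peak - dynamic_range
    # Same flooring as above, but relative to a configurable dynamic range.
    log_spec = torch.maximum(log_spec, floor)
    # Map [floor, peak] linearly onto [0, 1]; clamp guards against rounding error.
    return ((log_spec - floor) / dynamic_range).clamp(0.0, 1.0)


if __name__ == "__main__":
    torch.manual_seed(0)
    mel = torch.rand(80, 100) * 1000.0  # stand-in for a real power mel-spectrogram
    a = whisper_style_norm(mel)
    b = minmax_log_mel(mel)
    print("whisper-style range:", a.min().item(), a.max().item())
    print("min-max range:", b.min().item(), b.max().item())
```

Note that this sketch uses the per-clip peak, so the scaling differs from one utterance to the next. I am not sure whether that is desirable for synthesis, where the mapping has to be inverted at inference time.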
