What are the recommended mel-spectrogram normalization techniques for training a neural network for singing voice synthesis? My configuration settings are:
n_fft = 2048, hop_length = 512, n_mels = 80
I implemented normalization using the code below (taken from the Whisper repo), but it does not give satisfactory results:
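import torch
log_spec = torch.clamp(mel_spec, min=1e-10).log10()  # floor at 1e-10, then take log10
log_spec = torch.maximum(log_spec, log_spec.max() - 8.0)  # keep only the top 8 log10 units below the peak
log_spec = (log_spec + 4.0) / 4.0  # fixed shift and scale; does not depend on the peak value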
I expected the output to fall in the range [0, 1], but it doesn't. Could you suggest a suitable mel-spectrogram normalization technique?
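For reference, here is a minimal sketch of why the snippet above does not land in [0, 1], and one possible variant that does. After the `torch.maximum` step the log values span `[peak - 8, peak]`, so the final line maps them to `[(peak - 4) / 4, (peak + 4) / 4]`, an interval of width 2 whose position depends on the peak. The function names `whisper_style_norm` and `unit_range_norm`, the `dynamic_range` parameter, and the random test input are my own assumptions for illustration, not part of the Whisper code.

```python
import torch


def whisper_style_norm(mel_spec: torch.Tensor, dynamic_range: float = 8.0) -> torch.Tensor:
    """Same arithmetic as the snippet in the question (hypothetical wrapper name)."""
    log_spec = torch.clamp(mel_spec, min=1e-10).log10()
    log_spec = torch.maximum(log_spec, log_spec.max() - dynamic_range)
    # Output interval is [(peak - 4) / 4, (peak + 4) / 4], not [0, 1].
    return (log_spec + 4.0) / 4.0


def unit_range_norm(mel_spec: torch.Tensor, dynamic_range: float = 8.0) -> torch.Tensor:
    """Assumed variant: rescale the clipped log spectrogram to [0, 1] per input."""
    log_spec = torch.clamp(mel_spec, min=1e-10).log10()
    peak = log_spec.max()
    floor = peak - dynamic_range
    log_spec = torch.maximum(log_spec, floor)
    # Map [floor, peak] onto [0, 1]; the clamp guards against float rounding.
    return ((log_spec - floor) / dynamic_range).clamp(0.0, 1.0)


if __name__ == "__main__":
    # Stand-in for a real mel power spectrogram with n_mels = 80 (random, illustration only).
    mel = torch.rand(80, 100) * 100 + 1e-3
    w = whisper_style_norm(mel)
    u = unit_range_norm(mel)
    print("whisper-style range:", w.min().item(), w.max().item())
    print("unit-range range:   ", u.min().item(), u.max().item())
```

Note that `unit_range_norm` uses the peak of each input, so it discards absolute loudness. If that matters for synthesis, a fixed floor and ceiling chosen from the whole training set may be a better fit than a per-utterance peak.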