I'm currently working on recurrent neural networks for text-to-speech but I'm stuck at one point.
I have some input files containing characteristic text features (phonemes, etc.) with dimension 490. The output files are mgc (60-d), bap (25-d), and lf0 (1-d). The mgc and bap streams are fine: there are no big gaps between values, and I can train them in reasonable time with reasonable accuracy. Inputs and outputs are sequential and frame-aligned, e.g. if an input has shape (300, 490), then mgc, bap, and lf0 have shapes (300, 60), (300, 25), and (300, 1), respectively.
My problem is with lf0 (the log of the fundamental frequency, I suppose). The values look like, say, [0.23, 1.2, 0.54, 3.4, -10e9, -10e9, -10e9, 3.2, 0.25]. I tried training it with MSE, but the error is huge and not decreasing at all.
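To illustrate what I mean, here is a tiny pure-Python sketch of why the MSE stays huge. I'm assuming the -10e9 entries are sentinels for unvoiced frames (my guess); the prediction values below are made up for the demonstration:

```python
# Toy lf0 target: ordinary log-F0 values plus the huge negative
# entries that (I assume) mark unvoiced frames.
target = [0.23, 1.2, 0.54, 3.4, -10e9, -10e9, -10e9, 3.2, 0.25]
pred = [0.0] * len(target)  # hypothetical network output near zero

def mse(p, t):
    return sum((a - b) ** 2 for a, b in zip(p, t)) / len(t)

full = mse(pred, target)  # dominated by the sentinel frames

# Same loss computed only over the (assumed) voiced frames:
pairs = [(p, t) for p, t in zip(pred, target) if t > -1e9]
masked = mse([p for p, _ in pairs], [t for _, t in pairs])

print(full)    # astronomically large: the sentinels swamp everything
print(masked)  # small, reflects only the voiced frames
```

So even a perfect fit on the voiced frames would leave the overall MSE astronomically large, which matches the "not decreasing at all" behaviour I see.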
I'd like to hear any suggestion for this problem. I'm open to anything.
PS: I'm using 2 GRU layers with 256 or 512 units each.
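For concreteness, here is a minimal PyTorch sketch of the kind of setup I mean (the class name and linear output head are my own; only the two stacked GRU layers, the 490-d input, the 1-d lf0 output, and the 256-unit option come from the description above):

```python
import torch
import torch.nn as nn

# Hypothetical sketch: two stacked GRU layers mapping 490-d
# linguistic features to 1-d lf0 per frame.
class LF0Net(nn.Module):
    def __init__(self, in_dim=490, hidden=256, out_dim=1):
        super().__init__()
        self.gru = nn.GRU(in_dim, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, out_dim)  # per-frame projection

    def forward(self, x):
        h, _ = self.gru(x)  # h: (batch, frames, hidden)
        return self.out(h)  # (batch, frames, 1)

model = LF0Net()
x = torch.randn(1, 300, 490)  # one utterance of 300 frames
y = model(x)
print(y.shape)  # torch.Size([1, 300, 1])
```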