
I'm currently working on recurrent neural networks for text-to-speech, but I'm stuck at one point.

I have some input files containing characteristic text features (phonemes, etc.) with dimension 490. The output files are mgc (60-d), bap (25-d) and lf0 (1-d). The mgc and bap files are fine because there are no big gaps between values; I can train on them with reasonable time and accuracy. Inputs and outputs are sequential and properly aligned, e.g. if an input has shape (300, 490), then the shapes of mgc, bap and lf0 are (300, 60), (300, 25) and (300, 1), respectively.
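
To make the data layout concrete, here is a minimal NumPy sketch of the shapes involved (variable names are just for illustration):

```python
import numpy as np

T = 300                      # frames in one example utterance
x   = np.zeros((T, 490))     # linguistic/phoneme features (input)
mgc = np.zeros((T, 60))      # spectral features (output)
bap = np.zeros((T, 25))      # aperiodicity features (output)
lf0 = np.zeros((T, 1))       # log fundamental frequency (output)

# all streams are frame-aligned: row t of x corresponds to row t of each output
assert x.shape[0] == mgc.shape[0] == bap.shape[0] == lf0.shape[0]
```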

My problem is with lf0 (the log of the fundamental frequency, I suppose). The values look like, say, [0.23, 1.2, 0.54, 3.4, -10e9, -10e9, -10e9, 3.2, 0.25]. I tried to train it using MSE, but the error is far too high and does not decrease at all.

[plot of lf0]
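
To give a sense of the scale problem (this is not my training code, just a rough illustration), the -10e9 markers completely dominate a plain MSE:

```python
import numpy as np

lf0 = np.array([0.23, 1.2, 0.54, 3.4, -10e9, -10e9, -10e9, 3.2, 0.25])
pred = np.zeros_like(lf0)                 # even a trivial all-zero prediction

sq_err = (lf0 - pred) ** 2
print(sq_err[np.abs(lf0) < 100].max())    # ~11.6 for the voiced frames
print(sq_err.max())                       # 1e20 for each -10e9 marker
print(sq_err.mean())                      # ~3.3e19, dominated entirely by the markers
```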

I'd like to hear any suggestions for this problem. I'm open to anything.

PS: I'm using 2 GRU layers with 256 or 512 units each.
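
For reference, the model is roughly the following (a minimal Keras sketch; apart from the two GRU layers of 256 units, the output layer and optimizer shown here are illustrative assumptions, not my exact setup):

```python
from keras.models import Sequential
from keras.layers import GRU, TimeDistributed, Dense

model = Sequential()
model.add(GRU(256, return_sequences=True, input_shape=(None, 490)))  # 490-d linguistic features per frame
model.add(GRU(256, return_sequences=True))
model.add(TimeDistributed(Dense(1)))   # per-frame lf0 prediction
model.compile(loss='mse', optimizer='adam')
```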

MGoksu
    Have you tried applying a log transformation to your output (taking a log(y) instead of y)? – Marcin Możejko Jul 21 '16 at 10:54
  • @Marcin Możejko I thought it was already log'ed but I'll give it a try and update – MGoksu Jul 21 '16 at 11:37
  • @MarcinMożejko I did, but there are negative numbers in y such as -10e9, so log(y) yielded complex numbers, and I don't know whether it's possible to train on complex numbers – MGoksu Jul 21 '16 at 12:06
  • You can also try using the [`min-max scaler`](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html) which normalizes your data by scaling in the range [0,1], although the Standard Deviation obviously becomes smaller. – Nickil Maveli Jul 21 '16 at 12:22
  • I would also try sgn(x) * abs(x)^(1/n), where n is a big number (like 10). – Marcin Możejko Jul 21 '16 at 13:43
  • @MarcinMożejko I tried sgn(x) * abs(x)^(1/n) as well, but small errors in training become huge after the output is de-normalized – MGoksu Jul 24 '16 at 15:34
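
For reference, here is a minimal NumPy sketch of the sgn(x) * abs(x)^(1/n) transform suggested in the comments above, together with the de-normalization blow-up mentioned in the last comment (n = 10 as suggested; purely illustrative):

```python
import numpy as np

n = 10                                  # as suggested in the comments

def compress(y):
    return np.sign(y) * np.abs(y) ** (1.0 / n)

def expand(y):
    return np.sign(y) * np.abs(y) ** n  # inverse of compress

lf0 = np.array([0.23, 1.2, 0.54, 3.4, -10e9, -10e9, -10e9, 3.2, 0.25])
z = compress(lf0)                       # -10e9 maps to -10, voiced values stay near 1

# a small error on the compressed scale explodes after de-normalization:
err = expand(z + 0.01) - lf0
print(err)                              # on the order of 1e8 for the unvoiced frames, small for the voiced ones
```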

0 Answers