
I have this 5-5-2 backpropagation neural network I'm training, and after reading this great article by LeCun I started putting into practice some of the ideas he suggests.

Currently I'm evaluating it with a 10-fold cross-validation algorithm I made myself, which goes basically like this:

for each epoch
  for each possible split (training, validation)
    train and validate
  end
  compute mean MSE across all k splits
end
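The loop above, as a compilable C skeleton. The names (`validate_fold`, `mean_validation_mse`) and the dummy MSE values are mine for illustration, not the poster's actual routines:

```c
#include <stddef.h>

#define K 10   /* number of cross-validation folds */

/* Placeholder so the skeleton compiles; the real code would run the
 * network on the held-out fold and return its MSE. */
static double validate_fold(int fold) { return (double)fold; }

/* One epoch of the k-fold scheme sketched above: train and validate on
 * every split, then average the MSE across all K splits. */
static double mean_validation_mse(void)
{
    double sum = 0.0;
    for (int fold = 0; fold < K; ++fold) {
        /* ... train on the K-1 folds excluding `fold` (omitted) ... */
        sum += validate_fold(fold);   /* MSE on the held-out fold */
    }
    return sum / K;
}
```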

My inputs and outputs are standardized (0-mean, variance 1) and I'm using a tanh activation function. All network algorithms seem to work properly: I used the same implementation to approximate the sin function and it does so pretty well.

Now, the question is as the title implies: should I standardize each train/validation set separately or do I simply need to standardize the whole dataset once?

Note that if I do the latter, the network doesn't produce meaningful predictions, but I'd prefer a more "theoretical" answer than just looking at the outputs.

By the way, I implemented it in C, but I'm also comfortable with C++.

mp85

2 Answers


You will most likely be better off standardizing each training set individually. The purpose of cross-validation is to get a sense for how well your algorithm generalizes. When you apply your network to new inputs, the inputs will not be ones that were used to compute your standardization parameters. If you standardize the entire data set at once, you are ignoring the possibility that a new input will fall outside the range of values over which you standardized.

So unless you plan to re-standardize every time you process a new input (which I'm guessing is unlikely), you should only compute the standardization parameters for the training set of the partition being evaluated. Furthermore, you should compute those parameters only on the training set of the partition, not the validation set (i.e., each of the 10-fold partitions will use 90% of the data to calculate standardization parameters).

bogatron
  • Thanks for your answer. So what about the validation set? Do I have to standardize the inputs/outputs with the standardization parameters computed for the training partition? I'm asking this because my raw outputs are really big numbers, leaving them as-is would make the tanh saturate. – mp85 Feb 11 '14 at 20:24
  • Yes, you would calculate the standardization parameters from your training set, then use those parameters to standardize the inputs from your validation set. – bogatron Feb 11 '14 at 20:27

So you assume the inputs are normally distributed, and you subtract the mean and divide by the standard deviation to get N(0,1)-distributed inputs?

Yes, I agree with @bogatron that you should standardize each training set separately, but I would more strongly say it's a "must" not to use the validation set data too. The problem is not values outside the range seen in the training set; that is fine, since the transformation to a standard normal is defined for any value. Rather, you can't compute the mean / standard deviation over all the data because you can't in any way let validation data leak into training, even just via this statistic.

It should further be emphasized that you use the mean from the training set with the validation set, not the mean from the validation set. It has to be the same transformation of features that was used during training. It would not be valid to transform the validation set differently.

Sean Owen
  • Right now I'm doing as you both suggested. I guess the explanation for this is the fact that computing the statistics on the validation set would introduce some sort of "statistical bias" in the validation, is that correct? – mp85 Feb 12 '14 at 02:18
  • And by the way, for the same reason I'm assuming the same argument would apply even if I was scaling my inputs/outputs in the [0,1] interval instead of their z-score. – mp85 Feb 12 '14 at 02:21
  • Yes you don't want to use the validation set at all in training, even in ways that don't seem to be 'leaking' info, because it often can in subtle ways -- that may have very little effect, but, best to have none. Same applies for any transformation based on statistics over the data. Although I would also say you can't necessarily just apply these normalizations to any data. For example, subtracting mean and dividing by standard deviation is meaningless for, say, an exponentially-distributed variable. It wasn't normal to begin with. – Sean Owen Feb 12 '14 at 09:03
  • Yeah, that was clear from the start. In this project I'm working, all input and output variables are normally distributed, so z-score looked like the obvious choice for this particular case. Thanks for the useful reply. – mp85 Feb 12 '14 at 10:58
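As the last comments note, the same fit-on-train rule carries over to scaling into the [0,1] interval. A hedged C sketch (names are mine): fit min/max on the training fold only; validation values outside that range will simply map outside [0,1], which is expected and should not be "fixed" by refitting.

```c
#include <stddef.h>

/* Min-max parameters fitted on the TRAINING fold only. */
typedef struct { double min, max; } mmparams;

static mmparams fit_minmax(const double *x, size_t n)
{
    mmparams p = { x[0], x[0] };
    for (size_t i = 1; i < n; ++i) {
        if (x[i] < p.min) p.min = x[i];
        if (x[i] > p.max) p.max = x[i];
    }
    return p;
}

/* A validation value outside the training range maps outside [0,1];
 * do NOT refit on the validation set to avoid that. */
static double scale_minmax(const mmparams *p, double x)
{
    return (x - p->min) / (p->max - p->min);
}
```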