
I am looking into the effect of the training sample size when doing a ridge (regularised) regression. I get a very strange graph when I plot the test error against the training set size (plot linked in the comments below: https://imgur.com/mgm8zMe).

The following code generates a training set and a test set and performs ridge regression for a low value of the regularization parameter.

The error and its standard deviation are plotted against the size of the training set.

Note that the dimension of the generated data is 10.

% settings
samplerange = 8:12;
maxiter = 100;
ntest = 300;            % size of the test set
dimension = 10;
gamma = 10^-5;          % small regularisation parameter
rng(2);
figure(1);

err = zeros(maxiter, 1);                   % per-iteration test errors ('error' would shadow the built-in)
meanerr = zeros(numel(samplerange), 1);    % mean test error per training size
stderr = zeros(numel(samplerange), 1);     % std of the test error per training size

for s = 1:numel(samplerange)
    samples = samplerange(s);
    for iter = 1:maxiter

        % training data: noisy linear model y = x*a + noise
        a = randn(dimension, 1);
        xtrain = randn(samples, dimension);
        ytrain = xtrain*a + randn(samples, 1);

        % test data from the same model
        xtest = randn(ntest, dimension);
        ytest = xtest*a + randn(ntest, 1);

        % ridge regression
        afit = (xtrain'*xtrain + gamma*length(ytrain)*eye(dimension)) \ (xtrain'*ytrain);
        % mean squared test error
        err(iter) = (ytest - xtest*afit)'*(ytest - xtest*afit) / length(ytest);
    end

    meanerr(s) = mean(err);
    stderr(s) = std(err);

    hold on;
    errorbar(samples, meanerr(s), stderr(s), '.');
    hold off;
end

meanerr
stderr

I get the following mean errors and standard deviations (one row per training size):

   samples   mean error        std
      8        14.0982     39.3148
      9        28.1679    126.0627
     10       201.4467    756.4289
     11        75.4921    568.7223
     12        16.2038     65.9008

Why does the error go up and then come back down? Each value is averaged over 100 iterations, so this isn't down to chance.

I believe it has something to do with the fact that the dimension of the data is 10: the peak sits exactly at samples = 10. It may be a numerical issue, since statistically the test error should of course decrease as the training set gets bigger...
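To check the link with the dimension, here is a minimal sketch of the same experiment with dimension 5 instead of 10 (everything else as above); if the dimension is the culprit, the bump should move to samples = 5:

% sketch: same experiment with dimension 5 to see whether the bump follows
dimension = 5; gamma = 10^-5; maxiter = 100; ntest = 300; rng(2);
for samples = 3:7
    err = zeros(maxiter, 1);
    for iter = 1:maxiter
        a = randn(dimension, 1);
        xtrain = randn(samples, dimension);
        ytrain = xtrain*a + randn(samples, 1);
        xtest = randn(ntest, dimension);
        ytest = xtest*a + randn(ntest, 1);
        afit = (xtrain'*xtrain + gamma*samples*eye(dimension)) \ (xtrain'*ytrain);
        err(iter) = sum((ytest - xtest*afit).^2) / ntest;
    end
    fprintf('samples = %d, mean error = %.2f\n', samples, mean(err));
end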

If any of you can shine a light on what is going on, I'd be grateful!

  • Here is a link to the plot produced: https://imgur.com/mgm8zMe It represents the test error (with standard deviation) vs the number of samples in the train set. Why would the variance go up and down? – asachet Jan 05 '15 at 15:33

1 Answer


In your iterative process, the only factor that changes is the number of training samples (from 8 to 12); on its own, such a small change should not affect your results this much.

I think what is causing the huge change in the error is this call: randn(samples,dimension). Have you looked at its output each time you use it? randn draws from a standard normal distribution, so it can occasionally generate some really large values that mess up your results (considering your sample range is quite small).
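For example, here is a quick way to look at the magnitudes randn can produce (a sketch; the number of draws is arbitrary):

% sketch: largest entry (in absolute value) produced by randn
% over many 10x10 draws
rng(2);
worst = 0;
for k = 1:1000
    x = randn(10, 10);
    worst = max(worst, max(abs(x(:))));
end
fprintf('largest |entry| over 1000 draws: %.2f\n', worst);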

You could try modifying the distribution of your random process to see what happens.
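For instance, a sketch of that substitution: draw the training inputs from a bounded uniform distribution instead of randn, keeping the rest of the experiment unchanged.

% sketch: bounded uniform inputs instead of standard normal ones
xtrain = 2*rand(samples, dimension) - 1;   % uniform on [-1, 1]
ytrain = xtrain*a + randn(samples, 1);     % same noisy linear model as before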

  • 'it could generate some really large numbers that could mess up your results' Yes, but the result is averaged over 100 iterations, so it should not be an issue. I tried with 1000 iterations and the result is the same. And why would really large numbers be consistently generated for samples=10? – asachet Jan 05 '15 at 13:35
  • hmmmmm... I spent about 30 minutes testing each stage of your code; it is really strange that this should happen. The afit seems to be very big for samplerange 10, but I cannot work out why it does this for 10 exclusively... – GameOfThrows Jan 05 '15 at 15:29
  • I am almost certain that this is linked to the dimension (which is also 10). When you change the dimension, the bump in the test error moves along... Maybe mldivide (\) behaves differently in this case? Thank you for your time! – asachet Jan 05 '15 at 15:39
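A quick sketch along the lines of the last two comments (not from either commenter): tracking the average size of afit per training size, with the same settings as the question, makes the blow-up at samples = dimension visible.

% sketch: average norm of the fitted coefficients per training size,
% following the comment that afit is very big at samplerange 10
dimension = 10; gamma = 10^-5; maxiter = 100; rng(2);
for samples = 8:12
    nrm = zeros(maxiter, 1);
    for iter = 1:maxiter
        xtrain = randn(samples, dimension);
        ytrain = xtrain*randn(dimension, 1) + randn(samples, 1);
        afit = (xtrain'*xtrain + gamma*samples*eye(dimension)) \ (xtrain'*ytrain);
        nrm(iter) = norm(afit);
    end
    fprintf('samples = %2d, mean norm(afit) = %.2f\n', samples, mean(nrm));
end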