
I've been working with CNNs recently and am having a hard time with what I believe is overfitting. Specifically, even though my training error converges to a minimum, my validation error refuses to drop. My input data is 512 x 650 x 1 x 4000 (2D data, 4000 samples), and there are only two classes to distinguish between (class A and class B). I'm aware that I will eventually need many more samples, but for now I would just like to see my validation error decline even a little before I invest in generating more data.

My networks have all been around 60-70 layers deep and have included the following types of layers:

Block Example

Convolutional Layers [3 x 3] filter size, stride [1 x 1], padding [1 1 1 1]

ReLU Layers (Non-linearity)

Batch normalization (a tremendous help to training convergence and speed)

Max Pooling Layers [2 x 2] filter size, stride [2 x 2], padding [0 0 0 0]

I then repeat this "block" until my input data is reduced to a 1 x 1 x N size, at which point I run it through a few fully connected layers and then into a softmax.

My actual MatConvNet code is below for inspection, and the output plots are attached. For the plots, blue represents my training error and orange represents my validation error. The linked plot is from my most recent run of the code below.

My Questions:

1) How does one know what filter sizes to use for their data? I know it's an empirical process, but surely there is some kind of intuition behind it? I've read papers (VGGNet, among others) on using many small [3 x 3] filters, but even after designing a 70-layer network with this in mind, there is still no decline in validation error.

2) I have tried dropout layers, given their popularity for reducing overfitting. I placed the dropout layers throughout my network after the ReLU and pooling layers in the "block" shown above, i.e. between all convolutional layers. Unfortunately it had no effect on my validation error. Next I tried using it only after the fully connected layers, since that's where the most neurons (or feature maps) are being created in my architecture, and still no luck. I've read the paper on dropout. Should I give up on using it? Is there once again "a trick" to this?

3) If I try a smaller network (I've read that's a decent way to deal with overfitting), how do I effectively reduce the size of my data as it moves through the network? Just max pooling?

ANY suggestions would be wonderful.

Again, thank you all for reading this long question. I assure you I've done my research, and found that asking here might help me more in the long run.

CNN Error Output plot

MatConvNet Code (Matlab Toolbox for CNN Design)

opts.train.batchSize = 25;                                          
opts.train.numEpochs = 200 ;                                        
opts.train.continue = true ;                                        
opts.train.gpus = [1] ;                                             
opts.train.learningRate = 1e-3;                                     
opts.train.weightDecay  = 0.04;                                     
opts.train.momentum = 0.9;                                      
opts.train.expDir = 'epoch_data';                                   
opts.train.numSubBatches = 1;                                       

bopts.useGpu = numel(opts.train.gpus) >  0 ;                        

load('imdb4k.mat');                                                 
net = dagnn.DagNN() ;                                               

% Block #1

net.addLayer('conv1', dagnn.Conv('size', [3 3 1 64], 'hasBias', true, 'stride', [1, 1], 'pad', [1 1 1 1]), {'input'}, {'conv1'},  {'conv1f'  'conv1b'});

net.addLayer('relu1', dagnn.ReLU(), {'conv1'}, {'relu1'}, {});

net.addLayer('bn1', dagnn.BatchNorm('numChannels', 64), {'relu1'}, {'bn1'}, {'bn1f', 'bn1b', 'bn1m'});

net.addLayer('pool1', dagnn.Pooling('method', 'max', 'poolSize', [2, 2], 'stride', [2 2], 'pad', [0 0 0 0]), {'bn1'}, {'pool1'}, {});

% Block #2

net.addLayer('conv2', dagnn.Conv('size', [3 3 64 64], 'hasBias', true, 'stride', [1, 1], 'pad', [1 1 1 1]), {'pool1'}, {'conv2'},  {'conv2f'  'conv2b'});

net.addLayer('relu2', dagnn.ReLU(), {'conv2'}, {'relu2'}, {});

net.addLayer('bn2', dagnn.BatchNorm('numChannels', 64), {'relu2'}, {'bn2'}, {'bn2f', 'bn2b', 'bn2m'});

net.addLayer('pool2', dagnn.Pooling('method', 'max', 'poolSize', [2, 2], 'stride', [2 2], 'pad', [0 0 0 0]), {'bn2'}, {'pool2'}, {});

% Block #3

net.addLayer('conv3', dagnn.Conv('size', [3 3 64 128], 'hasBias', true, 'stride', [1, 1], 'pad', [1 1 1 1]), {'pool2'}, {'conv3'},  {'conv3f'  'conv3b'}); 

net.addLayer('relu3', dagnn.ReLU(), {'conv3'}, {'relu3'}, {});

net.addLayer('bn3', dagnn.BatchNorm('numChannels', 128), {'relu3'}, {'bn3'}, {'bn3f', 'bn3b', 'bn3m'});

net.addLayer('pool3', dagnn.Pooling('method', 'max', 'poolSize', [2, 2], 'stride', [2 2], 'pad', [0 0 0 0]), {'bn3'}, {'pool3'}, {});

% Block #4

net.addLayer('conv4', dagnn.Conv('size', [3 3 128 128], 'hasBias', true, 'stride', [1, 1], 'pad', [1 1 1 1]), {'pool3'}, {'conv4'},  {'conv4f'  'conv4b'}); 

net.addLayer('relu4', dagnn.ReLU(), {'conv4'}, {'relu4'}, {});

net.addLayer('bn4', dagnn.BatchNorm('numChannels', 128), {'relu4'}, {'bn4'}, {'bn4f', 'bn4b', 'bn4m'});

net.addLayer('pool4', dagnn.Pooling('method', 'max', 'poolSize', [2, 2], 'stride', [2 2], 'pad', [0 0 0 0]), {'bn4'}, {'pool4'}, {});

% Block #5

net.addLayer('conv5', dagnn.Conv('size', [3 3 128 256], 'hasBias', true, 'stride', [1, 1], 'pad', [1 1 1 1]), {'pool4'}, {'conv5'},  {'conv5f'  'conv5b'});

net.addLayer('relu5', dagnn.ReLU(), {'conv5'}, {'relu5'}, {});

net.addLayer('bn5', dagnn.BatchNorm('numChannels', 256), {'relu5'}, {'bn5'}, {'bn5f', 'bn5b', 'bn5m'});

net.addLayer('pool5', dagnn.Pooling('method', 'max', 'poolSize', [2, 2], 'stride', [2 2], 'pad', [0 0 0 0]), {'bn5'}, {'pool5'}, {});

% Block #6

net.addLayer('conv6', dagnn.Conv('size', [3 3 256 256], 'hasBias', true, 'stride', [1, 1], 'pad', [1 1 1 1]), {'pool5'}, {'conv6'},  {'conv6f'  'conv6b'}); 

net.addLayer('relu6', dagnn.ReLU(), {'conv6'}, {'relu6'}, {});

net.addLayer('bn6', dagnn.BatchNorm('numChannels', 256), {'relu6'}, {'bn6'}, {'bn6f', 'bn6b', 'bn6m'});

net.addLayer('pool6', dagnn.Pooling('method', 'max', 'poolSize', [2, 2], 'stride', [2 2], 'pad', [0 0 0 0]), {'bn6'}, {'pool6'}, {});

% Block #7

net.addLayer('conv7', dagnn.Conv('size', [3 3 256 512], 'hasBias', true, 'stride', [1, 1], 'pad', [1 1 1 1]), {'pool6'}, {'conv7'},  {'conv7f'  'conv7b'});

net.addLayer('relu7', dagnn.ReLU(), {'conv7'}, {'relu7'}, {});

net.addLayer('bn7', dagnn.BatchNorm('numChannels', 512), {'relu7'}, {'bn7'}, {'bn7f', 'bn7b', 'bn7m'});

net.addLayer('pool7', dagnn.Pooling('method', 'max', 'poolSize', [2, 2], 'stride', [2 2], 'pad', [0 0 0 0]), {'bn7'}, {'pool7'}, {});

% Block #8

net.addLayer('conv8', dagnn.Conv('size', [3 3 512 512], 'hasBias', true, 'stride', [1, 1], 'pad', [1 1 1 1]), {'pool7'}, {'conv8'},  {'conv8f'  'conv8b'}); 

net.addLayer('relu8', dagnn.ReLU(), {'conv8'}, {'relu8'}, {});

net.addLayer('bn8', dagnn.BatchNorm('numChannels', 512), {'relu8'}, {'bn8'}, {'bn8f', 'bn8b', 'bn8m'});

net.addLayer('pool8', dagnn.Pooling('method', 'max', 'poolSize', [2, 2], 'stride', [1 2], 'pad', [0 0 0 0]), {'bn8'}, {'pool8'}, {});

% Block #9

net.addLayer('conv9', dagnn.Conv('size', [3 3 512 512], 'hasBias', true, 'stride', [1, 1], 'pad', [1 1 1 1]), {'pool8'}, {'conv9'},  {'conv9f'  'conv9b'});

net.addLayer('relu9', dagnn.ReLU(), {'conv9'}, {'relu9'}, {});

net.addLayer('bn9', dagnn.BatchNorm('numChannels', 512), {'relu9'}, {'bn9'}, {'bn9f', 'bn9b', 'bn9m'});

net.addLayer('pool9', dagnn.Pooling('method', 'max', 'poolSize', [2, 2], 'stride', [2 2], 'pad', [0 0 0 0]), {'bn9'}, {'pool9'}, {});

% Incorporate MLP

net.addLayer('fc1', dagnn.Conv('size', [1 1 512 1000], 'hasBias', true, 'stride', [1, 1], 'pad', [0 0 0 0]), {'pool9'}, {'fc1'},  {'conv15f'  'conv15b'});

net.addLayer('relu10', dagnn.ReLU(), {'fc1'}, {'relu10'}, {});

net.addLayer('bn10', dagnn.BatchNorm('numChannels', 1000), {'relu10'}, {'bn10'}, {'bn10f', 'bn10b', 'bn10m'});

net.addLayer('classifier', dagnn.Conv('size', [1 1 1000 2], 'hasBias', true, 'stride', [1, 1], 'pad', [0 0 0 0]), {'bn10'}, {'classifier'},  {'conv16f'  'conv16b'});

net.addLayer('prob', dagnn.SoftMax(), {'classifier'}, {'prob'}, {});

% The dagnn.Loss computes the loss incurred by the prediction scores X given the categorical labels

net.addLayer('objective', dagnn.Loss('loss', 'softmaxlog'), {'prob', 'label'}, {'objective'}, {});

net.addLayer('error', dagnn.Loss('loss', 'classerror'), {'prob','label'}, {'error'});

1 Answer


First of all, your network seems too complex for the data; you would need roughly two orders of magnitude more samples to see any kind of result from such a complex network (and that is assuming the problem itself is that complex). Try a much smaller network and see if the results improve. Answering your questions:

1) Filter sizes are indeed empirical, but 1x1, 3x3, and 5x5 filters are the most commonly used. A 70-layer network does not make sense unless the problem is very complex and you have huge amounts of data, and to train something that deep successfully you would also have to look into ResNets.
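For intuition on the small-filter preference: two stacked 3x3 convolutions cover the same 5x5 receptive field as a single 5x5 convolution, but with fewer weights (2 * 9 * C^2 versus 25 * C^2 for C channels) and an extra non-linearity in between. A minimal sketch in the same MatConvNet style as your code (layer names and the 64-channel width are illustrative):

% Two stacked 3x3 convs: same receptive field as one 5x5 conv,
% fewer parameters, plus an extra ReLU between them.
net.addLayer('convA', dagnn.Conv('size', [3 3 64 64], 'hasBias', true, 'stride', [1 1], 'pad', [1 1 1 1]), {'in'}, {'convA'}, {'convAf' 'convAb'});
net.addLayer('reluA', dagnn.ReLU(), {'convA'}, {'reluA'}, {});
net.addLayer('convB', dagnn.Conv('size', [3 3 64 64], 'hasBias', true, 'stride', [1 1], 'pad', [1 1 1 1]), {'reluA'}, {'convB'}, {'convBf' 'convBb'});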

2) Dropout is most often used in the fully connected layers. You can also look into DropConnect. In general there is no need to use dropout between conv layers.
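As a minimal sketch using the layer names from your code (the 0.5 rate is the common default, not a tuned value), dagnn.DropOut can be wired in between the FC stack and the classifier:

% Dropout between the fully connected layers, where most parameters live.
net.addLayer('drop1', dagnn.DropOut('rate', 0.5), {'bn10'}, {'drop1'}, {});
% The classifier then reads from 'drop1' instead of 'bn10'.
net.addLayer('classifier', dagnn.Conv('size', [1 1 1000 2], 'hasBias', true, 'stride', [1, 1], 'pad', [0 0 0 0]), {'drop1'}, {'classifier'}, {'conv16f' 'conv16b'});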

3) Reducing the size of the intermediate maps is easily achieved by the conv + max-pooling stacks. You don't have to reduce them to 1x1 before the MLP; you can feed the maps in directly once they reach, say, 8x8. Try using more than one FC layer. Also reduce the width of the network (the number of filters per layer) to reduce the model complexity.
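For example, if the maps entering the MLP were 8x8 (with a 512x650 input the actual size will differ, so substitute the real map dimensions), the first FC layer can absorb the remaining spatial extent directly. A sketch with illustrative layer names and channel counts:

% First FC layer implemented as an 8x8 conv over the final feature maps,
% so there is no need to pool down to 1x1 first. Dimensions are illustrative.
net.addLayer('fc1', dagnn.Conv('size', [8 8 128 256], 'hasBias', true, 'stride', [1 1], 'pad', [0 0 0 0]), {'pool4'}, {'fc1'}, {'fc1f' 'fc1b'});
net.addLayer('relu_fc1', dagnn.ReLU(), {'fc1'}, {'relu_fc1'}, {});
net.addLayer('fc2', dagnn.Conv('size', [1 1 256 2], 'hasBias', true, 'stride', [1 1], 'pad', [0 0 0 0]), {'relu_fc1'}, {'fc2'}, {'fc2f' 'fc2b'});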

Overall you have very little data, which is definitely not going to work for deep models. Fine-tuning a pretrained model might give you better results. It all depends on the data itself and the task at hand. Also remember that networks like VGG are trained on 1000 different classes with millions of images, which is a very complex problem.
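As a rough sketch of the fine-tuning route (assuming one of the MatConvNet model-zoo files such as 'imagenet-vgg-f.mat'; your single-channel 512x650 inputs would first need to be resized and replicated to three channels to match the pretrained input format):

% Load a pretrained SimpleNN model and convert it to the DAG wrapper.
pre = load('imagenet-vgg-f.mat');
net = dagnn.DagNN.fromSimpleNN(pre);
% Before fine-tuning, replace the final 1000-way classifier with a
% 2-class layer; the exact layer and variable names depend on the model.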

  • Okay, this is helpful information! Thank you sir. So for a smaller network, what would you recommend as the number of layers? I understand that it will ultimately depend on my data, but maybe something around 30 layers deep will suffice? Also, all the examples I've seen have always reduced the size to a 1 x 1, so I just assumed there was some theory behind it. Lastly, how many data samples is an average amount? 50,000? 100,000? – Charles Jun 06 '17 at 16:51
  • One more question: Would a small network with less complexity just follow a simple conv, ReLU, batch norm, pooling iteration? And for the conv filters, could I still employ [5 x 5] filters, or should I look into much larger ones, such as [15 x 15]? Lastly, you made a comment that I should reduce the width of my network. By width, did you mean neurons per layer? That would be the same as the number of feature maps produced by the layer, which in my mind is "width". Is that what you were saying? – Charles Jun 06 '17 at 17:01
  • Since you are beginning, I would suggest you study the architectures of LeNet, AlexNet, the VGG nets, Inception (GoogLeNet), and ResNets. See which datasets they worked on and how many layers they used relative to the complexity of the problem. Also explicitly calculate each layer's size by hand and see what size the input is reduced down to. AlexNet, for example, had feature maps of size 13x13x128, I think, before the first fully connected layer. – jashojit mukherjee Jun 07 '17 at 05:19
  • There is no average amount in terms of data. This answer may be of use: [link](https://stats.stackexchange.com/questions/127111/how-much-data-do-you-need-for-a-convolutional-neural-network). Also feel free to experiment with filter sizes, but I have not seen any successful networks use filters that large. It's usually better to stack a few 3x3 or 5x5 filters than to use one single larger filter. And yes, by width I did mean the number of neurons/kernels per conv layer. – jashojit mukherjee Jun 07 '17 at 05:22
  • Thank you again! I will definitely make use of that advice. One more thing: I linked a picture of the output plot that was generated from my CNN training. Have you ever used MatConvNet? I'm having a hard time truly interpreting that graph. – Charles Jun 08 '17 at 01:24
  • Specifically, how is one supposed to interpret the graph output from the DAG wrapper? There is very little documentation on this. What does energy vs. epochs mean? What is the difference between ERROR and OBJECTIVE? Does MatConvNet try to minimize the objective? How does MatConvNet handle these output graphs in the DAG wrapper vs. the Simple wrapper? – Charles Jun 08 '17 at 01:26
  • I am unsure about this. I had only used MatConvNet pre-DAG, at its nascent stage. Now I am using TensorFlow and Caffe mostly. – jashojit mukherjee Jun 12 '17 at 05:20