
Background:

I'm tagging TensorFlow since Keras runs on top of it, and this is more of a general deep learning question.

I have been working on the Kaggle Digit Recognizer problem and used Keras to train CNN models for the task. The model below has the original CNN structure I used for this competition, and it performed okay.

from tensorflow.keras import layers, models

def build_model1():
    model = models.Sequential()

    model.add(layers.Conv2D(32, (3, 3), padding="same", activation="relu", input_shape=(28, 28, 1)))
    model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Dropout(0.25))

    model.add(layers.Conv2D(64, (3, 3), padding="same", activation="relu"))
    model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Dropout(0.25))

    model.add(layers.Conv2D(64, (3, 3), padding="same", activation="relu"))
    model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Dropout(0.25))

    model.add(layers.Flatten())
    model.add(layers.Dense(64, activation="relu"))
    model.add(layers.Dropout(0.5))
    model.add(layers.Dense(10, activation="softmax"))

    return model

Then I read some other notebooks on Kaggle and borrowed another CNN structure (copied below), which works much better than the one above: it achieved higher accuracy, a lower error rate, and took many more epochs before it started overfitting the training data.

def build_model2():
    model = models.Sequential()

    model.add(layers.Conv2D(32, (5, 5), padding="same", activation="relu", input_shape=(28, 28, 1)))
    model.add(layers.Conv2D(32, (5, 5), padding="same", activation="relu"))
    model.add(layers.MaxPool2D((2, 2)))
    model.add(layers.Dropout(0.25))

    model.add(layers.Conv2D(64, (3, 3), padding="same", activation="relu"))
    model.add(layers.Conv2D(64, (3, 3), padding="same", activation="relu"))
    model.add(layers.MaxPool2D(pool_size=(2, 2), strides=(2, 2)))
    model.add(layers.Dropout(0.25))

    model.add(layers.Flatten())
    model.add(layers.Dense(256, activation="relu"))
    model.add(layers.Dropout(0.5))
    model.add(layers.Dense(10, activation="softmax"))

    return model

Question:

Is there any intuition or explanation behind the better performance of the second CNN structure? What is it that makes stacking 2 Conv2D layers better than just using 1 Conv2D layer before max pooling and dropout? Or is there something else that contributes to the result of the second model?

Thank y'all for your time and help.

Nahua Kang

1 Answer


The main difference between these two approaches is that the latter (2 convs) has more flexibility in expressing non-linear transformations without losing information. Max pooling removes information from the signal, and dropout forces a distributed representation; both therefore make it harder to propagate information through the network. If, for a given problem, a highly non-linear transformation has to be applied to the raw data, stacking multiple convolutions (each with a ReLU) simply makes it easier to learn.

Also note that you are comparing a model with 3 max poolings against a model with only 2, so the second one will potentially lose less information. Another thing is that the second model has a much bigger fully connected part at the end, while the first one's is tiny: 64 neurons with 0.5 dropout means that on average only about 32 neurons are active during training, which is a very small layer (see the sketch after the summary below). To sum up:

  1. These architectures differ in many aspects, not just in the stacking of conv layers.
  2. Stacking convolutional layers usually leads to less information being lost in processing; see for example the "all convolutional" architectures.
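To see both effects concretely, here is a minimal sketch (assuming the build_model1/build_model2 definitions from the question have already been run, along with from tensorflow.keras import layers, models) that prints the two architectures side by side:

# Compare the two architectures layer by layer.
for build in (build_model1, build_model2):
    model = build()
    model.summary()

# With "same" padding, only the pooling layers shrink the feature maps:
# model 1 pools 28 -> 14 -> 7 -> 3, so Flatten feeds 3*3*64 = 576 values
# into a 64-unit dense layer; model 2 pools 28 -> 14 -> 7, so Flatten
# feeds 7*7*64 = 3136 values into a 256-unit dense layer.

The summaries show that the second model both preserves more spatial detail before flattening and has a far larger dense head, independent of the conv stacking itself.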
lejlot
    Hi Wojciech, first of all, thanks a lot for your explanation; it really helps me think more clearly about neural networks. Now I understand why the first one performed so poorly in my trial. I also realized that my understanding is quite shallow and that I should review and read more on the topic. Secondly, big fan of DeepMind :) – Nahua Kang Oct 01 '17 at 18:51