How to design an optimal CNN?

Question

I am working on a Ph.D. project, which objective is to reduce CO2 emissions on Earth.

I have a dataset, and I was able to successfully implement a CNN, which gives 80% accuracy (worst-case scenario). However, the field where I work is very demanding, and I have the impression that I could get better accuracy with a well-optimized CNN.

How do experts design CNN's? How could I choose between Inception Modules, Dropout Regularization, Batch Normalization, convolutional filter size, size and depth of convolutional channels, number of fully-connected layers, activations neurons, etc? How do people navigate this large optimization problem in a scientific manner? The combinations are endless. Are there any real-life examples where this problem is navigated, addressing its full complexity (not just optimizing a few hyper-parameters)?

Hopefully, my dataset is not too large, so the CNN models that I am considering should have very few parameters.

Designing a CNN could be considered as an art, it requires expertise and knowledge. There is no formal way to design a CNN. You can start with some pre-trained models like InceptionV3 or ResNet, change their parameters, or use only the top N layers. Hyperparameter optimization could be performed to find the best combination of hyperparameters. You can obtain this knowledge from research papers as well, which tend to solve a problem similar to yours. — Shubham Panchal, Apr 11 '21 at 02:12
I’m voting to close this question because it is not about programming as defined in the [help] but about ML theory and/or methodology - please see the intro and NOTE in the `machine-learning` [tag info](https://stackoverflow.com/tags/machine-learning/info). — desertnaut, Apr 11 '21 at 09:06
@desertnaut I'm curious to know then how come this question still open and answered. ?? [How to design deep convolutional neural networks?](https://stackoverflow.com/questions/37280910/how-to-design-deep-convolutional-neural-networks/37283058) — Innat, Apr 11 '21 at 09:36
@M.Innat it's simply because we are all volunteers here, and we have only that much time and resources to spare, so such posts always slip. Thank you for bringing it into our attention and stay tuned - it will also be closed soon. — desertnaut, Apr 11 '21 at 09:41
@desertnaut I see you now voted for closing that question. That makes sense, thanks. However, still, it's a quite surprise that it's been 4 years (12k view) and nobody did that. — Innat, Apr 11 '21 at 09:42
@M.Innat you are an established user here; do you vote for closing questions yourself? If not, you have at least a (partial) answer. As said, the relevant resources are (extremely) limited, and there will *always* be posts that remain open without in fact conforming to the site rules & guidelines. That is not a reason for not doing what it takes for any such post that comes into our attention (like this one here). — desertnaut, Apr 11 '21 at 09:46
@M.Innat in fact, there was even a recent discussion at Meta to [gift wrap our (good) machine learning theory questions for Cross Validated](https://meta.stackoverflow.com/questions/404799/lets-gift-wrap-our-good-machine-learning-theory-questions-for-cross-validated) — desertnaut, Apr 11 '21 at 09:48
@M.Innat linked question is now also closed. Hope the situation is clear now, and please keep in mind the site rules & guidelines when asking & answering (I have done the mistake of answering such Qs myself in the past, when these rules were still not very clear in my own head). — desertnaut, Apr 11 '21 at 09:53
@desertnaut yes, I understand. I will keep this in mind. Thank you. -) — Innat, Apr 11 '21 at 12:05

Innat · Accepted Answer · 2021-04-11T07:21:27.007

How do experts design CNN's? How could I choose between Inception Modules, Dropout Regularization, Batch Normalization, convolutional filter size, size and depth of convolutional channels, number of fully-connected layers, activations neurons, etc? How do people navigate this large optimization problem in a scientific manner? The combinations are endless.

You said truly that the combinations are huge in number. And without approaching rightly you may end up with nowhere. A great one said machine Learning is an art, not science. Results are data-dependent. Here are a few tips regarding your above concern.

Log Everything: In the training time, save necessary logs of every experiment such as training loss, validation loss, weight files, execution times, visualization, etc. Some of them can be saved with CSVLogger, ModelCheckpoint etc. TensorBoard is a great tool for inspecting both training log and visualization and many more.
Strong Validation Strategies: This is very important. To build a stable Cross-Validation (CV), we must have a good understanding of the data and the challenges faced. We’ll check and make sure the validation set has a similar distribution to the training set and test set. And We’ll try to make sure our models improve both on our CV and on the test set (if gt is available for the test set). Basically, partitioning the data randomly is usually not enough to satisfy this. Understanding the data and how we can partition it without introducing a data leakage in our CV is key to avoid overfitting.
Change Only One: During the experiment, change one thing at a time and save the observations (logs) for those changes. For example: change the image size gradually from 224 (for example) to higher and observe the results. We should start with a small combination. While experimenting with image size, fix others like model architecture, learning rate, etc. The same goes for the learning rate part or model architectures. However, later we also may need to change more than one when we get some promising combinations. In kaggle competition, these are very common approaches one would follow. Below is a very simple example regarding this. But it's not limited any way.

However, as you said, your Ph.D. project is to reduce CO2 emissions on Earth. In my understanding, these are more application-specific problems and less than the algorithm-specific problems. So, we think it's better to take benefit from well-recognized pre-trained models.

In case if we wish to write our CNN on our own, we should give a decent time on it. Start with a very simple one, for example:

Conv2D (16,  3, 'relu') - > MaxPool (2)
Conv2D (32,  3, 'relu') - > MaxPool (2) 
Conv2D (64,  3, 'relu') - > MaxPool (2)
Conv2D (128, 3, 'relu') - > MaxPool (2)

Here we gradually increase the depth but reducing the feature dimension. By the end layer, more semantic information would emerge. While stacking Conv2D layers, it's common practice to increase the channel depth in such order 16, 32, 64, 128 etc. If we want to impute Inception or Residual Block inside our network, I think, we should do some basic math first about what feature properties will come out of this, etc. Following a concept like this, we may also wish to look at approaches like SENet, ResNeSt etc. About Dropout, if we observe that our model is getting overfitted during training, then we should add some. In the final layer, we may want to choose GlobalAveragePooling over the Flatten layer (FCC). We can probably now understand that there are lots of ablation studies that need to be done to get a satisfactory CNN model.

In this regard, We suggest you explore the two most important things: (1). Read one of the pre-trained model papers/blogs/videos about their strategies to build the algorithm. For example: check out this EfficientNet Explained. (2). Next, explore the source code of it. That would give your more sense and encourage you to build your own giant.

We like to end this with one last working example. See the model diagram below, it's a small inception network, source. If we look closely, we will see, it consists of the following three modules.

Conv Module
Inception Module
Downsample Modul

Take a close look at each module's configuration such as filter size, strides, etc. Let's try to understand and implement this module. Before that, here are two good references (1, 2) for the Inception concept to refresh the concept.

Conv Module

From the diagram we can see, it consists of one convolutional network, one batch normalization, and one relu activation. Also, it produces C times feature maps with K x K filters and S x S strides. To do that, we will create a class object that will inherit the tf.keras.layers.Layer classes

class ConvModule(tf.keras.layers.Layer):
    def __init__(self, kernel_num, kernel_size, strides, padding='same'):
        super(ConvModule, self).__init__()
        # conv layer
        self.conv = tf.keras.layers.Conv2D(kernel_num, 
                        kernel_size=kernel_size, 
                        strides=strides, padding=padding)

        # batch norm layer
        self.bn   = tf.keras.layers.BatchNormalization()


    def call(self, input_tensor, training=False):
        x = self.conv(input_tensor)
        x = self.bn(x, training=training)
        x = tf.nn.relu(x)
        
        return x

Inception Module

Next comes the Inception module. According to the above graph, it consists of two convolutional modules and then merges together. Now as we know to merge, here we need to ensure that the output feature maps dimension ( height and width ) needs to be the same.

class InceptionModule(tf.keras.layers.Layer):
    def __init__(self, kernel_size1x1, kernel_size3x3):
        super(InceptionModule, self).__init__()
        
        # two conv modules: they will take same input tensor 
        self.conv1 = ConvModule(kernel_size1x1, kernel_size=(1,1), strides=(1,1))
        self.conv2 = ConvModule(kernel_size3x3, kernel_size=(3,3), strides=(1,1))
        self.cat   = tf.keras.layers.Concatenate()


    def call(self, input_tensor, training=False):
        x_1x1 = self.conv1(input_tensor)
        x_3x3 = self.conv2(input_tensor)
        x = self.cat([x_1x1, x_3x3])
        return x

Here you may notice that we are now hard-coded the exact kernel size and strides number for both convolutional layers according to the network (diagram). And also in ConvModule, we have already set padding to the same, so that the dimension of the feature maps will be the same for both (self.conv1 and self.conv2); which is required in order to concatenate them to the end.

Again, in this module, two variable performs as the placeholder, kernel_size1x1, and kernel_size3x3. This is for the purpose of course. Because we will need different numbers of feature maps to the different stages of the entire model. If we look into the diagram of the model, we will see that InceptionModule takes a different number of filters at different stages in the model.

Downsample Module

Lastly the downsampling module. The main intuition for downsampling is that we hope to get more relevant feature information that highly represents the inputs to the model. As it tends to remove the unwanted feature so that model can focus on the most relevant. There are many ways we can reduce the dimension of the feature maps (or inputs). For example: using strides 2 or using the conventional pooling operation. There are many types of pooling operation, namely: MaxPooling, AveragePooling, GlobalAveragePooling.

From the diagram, we can see that the downsampling module contains one convolutional layer and one max-pooling layer which later merges together. Now, if we look closely at the diagram (top-right), we will see that the convolutional layer takes a 3 x 3 size filter with strides 2 x 2. And the pooling layer (here MaxPooling) takes pooling size 3 x 3 with strides 2 x 2. Fair enough, however, we also ensure that the dimension coming from each of them should be the same in order to merge at the end. Now, if we remember when we design the ConvModule we purposely set the value of the padding argument to same. But in this case, we need to set it to valid.

class DownsampleModule(tf.keras.layers.Layer):
    def __init__(self, kernel_size):
        super(DownsampleModule, self).__init__()

        # conv layer
        self.conv3 = ConvModule(kernel_size, kernel_size=(3,3), 
                         strides=(2,2), padding="valid") 

        # pooling layer 
        self.pool  = tf.keras.layers.MaxPooling2D(pool_size=(3, 3), 
                         strides=(2,2))
        self.cat   = tf.keras.layers.Concatenate()

    def call(self, input_tensor, training=False):
        # forward pass 
        conv_x = self.conv3(input_tensor, training=training)
        pool_x = self.pool(input_tensor)
    
        # merged
        return self.cat([conv_x, pool_x])

Okay, now we have built all three modules, namely: ConvModule InceptionModule DownsampleModule. Let's initialize their parameter according to the diagram.

class MiniInception(tf.keras.Model):
    def __init__(self, num_classes=10):
        super(MiniInception, self).__init__()

        # the first conv module
        self.conv_block = ConvModule(96, (3,3), (1,1))

        # 2 inception module and 1 downsample module
        self.inception_block1  = InceptionModule(32, 32)
        self.inception_block2  = InceptionModule(32, 48)
        self.downsample_block1 = DownsampleModule(80)
  
        # 4 inception module and 1 downsample module
        self.inception_block3  = InceptionModule(112, 48)
        self.inception_block4  = InceptionModule(96, 64)
        self.inception_block5  = InceptionModule(80, 80)
        self.inception_block6  = InceptionModule(48, 96)
        self.downsample_block2 = DownsampleModule(96)

        # 2 inception module 
        self.inception_block7 = InceptionModule(176, 160)
        self.inception_block8 = InceptionModule(176, 160)

        # average pooling
        self.avg_pool = tf.keras.layers.AveragePooling2D((7,7))

        # model tail
        self.flat      = tf.keras.layers.Flatten()
        self.classfier = tf.keras.layers.Dense(num_classes, activation='softmax')


    def call(self, input_tensor, training=True, **kwargs):
        # forward pass 
        x = self.conv_block(input_tensor)
        x = self.inception_block1(x)
        x = self.inception_block2(x)
        x = self.downsample_block1(x)

        x = self.inception_block3(x)
        x = self.inception_block4(x)
        x = self.inception_block5(x)
        x = self.inception_block6(x)
        x = self.downsample_block2(x)

        x = self.inception_block7(x)
        x = self.inception_block8(x)
        x = self.avg_pool(x)

        x = self.flat(x)
        return self.classfier(x)

The amount of filter number for each computational block is set according to the design of the model (see the diagram). After initialing all the blocks (in the __init__ function), we connect them according to the design (in the call function).

Thanks for the feedback @M.Innat. That is very helpful, I'm impressed by the depth of the answer. I will try to apply the strategies you mentioned. It seems that the CNN optimization process will take a few months :) — C-3PO, Apr 11 '21 at 13:15

score 1 · Answer 2 · answered Apr 11 '21 at 03:52

I think you are way off on your estimate of the number of parameters needed. Think more like a few million which is what you will get if you use transfer learning. You can struggle trying to make your own model if you wish but you will probable not be any better (and more likely no where near as good) as the results you will get from transfer learning. I highly recommend the MobileV2 model. Now you can make that or any of the other models perform better if you use an adjustable learning rate using ReduceLROnPlateau . Documentation for that is here. The other thing I recommend is to use the Keras callback EarlyStopping. Documentation is here. . Set it to monitor validation loss and set restore_best_weights=True. Set the number of epochs to a large number so this callback gets triggered and returns the model with the weights from the epoch with the lowest validation loss. My recommended code is shown below

height=224
width=224
img_shape=(height, width, 3)
dropout=.3
lr=.001
class_count=156 # number of classes
img_shape=(height, width, 3)
base_model=tf.keras.applications.MobileNetV2( include_top=False, input_shape=img_shape, pooling='max', weights='imagenet') 
x=base_model.output
x=keras.layers.BatchNormalization(axis=-1, momentum=0.99, epsilon=0.001 )(x)
x = Dense(512, kernel_regularizer = regularizers.l2(l = 0.016),activity_regularizer=regularizers.l1(0.006),
                bias_regularizer=regularizers.l1(0.006) ,activation='relu', kernel_initializer= tf.keras.initializers.GlorotUniform(seed=123))(x)
x=Dropout(rate=dropout, seed=123)(x)        
output=Dense(class_count, activation='softmax',kernel_initializer=tf.keras.initializers.GlorotUniform(seed=123))(x)
model=Model(inputs=base_model.input, outputs=output)
model.compile(Adamax(lr=lr), loss='categorical_crossentropy', metrics=['accuracy']) 
rlronp=tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=1, verbose=1, mode='auto', min_delta=0.0001, cooldown=0, min_lr=0)
estop=tf.keras.callbacks.EarlyStopping( monitor="val_loss", min_delta=0, patience=4,
                                       verbose=1,  mode="auto",  baseline=None,
                                        restore_best_weights=True)
callbacks=[rlronp, estop]

Also look at the balance in your data set. That is, compare how many training samples you have for each class. If the ratio of most samples/least samples>2 or 3 you may want to take action to mitigate that. Numerous methods are available, the simplest is to use the class_weight parameter in model.fit. o do that you need to create a class_weights dictionary. The process to do that is outline below

Lets say your class distribution is
 class0 - 500 samples
 class1- 2000 samples
 class2 - 1500 samples
 class3 - 200 samples
Then your dictionary would be
class_weights={0: 2000/500, 1:2000/2000, 2: 2000/1500, 3: 2000/200}
in model.fit set class_weight=class_weights

Thanks for the feedback @Gerry P, it was very helpful. I will try the methods you mentioned. However, I think that I do need to design a small customized CNN, we don't have that much training data, and the problem is highly different to traditional Computer Vision tasks. — C-3PO, Apr 12 '21 at 09:48

How to design an optimal CNN?

2 Answers2

Conv Module

Inception Module

Downsample Module

Linked