Honestly, the only way to know is to load the model and inspect it:

```python
from tensorflow.keras.applications import MobileNet

model = MobileNet()
model.summary()
```
Indeed, when you check the results, the only dedicated depthwise layer present is `DepthwiseConv2D`.

In fact, inspecting the output of `model.summary()` yields the following (note that this is one block of Depthwise + Pointwise):
```
conv_pad_6 (ZeroPadding2D)   (None, 29, 29, 256)  0
_________________________________________________________________
conv_dw_6 (DepthwiseConv2D)  (None, 14, 14, 256)  2304
_________________________________________________________________
conv_dw_6_bn (BatchNormaliza (None, 14, 14, 256)  1024
_________________________________________________________________
conv_dw_6_relu (ReLU)        (None, 14, 14, 256)  0
_________________________________________________________________
conv_pw_6 (Conv2D)           (None, 14, 14, 512)  131072
_________________________________________________________________
conv_pw_6_bn (BatchNormaliza (None, 14, 14, 512)  2048
_________________________________________________________________
conv_pw_6_relu (ReLU)        (None, 14, 14, 512)  0
```
The first three layers perform the depthwise convolution, while the pointwise convolution is performed by the last three. You can tell from the layer names which layers belong to the first operation (`dw`) and which to the second (`pw`).
By inspecting those layers we can also see the order of the operations, i.e. that batch normalization takes place before the ReLU activation. This holds for both the depthwise and the pointwise convolution, as you can see in the summary above.
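As a sanity check on the numbers above, the parameter counts of the two steps follow directly from the layer shapes. Here is a minimal sketch; the 256/512 channel counts and the 3x3 depthwise kernel are taken from the summary block, and MobileNet's conv layers carry no bias (the biases are absorbed by the batch-normalization layers):

```python
# Parameter counts for the conv_dw_6 / conv_pw_6 block shown above.
c_in, c_out, k = 256, 512, 3  # input/output channels and depthwise kernel size

# Depthwise step: one k x k filter per input channel.
dw_params = k * k * c_in
print(dw_params)  # 2304, matching conv_dw_6

# Pointwise step: an ordinary 1x1 convolution mixing channels.
pw_params = 1 * 1 * c_in * c_out
print(pw_params)  # 131072, matching conv_pw_6
```

The batch-normalization counts follow the same logic: 4 parameters per channel (gamma, beta, moving mean, moving variance), i.e. 4 * 256 = 1024 and 4 * 512 = 2048.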
However, your observation is indeed a good one: there is no layer explicitly labelled as a 1x1 convolution in the architecture, at least as per `model.summary()`; the pointwise step appears as an ordinary `Conv2D` (the `conv_pw_*` layers) with a 1x1 kernel.
From the Keras/TF documentation:

> Depthwise separable 2D convolution. Depthwise Separable convolutions consist of performing just the first step in a depthwise spatial convolution (which acts on each input channel separately).
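To make the two steps concrete, here is a minimal NumPy sketch of a depthwise convolution followed by a 1x1 pointwise convolution. The shapes are arbitrary illustration values (not taken from MobileNet), with "valid" padding and stride 1:

```python
import numpy as np

def depthwise_conv(x, dw):
    # x: (H, W, C), dw: (k, k, C) -- one k x k filter per input channel
    k = dw.shape[0]
    H, W, C = x.shape
    out = np.zeros((H - k + 1, W - k + 1, C))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # each channel is filtered independently; no mixing across channels
            out[i, j] = np.sum(x[i:i + k, j:j + k] * dw, axis=(0, 1))
    return out

def pointwise_conv(x, pw):
    # pw: (C_in, C_out) -- a 1x1 convolution is a per-pixel linear mix of channels
    return x @ pw

x = np.random.rand(14, 14, 8)   # toy feature map
dw = np.random.rand(3, 3, 8)    # depthwise kernels
pw = np.random.rand(8, 16)      # pointwise (1x1) kernel

y = pointwise_conv(depthwise_conv(x, dw), pw)
print(y.shape)  # (12, 12, 16)
```

The depthwise step changes only the spatial dimensions, while the pointwise step changes only the channel count, mirroring what the `conv_dw_6` and `conv_pw_6` layers do in the summary above.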