
When defining a segmentation network for RGB images, such as the network in the fcn-xs example in MXNet, the input RGB image layer is fed through multiple convolutions, activations, poolings, etc.

A convolution, for example, is defined as follows: mxnet.symbol.Convolution(data=input, kernel=(3, 3), pad=(1, 1), num_filter=64, workspace=workspace_default, name="conv1_1")
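For context, a minimal runnable sketch of such a layer (the workspace size and the input variable name are assumptions, not fixed by the fcn-xs example):

```python
import mxnet as mx

# Minimal sketch of the layer above; `workspace_default` (in MB) is an
# assumed value, not the one used in fcn-xs.
workspace_default = 1024
data = mx.symbol.Variable("data")  # RGB input, shape (batch, 3, H, W)
conv1_1 = mx.symbol.Convolution(data=data, kernel=(3, 3), pad=(1, 1),
                                num_filter=64, workspace=workspace_default,
                                name="conv1_1")
```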

On the one hand, the convolution filters here look 2D, suggesting that each color channel R, G, B is processed separately. On the other hand, it is well known from neuroscience that the relevant features lie in color contrast rather than in the individual color channels, i.e., the colors should be subtracted from each other, e.g., red minus green or blue minus yellow.

How can this be enforced by the network structure? How are the R, G, B components mixed and combined?

Eldar Ron
  • (strange talking to myself....) I guess that by defining kernel=(2, 1, 1), pad=(0, 0, 0) I can have a degenerate 3-D filter that also operates on the color dimension. With some luck (a.k.a. proper training) I will obtain color contrast at the output, i.e., the filter will have two coefficients, one positive and one negative, that sum more or less to zero. What really surprises me is that the FCN authors have not thought of that. Am I on the right path??? – Eldar Ron Feb 05 '17 at 17:49

1 Answer


It turns out that convolutions in MXNet are effectively 3D: the first two dimensions reflect the image coordinates, while the third dimension reflects the depth, i.e., the dimension of the feature space. For an RGB image at the input layer, the depth is 3 (unless it is a grayscale image, which has depth == 1). For any other layer, the depth is the number of features (filters) of the preceding layer.
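One way to verify this is to infer the argument shapes of the layer in the question (a sketch; the 224x224 input size is an arbitrary assumption). The weight tensor of conv1_1 spans all three input channels:

```python
import mxnet as mx

data = mx.symbol.Variable("data")
conv = mx.symbol.Convolution(data=data, kernel=(3, 3), pad=(1, 1),
                             num_filter=64, name="conv1_1")

# Infer shapes for a batch of one 224x224 RGB image (arbitrary size).
arg_shapes, out_shapes, _ = conv.infer_shape(data=(1, 3, 224, 224))
print(dict(zip(conv.list_arguments(), arg_shapes)))
# {'data': (1, 3, 224, 224),
#  'conv1_1_weight': (64, 3, 3, 3),  # each filter spans all 3 input channels
#  'conv1_1_bias': (64,)}
```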

It is therefore not necessary to specify the convolution along the depth dimension explicitly; it is always (implicitly) included. As a result, color contrast and other features involving data from several channels can be extracted. For example, adding up horizontal and vertical features can yield a corner detector.
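As an illustration (a hand-crafted sketch, not part of the fcn-xs example), a 1x1 filter with fixed weights [1, -1, 0] across the R, G, B channels computes exactly the red-minus-green contrast discussed in the question:

```python
import mxnet as mx

# Sketch: a fixed 1x1 convolution whose weights are [1, -1, 0] over the
# R, G, B channels yields a red-minus-green contrast map.
x = mx.nd.random.uniform(shape=(1, 3, 4, 4))             # fake RGB image
w = mx.nd.array([1.0, -1.0, 0.0]).reshape((1, 3, 1, 1))  # one filter, depth 3
y = mx.nd.Convolution(data=x, weight=w, kernel=(1, 1),
                      num_filter=1, no_bias=True)
# The output matches R - G computed by hand (difference is ~0).
print(mx.nd.abs(y - (x[:, 0:1] - x[:, 1:2])).max())
```

In a trained network nothing needs to be hand-set like this; the optimizer is free to learn such opponent-color weights on its own, which is why no special network structure is required.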

Eldar Ron