
I have read the documentation about the group param:

group (g) [default 1]: If g > 1, we restrict the connectivity of each filter to a subset of the input. Specifically, the input and output channels are separated into g groups, and the ith output group channels will be only connected to the ith input group channels.

But first of all, I do not understand exactly what this means. And secondly, why would I use it? Could anyone explain it a bit better?

As far as I have understood it, it means the following:

If I set g greater than 1, my input and output channels are separated into groups. But how exactly is that done? If I set it to 20 and my input is 40, will I have two groups of 20? And if the output is 50, will I have one group of 20 and one group of 30?

  • If you set `group = 2` in a convolution layer, this layer will be split into 2 separate branches (from input to output), and the layer's output is composed of the 2 branches' convolution results. – Dale Nov 30 '16 at 02:00
  • OK thanks, but in the end it doesn't make a difference whether I use group or not? It is just for a faster computation time? @Dale –  Nov 30 '16 at 12:13

3 Answers


And secondly, why would I use [grouping]?

This was originally presented as an optimization in the paper that sparked the current cycle of neural network popularity:

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks." In Advances in Neural Information Processing Systems, pp. 1097-1105. 2012.

Figure 2 shows how grouping was used for that work. The authors of Caffe originally added this ability so they could replicate the AlexNet architecture. However, grouping continues to show itself as beneficial in other scenarios.

For example, both Facebook and Google have released papers which essentially show that grouping can dramatically reduce resource use while helping to preserve accuracy. The Facebook paper is ResNeXt and the Google paper is MobileNets.
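
To make the resource saving concrete, here is a minimal sketch of the weight count with and without grouping (the layer sizes are made up for illustration):

# Weights of a conv layer have shape (c_out, c_in / g, k, k),
# so the parameter count scales as 1/g.
def conv_params(c_in, c_out, k, g=1):
    assert c_in % g == 0 and c_out % g == 0
    return c_out * (c_in // g) * k * k

print(conv_params(256, 256, 3, g=1))   # 589824
print(conv_params(256, 256, 3, g=32))  # 18432, i.e. 32x fewer weights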

twerdster
Martin Thoma
  • Hm, I am not so sure about that. I have a network, and in my `Deconvolution` param I use group = num_output, which makes my net converge. If I omit the group tag it does not converge anymore... or at least not as fast (I might not have trained it long enough). @Martin Thoma –  Dec 02 '16 at 10:07
  • @thigi I've linked a paper you might be interested in. – Martin Thoma Jan 09 '17 at 16:41
  • 3
    This answer is actually very useful. It shouldnt have been downvoted. – twerdster May 09 '17 at 07:57
  • @MartinThoma: Thanks for your reference. In the paper, how do I set the group parameter for a deconvolution layer? – John May 21 '17 at 14:56

The argument gives the quantity of groups, not the size. If you have 40 inputs and set g to 20, you'll get 20 "lanes" of 2 channels each; with 50 outputs, you'd get 10 groups of 2 and 10 groups of 3.

More often, you split into a small number of groups, such as 2. In that case, you'd have two processing "lanes" or groups. For the 40=>50 layer you mention, each group would have 20 inputs and 25 outputs. Each layer will split in half, with each set of forward and backward propagation working only within its own half, for the range of layers over which the group parameter applies (I think it's all the way to the final layer).

The processing advantage is that instead of 40^2 input connections, you have 2 groups of 20^2 connections, or half as many. This accelerates the processing by roughly 2x, with a very small loss in convergence progress.
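
A quick back-of-the-envelope version of that count (a toy calculation that ignores kernel size and spatial extent):

# Channel-to-channel links in a 40-channel -> 40-channel layer
ungrouped = 40 * 40        # every output channel sees every input channel
grouped = 2 * (20 * 20)    # two lanes, fully connected only internally
print(ungrouped, grouped)  # 1600 800 -> half the connections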

Prune
  • I have seen someone set group = num_output in a Deconvolution layer. Why would people do that? –  Nov 30 '16 at 12:11
  • Does it make a big difference if I just omit the group param? @Prune –  Nov 30 '16 at 12:11
  • I have a network with a `Deconvolution` layer. If I set group = num_output my net converges, if I omit the group param it does not converge. Can you explain that? @Prune –  Nov 30 '16 at 12:52
  • This looks like a job for visual debugging tools. I can't do much without a detailed look at the topology and some good visualization into the progression of training. Have you applied any visualization tools to your intermediate layers to examine where the convergence differences come up? – Prune Nov 30 '16 at 16:53
  • I could make a couple of obvious guesses from the data, but this is more human back-propagation than any true understanding. *group = num_output* will converge your net all too quickly, as each channel quickly converges to its simple result. The question is whether that convergence is to a point that's actually a solution to your prediction problem. – Prune Nov 30 '16 at 16:55
  • No, not at all. What do you mean by good visualization? Sorry if I seem a bit stupid, but I am quite new to this area. So would it make more sense to omit the group param? But then I think my net does not converge anymore. @Prune –  Nov 30 '16 at 22:43
  • What are visual debugging tools? Can you name some? @Prune –  Nov 30 '16 at 22:46
  • Suggesting visualization tools is off-topic for Stack Overflow. They depend very much on your current software environment. You'll need to research some to find out what works for your configuration. – Prune Nov 30 '16 at 23:09
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/129480/discussion-between-prune-and-thigi). – Prune Nov 30 '16 at 23:10
  • Ok, but what is the actual meaning of **group** then? Or why does one use it? Why does it even exist? @Prune –  Dec 01 '16 at 01:36

First of all, Caffe only defines the behavior when both input_channel and output_channel are multiples of group. We can confirm this from the source code:

CHECK_EQ(channels_ % group_, 0);
CHECK_EQ(num_output_ % group_, 0)
  << "Number of output should be multiples of group.";

Secondly, the parameter group is related to the shape of the filter parameters, specifically to the channel depth of each filter: the actual channel depth of each filter is input_channel / group. This can also be confirmed from the source code:

vector<int> weight_shape(2);
weight_shape[0] = conv_out_channels_;
weight_shape[1] = conv_in_channels_ / group_;

Note here that weight_shape[0] is the number of filters.
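
In numpy terms, the allocated weight blob for the 40-input example looks like this (a minimal sketch; the variable names are mine, not Caffe's):

import numpy as np

c_in, c_out, g, k = 40, 20, 20, 3
weights = np.zeros((c_out, c_in // g, k, k))  # weight_shape from the source above
print(weights.shape)  # (20, 2, 3, 3): 20 filters, each only 2 channels deep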


So, w.r.t. your question:

In Caffe, if input_channel is 40 and group is 20:

  1. the output_channel cannot be 50, because 50 is not a multiple of the group (20).
  2. if output_channel is 20 (remember, it means you have 20 filters), each pair of input channels is responsible for one output channel. For example, the 0th output channel is computed from the 0th and 1st input channels and has no relationship with the other input channels (see the sketch after this list).
  3. if the group is also raised to match the channel count (group = input_channel = output_channel = 40), this is the well-known depthwise convolution: each output channel is computed from exactly one input channel.
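
A toy numpy sketch of that connectivity (a 1x1 kernel at a single spatial position, purely to illustrate which input channels each output channel may see; the function name and shapes are mine, not Caffe's):

import numpy as np

# Toy grouped convolution: 1x1 kernel, one pixel, to show connectivity only.
def grouped_conv_1x1(x, w, g):
    # x: (c_in,) input channels; w: (c_out, c_in // g) filter weights
    c_out, c_in = w.shape[0], x.shape[0]
    y = np.empty(c_out)
    for i in range(c_out):
        grp = i // (c_out // g)             # group that filter i belongs to
        lo = grp * (c_in // g)              # first input channel of that group
        y[i] = w[i] @ x[lo:lo + c_in // g]  # only its own group's inputs
    return y

x = np.arange(40.0)
print(grouped_conv_1x1(x, np.ones((20, 2)), g=20)[:3])  # [1. 5. 9.]

Output channel 0 is x[0] + x[1], channel 1 is x[2] + x[3], and so on: each output channel touches only its own pair of input channels.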

W.r.t. Deconvolution:

We almost always set group = output_channels. Here is the suggested config for a Deconvolution layer from the official doc:

layer {
  name: "upsample", type: "Deconvolution"
  bottom: "{{bottom_name}}" top: "{{top_name}}"
  convolution_param {
    kernel_size: {{2 * factor - factor % 2}} stride: {{factor}}
    num_output: {{C}} group: {{C}}
    pad: {{ceil((factor - 1) / 2.)}}
    weight_filler: { type: "bilinear" } bias_term: false
  }
  param { lr_mult: 0 decay_mult: 0 }
}

with the following instruction:

By specifying num_output: {{C}} group: {{C}}, it behaves as channel-wise convolution. The filter shape of this deconvolution layer will be (C, 1, K, K) where K is kernel_size, and this filler will set a (K, K) interpolation kernel for every channel of the filter identically. The resulting shape of the top feature map will be (B, C, factor * H, factor * W). Note that the learning rate and the weight decay are set to 0 in order to keep coefficient values of bilinear interpolation unchanged during training.
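
For reference, the (K, K) interpolation kernel that the bilinear filler sets for each channel can be reproduced in numpy like this (a sketch following the well-known FCN upsampling construction, not Caffe's actual filler code):

import numpy as np

# Bilinear interpolation kernel of size k (k = 2 * factor - factor % 2)
def bilinear_kernel(k):
    factor = (k + 1) // 2
    center = factor - 1 if k % 2 == 1 else factor - 0.5
    og = np.ogrid[:k, :k]
    return ((1 - abs(og[0] - center) / factor) *
            (1 - abs(og[1] - center) / factor))

print(bilinear_kernel(4))  # kernel for factor = 2: 2 * 2 - 2 % 2 = 4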

qun