
Does anyone know what computations take place inside the Caffe softmax layer?

I am using a pre-trained network with a softmax layer at the end.

In the testing phase, for a simple forward pass of an image, the output of the second-to-last layer ("InnerProduct") is the following: -0.20095, 0.39989, 0.22510, -0.36796, -0.21991, 0.43291, -0.22714, -0.22229, -0.08174, 0.01931, -0.05791, 0.21699, 0.00437, -0.02350, 0.02924, -0.28733, 0.19157, -0.04191, -0.07360, 0.30252

The last layer's ("Softmax") output consists of the following values: 0.00000, 0.44520, 0.01115, 0.00000, 0.00000, 0.89348, 0.00000, 0.00000, 0.00002, 0.00015, 0.00003, 0.00940, 0.00011, 0.00006, 0.00018, 0.00000, 0.00550, 0.00004, 0.00002, 0.05710

If I apply a softmax (using an external tool, like MATLAB) to the inner product layer's output, I get the following values: 0.0398, 0.0726, 0.0610, 0.0337, 0.0391, 0.0751, 0.0388, 0.0390, 0.0449, 0.0496, 0.0460, 0.0605, 0.0489, 0.0476, 0.0501, 0.0365, 0.0590, 0.0467, 0.0452, 0.0659

The latter makes sense to me, since the probabilities add up to 1.0 (notice that the sum of Caffe's Softmax layer values is > 1.0).
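A quick check (NumPy here, though any tool works) of the values listed above confirms that the "prob" output sums to roughly 1.42 rather than 1.0:

import numpy as np

# The "prob" (Softmax) layer output quoted above.
caffe_prob = np.array([0.00000, 0.44520, 0.01115, 0.00000, 0.00000,
                       0.89348, 0.00000, 0.00000, 0.00002, 0.00015,
                       0.00003, 0.00940, 0.00011, 0.00006, 0.00018,
                       0.00000, 0.00550, 0.00004, 0.00002, 0.05710])

print(caffe_prob.sum())   # ~1.42, clearly greater than 1.0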

Apparently, the softmax layer in Caffe is not a straightforward softmax operation.

(I do not think that it makes any difference, but I will just mention that I am using the pre-trained Flickr style network, see the description here).

EDIT:

Here is the definition of the last two layers in the prototxt. Notice that the type of the last layer is "Softmax".

layer {
  name: "fc8_flickr"
  type: "InnerProduct"
  bottom: "fc7"
  top: "fc8_flickr"
  param {
    lr_mult: 10
    decay_mult: 1
  }
  param {
    lr_mult: 20
    decay_mult: 0
  }
  inner_product_param {
    num_output: 20
    weight_filler {
      type: "gaussian"
      std: 0.01
    }
    bias_filler {
      type: "constant"
      value: 0
    }
  }
}
layer {
  name: "prob"
  type: "Softmax"
  bottom: "fc8_flickr"
  top: "prob"
}
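
For reference, a minimal pycaffe sketch of how the two outputs quoted above could be read out (the deploy prototxt and weights file names here are placeholders, and the input blob is assumed to be filled already, e.g. by a transformer):

import caffe

# Placeholder file names -- substitute the actual deploy prototxt and caffemodel.
net = caffe.Net('deploy.prototxt', 'finetune_flickr_style.caffemodel', caffe.TEST)

net.forward()  # assumes net.blobs['data'] was already populated with the image

fc8  = net.blobs['fc8_flickr'].data[0]   # output of the InnerProduct layer
prob = net.blobs['prob'].data[0]         # output of the Softmax layer

print(fc8)
print(prob, prob.sum())                  # for a true softmax, this sum should be ~1.0
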
GrimFix
  • you can read the code yourself: https://github.com/BVLC/caffe/blob/master/src/caffe/layers/softmax_layer.cpp – Shai May 17 '17 at 11:33
  • "softmax" does _not_ calculate probabilities. It fundamentally can't, as you discovered it doesn't even enforce that the sum of all outputs is <=1.0. Exclusive probabilities don't necessarily sum to 1.0 ("none of the above" can have non-zero probability), but they can never sum to >1.0. ("none of the above" cannot have a negative probability) – MSalters May 17 '17 at 12:04
  • does the softmax unit test pass on your caffe version? – Shai May 18 '17 at 07:05
  • @Shai Yes, all tests complete successfully. To clarify, I do not assume that there is a bug or something in Caffe; I just want to know what operations take place in the Softmax layer. – GrimFix May 18 '17 at 09:43
  • @Shai I checked the code and I cannot say I understand what is happening. Most of the comments describe a softmax operation ("...subtract the max", "exponentiation", "sum after exp", "division") but some scaling also takes place. – GrimFix May 18 '17 at 09:46
  • @MSalters I know that softmax does not calculate probabilities. Yet, per the Wikipedia page, the softmax operation is defined as "a function that 'squashes' a K-dimensional vector of arbitrary real values to a K-dimensional vector of real values in the range [0, 1] that ADD UP TO 1". I am not saying that Caffe is buggy or something, but I am missing something here. – GrimFix May 18 '17 at 09:51

1 Answer


The results you are getting are weird.
The operations carried out by the "Softmax" layer's forward method are:

1. Find the maximum of the input vector: m = max(x_i).
2. Subtract that maximum from every element: x_i <- x_i - m.
3. Exponentiate every element: x_i <- exp(x_i).
4. Sum the exponentiated values: S = sum(x_i).
5. Divide every element by that sum: x_i <- x_i / S.

(Note that the first two steps are carried out to prevent overflow in the computation.)
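
A minimal NumPy sketch of these steps (not Caffe's code, just the same arithmetic) applied to the inner-product output from the question reproduces the MATLAB result, not the values reported by the "prob" layer:

import numpy as np

# Inner product ("fc8_flickr") output copied from the question.
x = np.array([-0.20095, 0.39989, 0.22510, -0.36796, -0.21991,
               0.43291, -0.22714, -0.22229, -0.08174, 0.01931,
              -0.05791, 0.21699, 0.00437, -0.02350, 0.02924,
              -0.28733, 0.19157, -0.04191, -0.07360, 0.30252])

x = x - x.max()        # steps 1-2: subtract the maximum (overflow guard)
e = np.exp(x)          # step 3: exponentiate
p = e / e.sum()        # steps 4-5: normalize by the sum

print(np.round(p, 4))  # ~0.0398, 0.0726, 0.0610, ... as in the question
print(p.sum())         # 1.0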

Shai
  • I don't think the reason is numerical stability (the algorithm is stable regardless). It does prevent overflow, though: after step 2, all values are non-positive, so after step 3 all values are <=1.0. This prevents overflow. Now, you may get _underflow_ instead, but for a Soft**Max** operation the small terms don't really matter anyway. – MSalters May 18 '17 at 10:02
  • Thank you for taking the time to answer and clarify the steps. This is what I do, and I don't get the same results as Caffe's Softmax layer. Try the following simple MATLAB code: `v = [-0.20095, 0.39989, 0.22510, -0.36796, -0.21991, 0.43291, -0.22714, -0.22229, -0.08174, 0.01931, -0.05791, 0.21699, 0.00437, -0.02350, 0.02924, -0.28733, 0.19157, -0.04191, -0.07360, 0.30252]; r_max = max(v); v = v - r_max; v_exp = exp(v); v_exp_sum = sum(v_exp); v_softmax = v_exp ./ v_exp_sum; disp(v_softmax)` Is there something different in my MATLAB code from what you suggested? – GrimFix May 18 '17 at 14:32
  • @GrimFix this is why I said your result is weird. – Shai May 18 '17 at 14:49
  • @GrimFix I ran into the same behaviour. Did you get any insight on this? I think it might have to do with the "axis" property of Softmax (I read about this parameter somewhere in the Caffe layer descriptions). – Alexey Chernyavskiy Sep 28 '17 at 15:08
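
To illustrate the last comment's point about the "axis" property, here is a small NumPy sketch with hypothetical numbers (it only shows the effect of the axis choice, not that this is the actual cause of the behaviour above): normalizing a 2-D score array over the wrong axis produces rows that no longer sum to 1.

import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

scores = np.random.randn(4, 20)       # hypothetical batch of 4 score vectors

per_sample = softmax(scores, axis=1)  # normalizes each row (each sample)
per_class  = softmax(scores, axis=0)  # normalizes each column instead

print(per_sample.sum(axis=1))         # all ~1.0
print(per_class.sum(axis=1))          # generally != 1.0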