
Say I have a vector a, with an index vector b of the same length. The indices are in the range 0 to N-1, corresponding to N groups. How can I do a softmax for every group without a for loop?

I'm doing some sort of attention operation here. The number of elements in each group is not the same, so I can't reshape a into a matrix and use the dim argument of the standard Softmax() API.

Toy example:

a = torch.rand(10)
a: tensor([0.3376, 0.0557, 0.3016, 0.5550, 0.5814, 0.1306, 0.2697, 0.9989, 0.4917,
        0.6306])
b = torch.randint(0, 3, (10,), dtype=torch.int64)
b: tensor([1, 2, 0, 2, 2, 0, 1, 1, 1, 1])

I want to do softmax like

for index in range(3):
    torch.softmax(a[b == index], dim=0)

but without the for loop to save time.
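
For reference, something along these lines might work with index_add_ (a rough sketch, assuming b is a flat 1-D int64 tensor of group indices, and skipping the usual max-subtraction for numerical stability):

import torch

a = torch.rand(10)
b = torch.randint(0, 3, (10,), dtype=torch.int64)

exp_a = a.exp()
group_sums = torch.zeros(3).index_add_(0, b, exp_a)  # sum of exp(a) within each group
soft_a = exp_a / group_sums[b]                       # each group of soft_a now sums to 1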

Zhang Yu
  • Correct me if I'm wrong, but what you are looking for is not a softmax, but an element-wise mean squared error, right? Since your computation for softmax would assume that the total sum of values along the main axis in `a` sums to 1. – dennlinger Jan 21 '19 at 09:11
  • Thanks for replying. Actually I'm not computing a loss here. The indices in `b` are better thought of as groups rather than classes. I want a softmax probability for every **scalar** in `a` that belongs to the same index, then use these probabilities as weights for later computation. Thus the output for each index sums to 1; in the N-group example, the outputs should sum to N in total. – Zhang Yu Jan 23 '19 at 08:55

1 Answer


Maybe this answer will have to change slightly based on a potential response to my comment, but I'm just going ahead and throwing in my two cents about Softmax.

Generally, the formula for softmax is explained rather well in the PyTorch documentation, where we can see that it is the exponential of the current value, divided by the sum of the exponentials over all classes.
The reason for doing this is founded in probability theory, and probably a little outside of my comfort zone, but essentially it helps you maintain a rather simple backpropagation derivative when you use it in combination with a popular loss strategy called "Cross Entropy Loss" (CE) (see the corresponding function in PyTorch here).
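
To make the formula concrete, here is a minimal sketch of softmax computed by hand (assuming a 1-D tensor a):

import torch

a = torch.rand(10)
manual = a.exp() / a.exp().sum()                        # exponential of each value / sum of all exponentials
print(torch.allclose(manual, torch.softmax(a, dim=0)))  # True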

Furthermore, you can also see in the description for CE that it automatically combines two functions, namely a (numerically stable) log-softmax, as well as the negative log likelihood loss (NLLLoss).
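
In code, that combination looks roughly like this (a sketch with made-up shapes and labels):

import torch

logits = torch.randn(4, 3)            # 4 samples, 3 classes (toy values)
targets = torch.tensor([0, 2, 1, 1])  # toy class labels
ce = torch.nn.CrossEntropyLoss()(logits, targets)
nll = torch.nn.NLLLoss()(torch.log_softmax(logits, dim=1), targets)
print(torch.allclose(ce, nll))        # True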

Now, to tie back to your original question, and hopefully resolve your issue:
For the sake of the question - and the way you asked it - it seems you are playing around with the popular MNIST handwritten digit dataset, in which you want to predict some values for the current input image.

I am also assuming that your output a will at some point be the output of a layer of a neural network. It does not matter whether it is squashed to a specific range or not (e.g., by applying some form of activation function), since the softmax is essentially a normalization. Specifically, it gives us, as discussed before, some form of distribution across all the predicted values, which sums to 1 across all classes. To do this, we can simply apply something like

soft_a = torch.softmax(a, dim=0) # throws an error if we don't specify the dim
print(torch.sum(soft_a)) # should print tensor(1.)

Now, if we assume that you want to do the "classical" MNIST example, you could then use the argmax() function to predict which value your system thinks is the correct answer, and calculate an error based off that, e.g., with the nn.NLLLoss() function.
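
A minimal sketch of that pattern (with made-up shapes and a hypothetical label):

import torch

log_probs = torch.log_softmax(torch.rand(1, 10), dim=1)  # one sample, 10 classes (toy values)
prediction = log_probs.argmax(dim=1)                     # class the model considers most likely
target = torch.tensor([3])                               # hypothetical ground-truth label
loss = torch.nn.NLLLoss()(log_probs, target)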

If you are indeed predicting values for each position in a single output, you have to think about this slightly differently.
First of all, softmax() ceases to make sense here, since it computes a probability distribution across multiple outputs, and unless you are fairly certain that those outputs depend on one another in a very specific way, I would argue that this is not the case here.

Also, keep in mind that you are then looking to calculate a pairwise loss, i.e. something for every index of your output. The function that comes to mind for this specific purpose would be nn.BCELoss(), which calculates a binarized (element-wise) version of Cross-Entropy. For this, you can then simply plug in your original prediction tensor a, as well as your ground truth tensor b. A minimal example for this would look like this:

bce = torch.nn.BCELoss(reduction="none") # keep the loss for each element separate
loss = bce(a, b.float()) # tensor with the respective element-wise losses; BCELoss expects float tensors

If you are interested in a single loss, you can obviously use BCELoss with a different argument for reduction, as described in the docs. Let me know if I can clarify some parts of the answer for you.

EDIT: Something else to keep in mind here: BCELoss() requires you to feed in predictions that can actually get close to the values you want to predict. This is especially a problem if you first pass your values through an activation function (e.g., sigmoid or tanh), since their bounded output range means they can never exactly reach the value you want to predict!
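
For example (a small sketch of the ranges involved):

import torch

raw = torch.tensor([2.0, -3.0])
print(torch.sigmoid(raw))  # strictly between 0 and 1, so it only ever approaches the targets 0 and 1
print(torch.tanh(raw))     # strictly between -1 and 1, partly outside the [0, 1] range BCELoss expects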

dennlinger
  • Thanks for replying, I left a comment below the question. Maybe I should put my question more clearly. By the way, shouldn't the **target** for `BCELoss()` be only 0 or 1? – Zhang Yu Jan 23 '19 at 08:59