
I do not have a clear idea of how labels for the softmax classifier should be shaped.

From my experiments, I understand that one option is a scalar label giving the index of the class in the probability output, while another is a 2D label whose rows are class probabilities, i.e. a one-hot encoded variable like c(1, 0, 0).
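
To make this concrete, here is how I picture the two encodings in plain R (just my own illustration of the math, not MXNet code; as far as I can tell MXNet uses 0-based class indices):

    # a scalar label is the (0-based) index of the true class
    label_index <- 0

    # the equivalent one-hot vector has a 1 at that index and 0 elsewhere
    n_classes    <- 3
    label_onehot <- replace(numeric(n_classes), label_index + 1, 1)
    label_onehot                        # c(1, 0, 0)

    # with softmax probabilities p, the cross-entropy is the same either way
    p <- c(0.7, 0.2, 0.1)
    -log(p[label_index + 1])            # scalar-index form
    -sum(label_onehot * log(p))         # one-hot form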

What puzzles me though is that:

  • I can use scalar label values that go beyond the valid class indices, like 4 in my example below, without any warning or error. Why is that?
  • When my label is a negative scalar or an array containing a negative value, the model converges to a uniform probability distribution over classes. For example, is it expected that actor_train.y = matrix(c(0, -1, 0), ncol = 1) results in equal probabilities in the softmax output?
  • I am trying to use the MXNet softmax classifier for policy-gradient reinforcement learning, and my negative rewards lead to the issue above: uniform probabilities. Is that expected?

    require(mxnet)

    actor_initializer <- mx.init.Xavier(rnd_type = "gaussian", factor_type = "avg", magnitude = 0.0001)

    actor_nn_data  <- mx.symbol.Variable('data')
    actor_nn_label <- mx.symbol.Variable('label')

    device.cpu <- mx.cpu()

    # NN architecture

    # fully connected layer with 3 outputs, one per class
    actor_fc3 <- mx.symbol.FullyConnected(data = actor_nn_data, num_hidden = 3)

    actor_output <- mx.symbol.SoftmaxOutput(data = actor_fc3, label = actor_nn_label, name = 'actor')

    # cross-entropy between the label and the predicted probabilities
    crossentfunc <- function(label, pred) { -sum(label * log(pred)) }

    actor_loss <- mx.metric.custom(feval = crossentfunc, name = "log-loss")

    # initialize and train the NN

    # a single training example with 11 features
    actor_train.x <- matrix(rnorm(11), nrow = 1)

    actor_train.y = 0  # also tried 1, 2, 3, -3 and matrix(c(0, 0, -1), ncol = 1)

    if (exists("actor_model")) rm(actor_model)  # drop any previously trained model

    actor_model <- mx.model.FeedForward.create(
      symbol = actor_output,
      X = actor_train.x,
      y = actor_train.y,
      ctx = device.cpu,
      num.round = 100,
      array.batch.size = 1,
      optimizer = 'adam',
      eval.metric = actor_loss,
      clip_gradient = 1,
      wd = 0.01,
      initializer = actor_initializer,
      array.layout = "rowmajor"
    )

    # inspect the predicted class probabilities
    predict(actor_model, actor_train.x, array.layout = "rowmajor")

Alexey Burnakov

1 Answer


It is quite strange to me, but I found a solution.

I changed the optimizer from optimizer = 'adam' to optimizer = 'rmsprop', and the NN started to converge as expected in the case of negative targets. I ran simulations in R with a simple NN and the optim function and got the same result.
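
For reference, here is a minimal sketch of the kind of simulation I mean: a single linear layer plus softmax, fitted with optim on one example, minimizing the same cross-entropy as above. It is a reconstruction for illustration (with a small 1e-12 added inside the log so the loss stays finite), not the exact code I ran:

    set.seed(1)
    x <- rnorm(11)                                  # one example, 11 features

    softmax <- function(z) { e <- exp(z - max(z)); e / sum(e) }

    xent <- function(w_vec, label) {
      W <- matrix(w_vec, nrow = 11, ncol = 3)       # single linear layer, 3 classes
      p <- softmax(as.numeric(x %*% W))
      -sum(label * log(p + 1e-12))                  # essentially crossentfunc from above
    }

    # one-hot target: the fitted probabilities concentrate on that class
    fit_pos <- optim(rep(0, 33), xent, label = c(1, 0, 0), method = "BFGS")
    softmax(as.numeric(x %*% matrix(fit_pos$par, 11, 3)))

    # target with a negative entry: the exact loss reduces to log(p3), which is
    # unbounded below (the eps only keeps optim from hitting -Inf), so what you
    # get back depends heavily on the optimizer
    fit_neg <- optim(rep(0, 33), xent, label = c(0, 0, -1), method = "BFGS")
    softmax(as.numeric(x %*% matrix(fit_neg$par, 11, 3)))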

It looks like adam (and SGD) may be buggy, or at least behave unexpectedly, for this kind of multinomial classification. I also got stuck on the fact that those optimizers did not converge to a perfect solution on even a single example, while rmsprop does. Be aware!

Alexey Burnakov