Usually an activation layer does not go right after the input: the activation should be applied once the first layer's computation is done.
You can still imitate the XOR function with your old code, but it needs a few tweaks:

- You are right that you need to initialize the weights. Which initial weights are best is a long-running discussion in the Deep Learning community, but in my experience Xavier initialization works well.
- If you want to use softmax, you need to change the number of units in the last FullyConnected layer to 2, because you have 2 classes: 0 and 1.

After making these two changes, plus a few minor optimizations such as removing the transposition of the matrix, we get the following code:
library(mxnet)

# XOR truth table: the first two columns are the inputs, the third is the label
train = matrix(c(0, 0, 0,
                 0, 1, 1,
                 1, 0, 1,
                 1, 1, 0),
               nrow = 4,
               ncol = 3,
               byrow = TRUE)

train.x = train[, -3]  # inputs
train.y = train[, 3]   # labels

# Network: input -> fc1 -> relu -> fc2 -> relu -> fc3 -> softmax
data <- mx.symbol.Variable("data")
fc1 <- mx.symbol.FullyConnected(data, name = "fc1", num_hidden = 2)
act1 <- mx.symbol.Activation(fc1, name = "relu1", act_type = "relu")
fc2 <- mx.symbol.FullyConnected(act1, name = "fc2", num_hidden = 3)
act2 <- mx.symbol.Activation(fc2, name = "relu2", act_type = "relu")
fc3 <- mx.symbol.FullyConnected(act2, name = "fc3", num_hidden = 2)  # 2 units: one per class
softmax <- mx.symbol.Softmax(fc3, name = "sm")

mx.set.seed(0)  # fix mxnet's RNG so the run is reproducible

model <- mx.model.FeedForward.create(
  softmax,
  X = train.x,
  y = train.y,
  num.round = 50,
  array.layout = "rowmajor",  # rows are samples, so train.x needs no transposing
  learning.rate = 0.1,
  momentum = 0.99,
  eval.metric = mx.metric.accuracy,
  initializer = mx.init.Xavier(rnd_type = "uniform", factor_type = "avg", magnitude = 3),
  epoch.end.callback = mx.callback.log.train.metric(100))

predict(model, train.x, array.layout = "rowmajor")
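(A version note: mx.symbol.Softmax comes from older mxnet builds. If your build no longer has it, the renamed equivalent is, as far as I know, mx.symbol.SoftmaxOutput:

softmax <- mx.symbol.SoftmaxOutput(fc3, name = "sm")

The rest of the code stays the same.)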
We get the following results:
Start training with 1 devices
[1] Train-accuracy=NaN
[2] Train-accuracy=0.75
[3] Train-accuracy=0.5
[4] Train-accuracy=0.5
[5] Train-accuracy=0.5
[6] Train-accuracy=0.5
[7] Train-accuracy=0.5
[8] Train-accuracy=0.5
[9] Train-accuracy=0.5
[10] Train-accuracy=0.75
[11] Train-accuracy=0.75
[12] Train-accuracy=0.75
[13] Train-accuracy=0.75
[14] Train-accuracy=0.75
[15] Train-accuracy=0.75
[16] Train-accuracy=0.75
[17] Train-accuracy=0.75
[18] Train-accuracy=0.75
[19] Train-accuracy=0.75
[20] Train-accuracy=0.75
[21] Train-accuracy=0.75
[22] Train-accuracy=0.5
[23] Train-accuracy=0.5
[24] Train-accuracy=0.5
[25] Train-accuracy=0.75
[26] Train-accuracy=0.75
[27] Train-accuracy=0.75
[28] Train-accuracy=0.75
[29] Train-accuracy=0.75
[30] Train-accuracy=0.75
[31] Train-accuracy=0.75
[32] Train-accuracy=0.75
[33] Train-accuracy=0.75
[34] Train-accuracy=0.75
[35] Train-accuracy=0.75
[36] Train-accuracy=0.75
[37] Train-accuracy=0.75
[38] Train-accuracy=0.75
[39] Train-accuracy=1
[40] Train-accuracy=1
[41] Train-accuracy=1
[42] Train-accuracy=1
[43] Train-accuracy=1
[44] Train-accuracy=1
[45] Train-accuracy=1
[46] Train-accuracy=1
[47] Train-accuracy=1
[48] Train-accuracy=1
[49] Train-accuracy=1
[50] Train-accuracy=1
>
> predict(model,train.x,array.layout="rowmajor")
[,1] [,2] [,3] [,4]
[1,] 0.9107883 2.618128e-06 6.384078e-07 0.9998743534
[2,] 0.0892117 9.999974e-01 9.999994e-01 0.0001256234
The output of softmax is interpreted as the probability of belonging to each class; it is not a hard "0" or "1" value like the one you would get from regular arithmetic. The answer means the following:
- In case "0 and 0": probability of class "0" = 0.9107883 and of class "1" = 0.0892117, meaning it is 0
- In case "0 and 1": probability of class "0" = 2.618128e-06 and of class "1" = 9.999974e-01, meaning it is 1 (probability of 1 is much higher)
- In case "1 and 0": probability of class "0" = 6.384078e-07 and of class "1" = 9.999994e-01 (probability of 1 is much higher)
- In case "1 and 1": probability of class "0" = 0.9998743534 and of class "1" = 0.0001256234, meaning it is 0.