Usually an activation layer does not go right after the input: the activation should be applied once the first layer's computation is done.
You can still imitate the XOR function with your old code, but it needs a few tweaks:

- You are right that you need to initialize the weights. Which initial weights are best is a long-running discussion in the Deep Learning community, but in my experience Xavier initialization works well.
- If you want to use softmax, you need to change the number of units in the last FullyConnected layer to 2, because you have 2 classes: 0 and 1.

After making these two changes, plus a few minor optimizations such as removing the transposition of the matrix, we get the following code:
library(mxnet)

# XOR truth table: the first two columns are the inputs, the third is the label
train = matrix(c(0, 0, 0,
                 0, 1, 1,
                 1, 0, 1,
                 1, 1, 0),
               nrow = 4,
               ncol = 3,
               byrow = TRUE)

train.x = train[, -3]  # inputs
train.y = train[, 3]   # labels

# Network: input -> fc1 -> relu -> fc2 -> relu -> fc3 -> softmax
data <- mx.symbol.Variable("data")
fc1 <- mx.symbol.FullyConnected(data, name = "fc1", num_hidden = 2)
act1 <- mx.symbol.Activation(fc1, name = "relu1", act_type = "relu")
fc2 <- mx.symbol.FullyConnected(act1, name = "fc2", num_hidden = 3)
act2 <- mx.symbol.Activation(fc2, name = "relu2", act_type = "relu")
fc3 <- mx.symbol.FullyConnected(act2, name = "fc3", num_hidden = 2)  # 2 units: one per class
softmax <- mx.symbol.Softmax(fc3, name = "sm")

mx.set.seed(0)  # fix mxnet's RNG so the run is reproducible

model <- mx.model.FeedForward.create(
  softmax,
  X = train.x,
  y = train.y,
  num.round = 50,
  array.layout = "rowmajor",  # rows are samples, so train.x needs no transposing
  learning.rate = 0.1,
  momentum = 0.99,
  eval.metric = mx.metric.accuracy,
  initializer = mx.init.Xavier(rnd_type = "uniform", factor_type = "avg", magnitude = 3),
  epoch.end.callback = mx.callback.log.train.metric(100))

predict(model, train.x, array.layout = "rowmajor")
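(A version note: mx.symbol.Softmax comes from older mxnet builds. If your build no longer has it, the renamed equivalent is, as far as I know, mx.symbol.SoftmaxOutput:

softmax <- mx.symbol.SoftmaxOutput(fc3, name = "sm")

The rest of the code stays the same.)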
We get the following results:
Start training with 1 devices
[1] Train-accuracy=NaN
[2] Train-accuracy=0.75
[3] Train-accuracy=0.5
[4] Train-accuracy=0.5
[5] Train-accuracy=0.5
[6] Train-accuracy=0.5
[7] Train-accuracy=0.5
[8] Train-accuracy=0.5
[9] Train-accuracy=0.5
[10] Train-accuracy=0.75
[11] Train-accuracy=0.75
[12] Train-accuracy=0.75
[13] Train-accuracy=0.75
[14] Train-accuracy=0.75
[15] Train-accuracy=0.75
[16] Train-accuracy=0.75
[17] Train-accuracy=0.75
[18] Train-accuracy=0.75
[19] Train-accuracy=0.75
[20] Train-accuracy=0.75
[21] Train-accuracy=0.75
[22] Train-accuracy=0.5
[23] Train-accuracy=0.5
[24] Train-accuracy=0.5
[25] Train-accuracy=0.75
[26] Train-accuracy=0.75
[27] Train-accuracy=0.75
[28] Train-accuracy=0.75
[29] Train-accuracy=0.75
[30] Train-accuracy=0.75
[31] Train-accuracy=0.75
[32] Train-accuracy=0.75
[33] Train-accuracy=0.75
[34] Train-accuracy=0.75
[35] Train-accuracy=0.75
[36] Train-accuracy=0.75
[37] Train-accuracy=0.75
[38] Train-accuracy=0.75
[39] Train-accuracy=1
[40] Train-accuracy=1
[41] Train-accuracy=1
[42] Train-accuracy=1
[43] Train-accuracy=1
[44] Train-accuracy=1
[45] Train-accuracy=1
[46] Train-accuracy=1
[47] Train-accuracy=1
[48] Train-accuracy=1
[49] Train-accuracy=1
[50] Train-accuracy=1
>
> predict(model,train.x,array.layout="rowmajor")
[,1] [,2] [,3] [,4]
[1,] 0.9107883 2.618128e-06 6.384078e-07 0.9998743534
[2,] 0.0892117 9.999974e-01 9.999994e-01 0.0001256234
The output of softmax is interpreted as the probability of belonging to each class; it is not a hard "0" or "1" value like the one you would get from regular arithmetic. The answer means the following:
- In case "0 and 0": probability of class "0" = 0.9107883 and of class "1" = 0.0892117, meaning it is 0
- In case "0 and 1": probability of class "0" = 2.618128e-06 and of class "1" = 9.999974e-01, meaning it is 1 (probability of 1 is much higher)
- In case "1 and 0": probability of class "0" = 6.384078e-07 and of class "1" = 9.999994e-01 (probability of 1 is much higher)
- In case "1 and 1": probability of class "0" = 0.9998743534 and of class "1" = 0.0001256234, meaning it is 0.