
I've implemented the "XOR problem" with CNTK (Python).

Currently it solves the problem only occasionally. How could I implement a more reliable network?

I guess the problem gets solved whenever the starting random weights are near optimal. I have tried `binary_cross_entropy` as the loss function, but it didn't improve. I tried `tanh` as the non-linear function, but that didn't work either. I have also tried many different combinations of the parameters `learning_rate`, `minibatch_size` and `num_minibatches_to_train`. Please help.

Thanks

# -*- coding: utf-8 -*-

import numpy as np
from cntk import *
import random
import pandas as pd

input_dim = 2
output_dim = 1

def generate_random_data_sample(sample_size, feature_dim, num_classes):
    # Cycle through the four rows of the XOR truth table
    # (feature_dim and num_classes are accepted but unused).
    Y = []
    X = []
    for i in range(sample_size):
        if i % 4 == 0:
            Y.append([0])
            X.append([1,1])
        elif i % 4 == 1:
            Y.append([0])
            X.append([0,0])
        elif i % 4 == 2:
            Y.append([1])
            X.append([1,0])
        else:
            Y.append([1])
            X.append([0,1])

    return np.array(X, dtype=np.float32), np.array(Y, dtype=np.float32)

def linear_layer(input_var, output_dim, scale=10):
    input_dim = input_var.shape[0]

    # Weights drawn uniformly from [-scale, scale]; bias initialised to zero.
    weight = parameter(shape=(input_dim, output_dim), init=uniform(scale=scale))
    bias = parameter(shape=(output_dim))

    return bias + times(input_var, weight)

def dense_layer(input_var, output_dim, nonlinearity,scale=10):
    l = linear_layer(input_var, output_dim,scale=scale)

    return nonlinearity(l)


# 2-2-1 network: one hidden layer with two sigmoid nodes and a sigmoid output.
feature = input(input_dim, np.float32)
h1 = dense_layer(feature, 2, sigmoid, scale=10)
z = dense_layer(h1, output_dim, sigmoid, scale=10)

label = input(1, np.float32)
loss = squared_error(z, label)
eval_error = squared_error(z, label)


learning_rate = 0.5
lr_schedule = learning_rate_schedule(learning_rate, UnitType.minibatch) 
learner = sgd(z.parameters, lr_schedule)
trainer = Trainer(z, (loss, eval_error), [learner])

def print_training_progress(trainer, mb, frequency, verbose=1):
    training_loss, eval_error = "NA", "NA"

    if mb % frequency == 0:
        training_loss = trainer.previous_minibatch_loss_average
        eval_error = trainer.previous_minibatch_evaluation_average
        if verbose: 
            print ("Minibatch: {0}, Loss: {1:.4f}, Error: {2:.2f}".format(mb, training_loss, eval_error))

    return mb, training_loss, eval_error

minibatch_size = 800
num_minibatches_to_train = 2000
training_progress_output_freq = 50

for i in range(0, num_minibatches_to_train):
    features, labels = generate_random_data_sample(minibatch_size, input_dim, output_dim)
    trainer.train_minibatch({feature : features, label : labels})
    batchsize, loss, error = print_training_progress(trainer, i, training_progress_output_freq, verbose=1)

# Evaluate the trained network on the last minibatch and print the
# unique (input, label, prediction) rows.
out = z
result = out.eval({feature : features})
a = pd.DataFrame(data=dict(
        query=[str(int(x[0]))+str(int(x[1])) for x in features],
        test=[int(l[0]) for l in labels],
        pred=[l[0] for l in result]))
print(pd.DataFrame.drop_duplicates(a[["query","test","pred"]]).sort_values(by="test"))
Vasco

4 Answers


I was able to improve stability by widening the hidden layer from 2 to 5 nodes with `h1 = dense_layer(feature, 5, sigmoid, scale=10)` and by increasing the learning rate to `learning_rate = 0.8`.

This improved stability, but it still got it wrong from time to time. Additionally, changing the loss to binary cross-entropy, `loss = binary_cross_entropy(z,label)`, substantially improved the chances of getting it right.

Before:

Minibatch: 1900, Loss: 0.1272, Error: 0.13
Minibatch: 1950, Loss: 0.1272, Error: 0.13
  query  test      pred
0    11     0  0.502307
1    00     0  0.043964
2    10     1  0.951571
3    01     1  0.498055

After:

Minibatch: 1900, Loss: 0.0041, Error: 0.00
Minibatch: 1950, Loss: 0.0040, Error: 0.00
  query  test      pred
0    11     0  0.006617
1    00     0  0.000529
2    10     1  0.997219
3    01     1  0.994183

Also, changing the scale from 10 to 1, as suggested by Davi, improved the convergence speed:

scale 10:

Minibatch: 1300, Loss: 0.0732, Error: 0.01
Minibatch: 1350, Loss: 0.0483, Error: 0.00

scale 1:

Minibatch: 500, Loss: 0.0875, Error: 0.01
Minibatch: 550, Loss: 0.0639, Error: 0.00

In conclusion, what was needed was (the combined changes are sketched below):

  1. change scale from 10 to 1 (with scale = 10 a stable solver needs substantially more iterations)
  2. widen the hidden layer from 2 to 5 nodes (this overcomes the issues caused by scale = 10, but occasionally needs more iterations)
  3. change the loss function from squared_error to binary_cross_entropy (it converges faster, i.e. it searches for the right weights more efficiently)
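
A minimal sketch of the three changes combined, assuming the rest of the original script stays exactly as posted (same variable names and CNTK 2.x functions as above):

# hidden layer widened from 2 to 5 nodes
h1 = dense_layer(feature, 5, sigmoid, scale=1)
z = dense_layer(h1, output_dim, sigmoid, scale=1)

# binary cross-entropy as the training loss; squared error kept for the
# reported evaluation metric
loss = binary_cross_entropy(z, label)
eval_error = squared_error(z, label)

# higher learning rate than the original 0.5
learning_rate = 0.8
lr_schedule = learning_rate_schedule(learning_rate, UnitType.minibatch)
learner = sgd(z.parameters, lr_schedule)
trainer = Trainer(z, (loss, eval_error), [learner])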
Vasco

I don't think you can really "solve" XOR by directly mapping input to output with some weight and bias. You will need at least one hidden layer (with at least two nodes) between them.
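
For illustration, a minimal two-layer model of the kind this answer describes can also be written with CNTK's higher-level Layers API (a sketch, not the asker's code):

from cntk import input_variable, sigmoid
from cntk.layers import Sequential, Dense

# A 2-2-1 topology: XOR needs at least one hidden layer between input and output.
feature = input_variable(2)
model = Sequential([Dense(2, activation=sigmoid),   # hidden layer, two nodes
                    Dense(1, activation=sigmoid)])  # output node
z = model(feature)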

Zp Bappi
  • I would say it tries to do that. XOR is not a linear problem that we can solve without a hidden layer between input and output. We can only solve OR, AND and NOT without any hidden layers, but we will definitely need one hidden layer to do XOR, XNOR, etc. There is no planar separation between the data points for XOR, and that's why you need a hidden layer to introduce the required non-linearity. [This](https://stackoverflow.com/a/6528548/6096144) answer explains it with a nice diagram. – Zp Bappi May 30 '17 at 04:46
  • I thought I was doing that. The first layer is `h1`; the second layer, which is also the last layer, is `z`. What do you mean by "tries to do that"? – Vasco May 30 '17 at 05:20
  • You only have one weight matrix and one bias, let's say W and B. So what your code is doing is `y = Input * W + B`. However, to implement XOR, the minimum you need is `h1 = sigmoid(Input * W1 + B1); h2 = sigmoid(Input * W2 + B2); y = sigmoid([h1 h2] * W + B)`, or, in CNTK, simply `model = Sequential([Dense(2), Dense(1)])`, with appropriate scaling and activation of course. – Zp Bappi May 30 '17 at 05:57
  • @Vasco sorry, my browser was not loading the full code. I see now that you are creating hidden layers using those functions. Sorry for the comment storm. :P – Zp Bappi May 30 '17 at 06:06

Changing the four instances of `scale=10` to `scale=1` seems to fix the script.
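
Concretely, the four instances are the two default arguments and the two call sites; a sketch of just the changed lines from the original script:

def linear_layer(input_var, output_dim, scale=1):                 # was scale=10
    input_dim = input_var.shape[0]
    # uniform(scale=1) keeps the initial weights in [-1, 1]
    weight = parameter(shape=(input_dim, output_dim), init=uniform(scale=scale))
    bias = parameter(shape=(output_dim))
    return bias + times(input_var, weight)

def dense_layer(input_var, output_dim, nonlinearity, scale=1):    # was scale=10
    return nonlinearity(linear_layer(input_var, output_dim, scale=scale))

h1 = dense_layer(feature, 2, sigmoid, scale=1)                    # was scale=10
z = dense_layer(h1, output_dim, sigmoid, scale=1)                 # was scale=10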

I made no other changes and was able to run it several successive times and get decent results with 2000 iterations. Of course, increasing the iterations (e.g. 20,000 or more) gives much better results.

Possibly the original range of -10 to 10 for initial weights was allowing occasional very large weights to saturate some neurons and interfere with training. This effect might be further accentuated by greedy learning rates.
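
A quick NumPy illustration of that saturation argument (not part of the original script): a sigmoid fed a large pre-activation is pinned near 1, and its gradient all but vanishes.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Pre-activations this large are plausible when weights start in [-10, 10].
for x in (0.5, 5.0, 10.0):
    s = sigmoid(x)
    # d(sigmoid)/dx = s * (1 - s): effectively zero once s saturates
    print("x={:5.1f}  sigmoid={:.6f}  gradient={:.6f}".format(x, s, s * (1 - s)))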

Also an XOR net is pretty sparse compared to the current trend for deep nets. It might be harder for a few saturated neurons to lock up training of a deep net - but maybe not impossible.

In the days of yore I seem to recall we often set initial weights to be relatively small and distributed around zero. Not sure what the theorists are recommending now.

Davi
  • Maybe some outputs would be helpful to understand "seems to fix the script", a sort of before/after. I feel that the library versions used might have an impact on the outputs. – LoneWanderer Mar 10 '19 at 02:29
  • OK, I added the outputs. Also tested with CNTK version 2.6, which is the latest. – Vasco Mar 24 '19 at 05:33

Running the script as provided by the original poster invariably yields results similar to this (only the tail end of the results is given here); this is the before:

...
Minibatch: 1900, Loss: 0.1266, Error: 0.13
Minibatch: 1950, Loss: 0.1266, Error: 0.13
  query  test      pred
0    11     0  0.501515
1    00     0  0.037678
2    10     1  0.497704
3    01     1  0.966931

I just reran this several times with similar results. Even increasing the iterations to 20,000 gives similar results. The script as originally constituted does not seem to produce a viable solution to the XOR problem: training does not converge to the XOR truth table, and the error and loss do not converge towards zero.

Changing the four instances of `scale=10` to `scale=1` invariably seems to produce a viable solution for the XOR problem. Typical results are below; this is the after:

...
Minibatch: 1900, Loss: 0.0129, Error: 0.01
Minibatch: 1950, Loss: 0.0119, Error: 0.01
  query  test      pred
0    11     0  0.115509
1    00     0  0.084174
2    10     1  0.891398
3    01     1  0.890891

Several re-runs produce a similar result. Training seems to converge towards the XOR truth table and error and loss converge toward zero. Increasing iterations to 20,000 yields the following typical result. Training now produces a viable XOR solution, and the script seems to be "fixed".

...
Minibatch: 19900, Loss: 0.0003, Error: 0.00
Minibatch: 19950, Loss: 0.0003, Error: 0.00
  query  test      pred
0    11     0  0.017013
1    00     0  0.015626
2    10     1  0.982118
3    01     1  0.982083

To be more precise, the proposed script change likely fixes the method used to set the weights' initial conditions. I'm fairly new to CNTK, so I don't know how baked-in using scale=10 might be. Since most of the examples I find for CNTK programs are for deep-net type problems, I suspect setting weight initial conditions with scale=10 could be related to the problem solutions most commonly posted on the net.

Finally, there were no changes (installs or updates) to the libraries on my system during the course of these tests, so the suggestion that library versions affect the results seems to have no basis in fact.

Davi
  • Interesting. Without changing the number of iterations it worked 3 out of 5 attempts, and with increased iterations it worked 5 out of 5 attempts. – Vasco Mar 24 '19 at 05:52