I am trying to train a convolutional neural network on a set of images I want to classify, using MXNet's Gluon API. However, the same network and code sometimes produce wildly different results for the same data, and on occasion the program simply crashes and refuses to run.

Additional information: the images are all 131 x 131 px, with 176 training images and 40 test images per class (2 classes). I am confused as to why the same program, on the same data, sometimes runs to completion and otherwise crashes.

Here is my code:
Imports
    from __future__ import print_function
    import mxnet as mx
    import numpy as np
    from mxnet import nd, autograd, gluon
    import time
    mx.random.seed(1)
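Note that mx.random.seed(1) seeds MXNet's own generator only, not Python's random module or NumPy (and the shuffling in gluon's DataLoader can go through NumPy). A fuller seeding, as a sketch under the assumption that RNG state contributes to the run-to-run variation:

    import random

    # Seed every RNG source, not only MXNet's, so shuffle order and any
    # NumPy-based randomness are the same on every run
    random.seed(1)
    np.random.seed(1)
    mx.random.seed(1)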
Setting context
    ctx = mx.cpu()
Defining the transform function applied to each sample
    def transform(data, label):
        # HWC uint8 image -> CHW float32 tensor scaled to [0, 1]
        return nd.transpose(data.astype(np.float32), (2, 0, 1)) / 255, label
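As a sanity check on the transform, here is what it does to a synthetic image (a sketch; the array stands in for a decoded image, single-channel because the datasets below are loaded with flag 0):

    # The transform should turn an HWC uint8 image into a CHW float32
    # tensor with values in [0, 1]
    dummy = nd.ones((131, 131, 1), dtype='uint8') * 255
    img, lbl = transform(dummy, 0)
    print(img.shape, img.dtype, img.max().asscalar())  # -> (1, 131, 131), float32, 1.0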
Defining batch size and number of nodes in the output layer
    batch_size = 5
    num_outputs = 2
Load training and test data
    train_data = mx.gluon.data.DataLoader(
        mx.gluon.data.vision.ImageFolderDataset("/somepath/train", 0, transform),
        batch_size, shuffle=True)
    test_data = mx.gluon.data.DataLoader(
        mx.gluon.data.vision.ImageFolderDataset("/somepath/test", 0, transform),
        batch_size, shuffle=False)
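Given the intermittent crashes, one debugging step is to walk the loader once, before training, and confirm every image decodes to the expected shape (a sketch; assumes flag 0 yields single-channel images):

    # Iterate the whole training set once so any bad or odd-sized image
    # fails loudly here rather than partway through training
    for i, (d, l) in enumerate(train_data):
        assert d.shape[1:] == (1, 131, 131), (i, d.shape)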
Define CNN using gluon.nn
    neural_net = gluon.nn.Sequential()
    num_fc = 512
    with neural_net.name_scope():
        neural_net.add(gluon.nn.Conv2D(channels=20, kernel_size=5, activation='relu'))
        neural_net.add(gluon.nn.MaxPool2D(pool_size=2, strides=2))
        neural_net.add(gluon.nn.Conv2D(channels=50, kernel_size=5, activation='relu'))
        neural_net.add(gluon.nn.MaxPool2D(pool_size=2, strides=2))
        neural_net.add(gluon.nn.Flatten())
        neural_net.add(gluon.nn.Dense(num_fc, activation="relu"))
        neural_net.add(gluon.nn.Dense(num_outputs))
Initialize params, loss fn, and trainer object
    neural_net.collect_params().initialize(mx.init.Xavier(magnitude=2.24), ctx=ctx)
    cross_entropy = gluon.loss.SoftmaxCrossEntropyLoss()
    trainer = gluon.Trainer(neural_net.collect_params(), 'adadelta')
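The Dense layers' input sizes are bound lazily on the first forward pass. For 131 x 131 inputs the feature maps go 131 -> 127 (first 5x5 conv) -> 63 (pool) -> 59 (second conv) -> 29 (pool), so Flatten hands the first Dense layer 50 * 29 * 29 = 42050 features. A dummy forward pass (sketch) confirms the network accepts this input shape and produces one score per class:

    # One batch of zeros through the net binds all lazy shapes;
    # the output should be (batch_size, num_outputs)
    dummy_batch = nd.zeros((batch_size, 1, 131, 131), ctx=ctx)
    print(neural_net(dummy_batch).shape)  # (5, 2)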
Training loop
    total_time = 0
    for e in range(2):
        tick = time.time()
        for idx, (dpoint, label) in enumerate(train_data):
            data = dpoint.as_in_context(ctx)
            label = label.as_in_context(ctx)
            with autograd.record():
                output = neural_net(data)
                loss2 = cross_entropy(output, label)
            loss2.backward()
            trainer.step(data.shape[0])
        tock = time.time()
        print("Epoch %s. Took %s seconds to train" % (e, tock - tick))
        total_time += tock - tick
    print("Total training time: %s" % (total_time))
Measuring accuracy
    acc = mx.metric.Accuracy()
    for idx, (data, label) in enumerate(test_data):
        something = data.as_in_context(ctx)
        something_label = label.as_in_context(ctx)
        output2 = neural_net(something)
        predictions = nd.argmax(output2, axis=1)
        # Accuracy.update expects (labels, preds), in that order
        acc.update(something_label, predictions)
    print(acc.get()[-1])
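Finally, with only two classes an accuracy figure alone cannot show whether the network has collapsed to predicting a single class, which would also explain large swings between runs. A quick check (sketch):

    # Count how often each class is predicted on the test set; a
    # degenerate model predicts one class almost exclusively
    counts = nd.zeros(num_outputs, ctx=ctx)
    for d, l in test_data:
        preds = nd.argmax(neural_net(d.as_in_context(ctx)), axis=1)
        counts += nd.sum(nd.one_hot(preds, num_outputs), axis=0)
    print(counts.asnumpy())  # balanced predictions would be [40. 40.]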