
We're training a network for a recommender system on triplets. The core code of the `fit` method is as follows:

for e in range(epochs):
    start = time.time()

    cumulative_loss = 0

    for i, batch in enumerate(train_iterator):
        # Forward pass (recorded so gradients can be computed).
        with autograd.record():
            output = self.model(batch.data[0])
            loss = loss_fn(output, batch.label[0])

        # Calculate gradients
        loss.backward()
        # Update parameters of the network.
        trainer_fn.step(batch_size)
        # Calculate training metrics. Sum losses of every batch.
        cumulative_loss += nd.mean(loss).asscalar()
    train_iterator.reset()

where `train_iterator` is a custom iterator class that inherits from `mx.io.DataIter` and returns the data (the triplets) already in the appropriate context, as:

        data = [mx.nd.array(data[:, :-1], self.ctx, dtype=np.int)]
        labels = [mx.nd.array(data[:, -1], self.ctx)]
        return mx.io.DataBatch(data, labels)
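
For illustration, here is a minimal sketch of what such an iterator might look like (the class name `TripletIterator` and its internal fields are illustrative assumptions; only the `DataBatch` construction above is the actual code):

import numpy as np
import mxnet as mx

class TripletIterator(mx.io.DataIter):
    """Hypothetical iterator over (user, item, label) triplets stored as a NumPy array."""
    def __init__(self, triplets, batch_size, ctx):
        super(TripletIterator, self).__init__(batch_size)
        self._triplets = triplets   # shape (N, 3)
        self._ctx = ctx
        self._cursor = 0

    def reset(self):
        self._cursor = 0

    def next(self):
        if self._cursor >= len(self._triplets):
            raise StopIteration
        chunk = self._triplets[self._cursor:self._cursor + self.batch_size]
        self._cursor += self.batch_size
        # Same construction as above: indices and labels are created
        # directly in the target context (e.g. mx.gpu(0)).
        data = [mx.nd.array(chunk[:, :-1], self._ctx, dtype=np.int)]
        labels = [mx.nd.array(chunk[:, -1], self._ctx)]
        return mx.io.DataBatch(data, labels)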

`self.model.initialize(ctx=mx.gpu(0))` was also called before running the `fit` method, and `loss_fn = gluon.loss.L1Loss()`.
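
For context, the surrounding setup is roughly the following sketch (the optimizer and learning rate passed to `trainer_fn` are illustrative placeholders, not necessarily the ones actually used):

import mxnet as mx
from mxnet import gluon

ctx = mx.gpu(0)
self.model.initialize(ctx=ctx)
loss_fn = gluon.loss.L1Loss()
# trainer_fn drives the parameter updates in the loop above;
# optimizer and learning rate are placeholders here.
trainer_fn = gluon.Trainer(self.model.collect_params(), 'adam',
                           {'learning_rate': 1e-3})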

The trouble is that `nvidia-smi` reports that the process is correctly allocated on the GPU. However, running `fit` on the GPU is not much faster than running it on the CPU. In addition, increasing `batch_size` from 50,000 to 500,000 increases the time per batch by a factor of 10, which I was not expecting given GPU parallelization.

Specifically, for a 50k batch:

* `output = self.model(batch.data[0])` takes 0.03 seconds on the GPU vs. 0.08 on the CPU.
* `loss.backward()` takes 0.11 seconds on the GPU vs. 0.39 on the CPU.

Both were measured with `nd.waitall()` in place, to prevent MXNet's asynchronous execution from distorting the timings.
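
Roughly, the timing pattern is the following (an illustrative sketch, reusing `batch`, `self.model` and `loss_fn` from above; the exact instrumentation may differ):

import time
from mxnet import nd, autograd

nd.waitall()                            # flush any pending asynchronous work
start = time.time()
with autograd.record():
    output = self.model(batch.data[0])
    nd.waitall()                        # wait for the forward pass to finish
    forward_time = time.time() - start
    loss = loss_fn(output, batch.label[0])

start = time.time()
loss.backward()
nd.waitall()                            # wait for the backward pass to finish
backward_time = time.time() - start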

In addition, very similar code running on plain MXNet took less than 0.03 seconds for the corresponding part, which means a full epoch goes from slightly above one minute with MXNet up to 15 minutes with Gluon.

Any ideas on what might be happening here?

Thanks in advance!


1 Answer


The problem is in the following line:

cumulative_loss += nd.mean(loss).asscalar()

When you call `asscalar()`, MXNet has to make an implicit synchronous call to copy the result from the GPU to the CPU: it is essentially the same as calling `nd.waitall()`. Since you do it on every iteration, it forces a sync on every iteration, degrading your wall-clock time significantly.

What you can do instead is keep and update `cumulative_loss` on the GPU and copy it to the CPU only when you actually need to display it: every N iterations, or once the epoch is done, depending on how long each iteration takes.
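
A minimal sketch of that change, using the loop from the question (the reporting interval of 500 iterations is just an example):

import mxnet as mx
from mxnet import nd, autograd

# Keep the accumulator on the GPU so no device-to-host copy is triggered.
cumulative_loss = nd.zeros(1, ctx=mx.gpu(0))

for i, batch in enumerate(train_iterator):
    with autograd.record():
        output = self.model(batch.data[0])
        loss = loss_fn(output, batch.label[0])
    loss.backward()
    trainer_fn.step(batch_size)

    # No asscalar() here: the mean stays on the GPU, so no per-iteration sync.
    cumulative_loss += nd.mean(loss)

    # Synchronize (copy to CPU) only every N iterations.
    if (i + 1) % 500 == 0:
        print('iteration %d, cumulative loss %.4f' % (i + 1, cumulative_loss.asscalar()))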

Sergei
  • Thanks for taking the time to provide your answer. However, I just tried that, updating the loss only every 500 iterations, and it takes up some 8 GB of GPU memory, and the time per epoch is still ~900 seconds, whereas the MXNet implementation took some 90 seconds, with 295 MB... We're talking about an order of magnitude here. There's still something fishy I'm missing. I read somewhere about `loss.hybridize()`. Do you think that might help? Also, `self.model.hybridize()` didn't seem to provide much improvement, even though I thought it would. – Germán Sanchis Apr 04 '19 at 19:41
  • Ok, so I finally got an important breakthrough. The former MXNet code was implemented using `mx.symbol.Embedding([...], sparse_grad=True)`. I included sparse gradients for the Embedding layer in the Gluon model and the time per epoch dropped from 900 seconds to 120. Still not the 90 seconds of the MXNet implementation, but at least it's not an order of magnitude away. – Germán Sanchis Apr 04 '19 at 19:53
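
For readers hitting the same issue, here is a rough sketch of the change described in the comment above, on a hypothetical two-embedding model (the class name, layer sizes and dot-product scoring are placeholders, not the actual model from the question):

from mxnet import gluon, nd

class TripletNet(gluon.Block):
    """Hypothetical recommender model scoring (user, item) pairs."""
    def __init__(self, num_users, num_items, emb_dim, **kwargs):
        super(TripletNet, self).__init__(**kwargs)
        with self.name_scope():
            # sparse_grad=True makes backward() produce row-sparse gradients,
            # so the optimizer only updates the embedding rows seen in the batch.
            self.user_emb = gluon.nn.Embedding(num_users, emb_dim, sparse_grad=True)
            self.item_emb = gluon.nn.Embedding(num_items, emb_dim, sparse_grad=True)

    def forward(self, x):
        # x holds (user_index, item_index) pairs, shape (batch_size, 2).
        u = self.user_emb(x[:, 0])
        v = self.item_emb(x[:, 1])
        return nd.sum(u * v, axis=1)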