I have multilayer perceptron (MLP) code using Theano/Lasagne that uses multiple cores correctly when I run it on a small dataset.
But when I run the same code over a much larger dataset and watch CPU utilisation in htop, I don't see it parallelise. It spawns the number of threads defined in my .theanorc file (htop lists them as separate processes), which looks like:
[global]
OMP_NUM_THREADS=15
openmp=True
floatX = float32
[blas]
ldflags=-L/usr/lib/ -lblas -lgfortran
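To rule out Theano silently ignoring this file, the resolved settings can be printed at runtime (a minimal sketch; these are, to my knowledge, the actual Theano config attributes):

import theano

# Settings Theano resolved from .theanorc / THEANO_FLAGS at import time.
print(theano.config.openmp)        # expect: True
print(theano.config.floatX)        # expect: float32
print(theano.config.blas.ldflags)  # expect: the ldflags line from .theanorc

Theano also ships theano/misc/check_blas.py, which benchmarks a large GEMM against the linked BLAS.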
Most of the time (~90%), though, only one of the spawned threads is doing any work (utilisation does spike across several cores for short stretches).
My guess is that one of the operations does not use multiple cores while the others do, because on the small dataset all cores are busy most of the time. All the heavy operations are matrix multiplications (both sparse and dense), so I don't understand why some of them wouldn't be multi-threaded.
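As a sanity check outside of Theano, here is the kind of asymmetry I mean (a minimal sketch; the sizes and density are arbitrary, not my real data). Dense GEMM goes through OpenBLAS and should keep all cores busy, while SciPy's sparse dot is, as far as I know, single-threaded:

import time
import numpy as np
import scipy.sparse as sp

n = 4000
a = np.random.rand(n, n).astype(np.float32)
b = np.random.rand(n, n).astype(np.float32)
s = sp.rand(n, n, density=0.01, format='csr', dtype=np.float32)

t0 = time.time()
a.dot(b)   # dense GEMM -> OpenBLAS; htop should show all cores busy
print('dense dot: %.2fs' % (time.time() - t0))

t0 = time.time()
s.dot(b)   # sparse*dense in SciPy; htop should show one busy core
print('sparse dot: %.2fs' % (time.time() - t0))

If this snippet shows the same one-core pattern for the sparse product, that would point at the sparse multiplications rather than at the BLAS setup.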
This is the section of the code that is being run:
for n in xrange(self.n_epochs):
    x_batch, y_batch = self.X, Y_train
    l_train, acc_train = self.f_train(x_batch, y_batch, self.train_indices)
    l_val, acc_val = self.f_val(self.X, Y_dev, self.dev_indices)
    val_pred = self.f_predict(self.X, self.dev_indices)
    if acc_val > best_val_acc:
        best_val_loss = l_val
        best_val_acc = acc_val
        best_params = lasagne.layers.get_all_param_values(self.l_out)
        n_validation_down = 0
    else:
        # early stopping
        n_validation_down += 1
    logging.info('epoch ' + str(n) + ' ,train_loss ' + str(l_train) + ' ,acc ' + str(acc_train) + ' ,val_loss ' + str(l_val) + ' ,acc ' + str(acc_val) + ',best_val_acc ' + str(best_val_acc))
    if n_validation_down > self.early_stopping_max_down:
        logging.info('validation results went down. early stopping ...')
        break
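To pin down which op is holding the single core, Theano's per-op profiler seems like the right tool. A minimal self-contained sketch (the toy graph below is only a stand-in for my real f_train, which I haven't reproduced here):

import numpy as np
import theano
import theano.tensor as T

# Toy stand-in for f_train: one dense dot with per-op profiling enabled.
x = T.fmatrix('x')
w = theano.shared(np.random.rand(2000, 2000).astype(np.float32), name='w')
f = theano.function([x], T.dot(x, w), profile=True)

f(np.random.rand(2000, 2000).astype(np.float32))
f.profile.summary()  # per-op timings: shows whether Gemm or a sparse op dominates

Running the real script with THEANO_FLAGS=profile=True should give the same breakdown for f_train, f_val and f_predict without changing any code.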
My numpy/BLAS information (output of numpy.show_config()):
lapack_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    define_macros = [('HAVE_CBLAS', None)]
    language = c
blas_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    define_macros = [('HAVE_CBLAS', None)]
    language = c
openblas_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    define_macros = [('HAVE_CBLAS', None)]
    language = c
blis_info:
    NOT AVAILABLE
openblas_lapack_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    define_macros = [('HAVE_CBLAS', None)]
    language = c
lapack_mkl_info:
    NOT AVAILABLE
blas_mkl_info:
    NOT AVAILABLE
Note that the input matrices are sparse, and I'm doing some extra sparse multiplications within custom layers using theano.sparse.dot (imported as S).
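For completeness, a minimal reproduction of that sparse path (shapes and density are made up, not my real data). As far as I can tell, Theano's sparse dot compiles to a plain C op without OpenMP support, which would explain the single busy core:

import numpy as np
import scipy.sparse as sp
import theano
import theano.sparse as S
import theano.tensor as T

# Sparse (CSR) times dense, the same pattern as in my custom layers.
x = S.csr_matrix('x', dtype='float32')
w = T.fmatrix('w')
f = theano.function([x, w], S.dot(x, w))

x_val = sp.rand(5000, 5000, density=0.01, format='csr', dtype=np.float32)
w_val = np.random.rand(5000, 500).astype(np.float32)
f(x_val, w_val)  # watch htop here: I expect a single busy core

If this toy case also pins a single core while the dense check above saturates all of them, the sparse products are presumably the serial ~90% of each epoch.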