
I am performing nested cross-validation for model selection and performance estimation for a set of recurrent neural networks with different architectures and parameters, using Keras and Theano. The models are set up to run on an AWS P2 instance, which has a Tesla K80 GPU with CUDA and cuDNN installed/enabled.

To perform model selection, I compare 30 models sampled from the parameter space using

from sklearn.model_selection import ParameterSampler

param_grid = {
             'nb_hidden_layers': [1, 2, 3],
             'dropout_frac': [0.15, 0.20],
             'output_activation': ['sigmoid', 'softmax'],
             'optimization': ['Adadelta', 'RMSprop', 'Adam'],
             'learning_rate': [0.001, 0.005, 0.010],
             'batch_size': [64, 100, 150, 200],
             'nb_epoch': [10, 15, 20],
             'perform_batchnormalization': [True, False]
             }
params_list = list(ParameterSampler(param_grid, n_iter = 30))

I then construct an RNN model using the function NeuralNetworkClassifier() defined below

import numpy as np
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation
from keras.layers.recurrent import LSTM
from keras.layers.normalization import BatchNormalization
from keras.optimizers import SGD, RMSprop, Adam, Adadelta

def NeuralNetworkClassifier(params, units_in_hidden_layer = [50, 75, 100, 125, 150]):
    nb_units_in_hidden_layers = np.random.choice(units_in_hidden_layer, size = params['nb_hidden_layers'], replace = False)

    layers = [8]    # number of features in every week
    layers.extend(nb_units_in_hidden_layers)
    layers.extend([1])  # node identifying quit/stay

    model = Sequential()

    # constructing all layers up to, but not including, the penultimate one
    layer_idx = -1  # this ensures proper handling when nb_hidden_layers = 1 (in which case the loop below never runs)
    for layer_idx in range(len(layers) - 3):
        model.add(LSTM(input_dim = layers[layer_idx], output_dim = layers[layer_idx + 1], init = 'he_uniform', return_sequences = True))    # all LSTM layers, up to and including the penultimate one, need return_sequences = True
        if params['perform_batchnormalization'] == True:
            model.add(BatchNormalization())
            model.add(Activation('relu'))
        model.add(Dropout(params['dropout_frac']))
    # constructing the penultimate layer
    model.add(LSTM(input_dim = layers[layer_idx + 1], output_dim = layers[(layer_idx + 1) + 1], init = 'he_uniform', return_sequences = False)) # the last LSTM layer needs return_sequences = False
    if params['perform_batchnormalization'] == True:
        model.add(BatchNormalization())
        model.add(Activation('relu'))
    model.add(Dropout(params['dropout_frac']))
    # constructing the final layer
    model.add(Dense(output_dim = layers[-1], init = 'he_normal'))
    model.add(Activation(params['output_activation']))

    if params['optimization'] == 'SGD':
        optim = SGD()
        optim.lr.set_value(params['learning_rate'])
    elif params['optimization'] == 'RMSprop':
        optim = RMSprop()
        optim.lr.set_value(params['learning_rate'])
    elif params['optimization'] == 'Adam':
        optim = Adam()
    elif params['optimization'] == 'Adadelta':
        optim = Adadelta()

    model.compile(loss = 'binary_crossentropy', optimizer = optim, metrics = ['precision'])

    return model

which constructs an RNN whose number of hidden layers is given by the parameter 'nb_hidden_layers' in param_grid, and whose number of hidden units in each layer is randomly sampled from the list [50, 75, 100, 125, 150]. At the end, this function compiles the model and returns it.

During the nested cross-validation (CV), the inner loop (which runs IN times) compares the performance of the 30 randomly selected models. After this step, I pick the best-performing model in the outer loop and estimate its performance on a hold-out dataset; this scheme is repeated OUT times. Therefore, I am compiling an RNN model OUT x IN x 30 times, and this takes an extremely long time; for example, when OUT=4 and IN=3, my method takes 6 to 7 hours to finish.
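Schematically, the nested CV looks roughly like this (a simplified sketch, not my exact code: X and y are numpy arrays of sequences and labels, and score_model stands in for my scoring helper):

from sklearn.model_selection import KFold

OUT, IN = 4, 3
outer_cv = KFold(n_splits = OUT, shuffle = True)
for train_val_idx, test_idx in outer_cv.split(X):                      # outer loop, runs OUT times
    inner_cv = KFold(n_splits = IN, shuffle = True)
    inner_scores = {i: [] for i in range(len(params_list))}
    for train_idx, val_idx in inner_cv.split(X[train_val_idx]):        # inner loop, runs IN times
        for i, params in enumerate(params_list):                       # 30 candidate models
            model = NeuralNetworkClassifier(params)                    # <-- a model is built and compiled here every time
            model.fit(X[train_val_idx][train_idx], y[train_val_idx][train_idx],
                      batch_size = params['batch_size'], nb_epoch = params['nb_epoch'], verbose = 0)
            inner_scores[i].append(score_model(model, X[train_val_idx][val_idx], y[train_val_idx][val_idx]))
    # pick the best candidate from the inner CV and estimate its performance on the held-out outer fold
    best_params = params_list[max(inner_scores, key = lambda i: np.mean(inner_scores[i]))]
    best_model = NeuralNetworkClassifier(best_params)
    best_model.fit(X[train_val_idx], y[train_val_idx],
                   batch_size = best_params['batch_size'], nb_epoch = best_params['nb_epoch'], verbose = 0)
    outer_score = score_model(best_model, X[test_idx], y[test_idx])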

I see that the GPU is being used sporadically, but its usage never goes above 40%; most of the time, it is the CPU that is being used. My (uneducated) guess is that compile is being run on the CPU many, many times and takes the bulk of the computing time, whereas model fitting and prediction are done on the GPU and take a short time.

My questions:

  1. Is there a way to remedy this situation?
  2. Is compile actually done on the CPU?
  3. How do people do nested CV to select the best RNN architecture?
  4. Is it reasonable for me to perform this scheme on the production server? Do you suggest I do one big nested CV, which might take 24 hours, to select the best-performing model and then just use that one model on the production server afterwards?

Thank you all.

darXider
  • The comment by nikicc [here](https://github.com/fchollet/keras/issues/2689) suggests `.compile()`ing once during the very first fold and reusing the initial weights for the remaining folds in cross-validation. Trying this has given me a big speed boost. – darXider Jan 17 '17 at 16:26
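A rough sketch of that compile-once idea (get_weights and set_weights are standard Keras calls; folds, X, and y are placeholders for the CV split and data):

model = NeuralNetworkClassifier(params)   # built and compiled only once
initial_weights = model.get_weights()     # snapshot of the freshly initialized weights

for train_idx, val_idx in folds:          # 'folds' stands in for whichever CV split is being run
    model.set_weights(initial_weights)    # reset to the initial state instead of rebuilding/recompiling
    model.fit(X[train_idx], y[train_idx],
              batch_size = params['batch_size'], nb_epoch = params['nb_epoch'], verbose = 0)
    score = model.evaluate(X[val_idx], y[val_idx], verbose = 0)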

1 Answer


I can't answer all your questions, but I still hope this helps.

Compilation is done on the CPU because it mainly consists of symbolic graph operations and code generation. To make things worse, Theano graph optimization is implemented in pure Python, which adds overhead compared to a C/C++ implementation.

To improve Theano compilation time (at the cost of runtime performance):

Use less aggressive optimization

In /home/ec2-user/.theanorc, add the line:

optimizer = fast_compile

Or disable optimization entirely with:

optimizer = None
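Note that Theano reads .theanorc as an INI-style file, so these flags go under the [global] section. A minimal example (the device and floatX lines are just typical GPU settings, not required for this fix):

[global]
device = gpu
floatX = float32
# use "optimizer = None" to disable graph optimization entirely
optimizer = fast_compile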

Precompile some blocks

If there are common blocks shared among your models, you can precompile them with theano.OpFromGraph.

You can't do this in Keras alone, though.
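A toy sketch of the idea in plain Theano (the sub-graph here is arbitrary; wiring such an op back into Keras layers is exactly the part Keras does not support out of the box):

import theano
import theano.tensor as T

# wrap a shared sub-graph as a reusable op
x = T.matrix('x')
w = T.matrix('w')
block = theano.OpFromGraph([x, w], [T.tanh(T.dot(x, w))])

# the wrapped op can then be dropped into several larger graphs
a = T.matrix('a')
b = T.matrix('b')
f = theano.function([a, b], block(a, b) * 2)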

Switch framework

Keras supports the TensorFlow backend. Compared to Theano, TensorFlow works more like a VM than a compiler. TF typically runs slower than Theano but compiles much faster.
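Switching needs no code changes: set "backend": "tensorflow" in ~/.keras/keras.json, or override it for a single run with an environment variable (your_script.py is a placeholder for your training script):

KERAS_BACKEND=tensorflow python your_script.py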

Kh40tiK
  • Thank you for your help. I will try the first suggestion. I can't switch from Theano, though (I have been asked to use Theano). – darXider Jan 16 '17 at 17:58