I am performing nested cross-validation for model selection and performance estimation for a set of recurrent neural networks with different architectures and parameters using Keras and Theano, which are set up to run on a AWS P2 instance which has a Tesla K80 GPU with CUDA and cuDNN installed/enabled.
To perform model selection, I compare 30 models sampled from the parameter space using
param_grid = {
'nb_hidden_layers': [1, 2, 3],
'dropout_frac': [0.15, 0.20],
'output_activation': ['sigmoid', 'softmax'],
'optimization': ['Adedelta', 'RMSprop', 'Adam'],
'learning_rate': [0.001, 0.005, 0.010],
'batch_size': [64, 100, 150, 200],
'nb_epoch': [10, 15, 20],
'perform_batchnormalization': [True, False]
}
params_list = list(ParameterSampler(param_grid, n_iter = 30))
I then construct a RNN model using the function NeuralNetworkClassifier()
defined below
def NeuralNetworkClassifier(params, units_in_hidden_layer = [50, 75, 100, 125, 150]):
nb_units_in_hidden_layers = np.random.choice(units_in_hidden_layer, size = params['nb_hidden_layers'], replace = False)
layers = [8] # number of features in every week
layers.extend(nb_units_in_hidden_layers)
layers.extend([1]) # node identifying quit/stay
model = Sequential()
# constructing all layers up to, but not including, the penultimate one
layer_idx = -1 # this ensures proper generalization nb_hidden_layers = 1 (for which the loop below will never run)
for layer_idx in range(len(layers) - 3):
model.add(LSTM(input_dim = layers[layer_idx], output_dim = layers[layer_idx + 1], init = 'he_uniform', return_sequences = True)) # all LSTM layers, up to and including the penultimate one, need return_sequences = True
if params['perform_batchnormalization'] == True:
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(params['dropout_frac']))
# constructing the penultimate layer
model.add(LSTM(input_dim = layers[layer_idx + 1], output_dim = layers[(layer_idx + 1) + 1], init = 'he_uniform', return_sequences = False)) # the last LSTM layer needs return_sequences = False
if params['perform_batchnormalization'] == True:
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(params['dropout_frac']))
# constructing the final layer
model.add(Dense(output_dim = layers[-1], init = 'he_normal'))
model.add(Activation(params['output_activation']))
if params['optimization'] == 'SGD':
optim = SGD()
optim.lr.set_value(params['learning_rate'])
elif params['optimization'] == 'RMSprop':
optim = RMSprop()
optim.lr.set_value(params['learning_rate'])
elif params['optimization'] == 'Adam':
optim = Adam()
elif params['optimization'] == 'Adedelta':
optim = Adadelta()
model.compile(loss = 'binary_crossentropy', optimizer = optim, metrics = ['precision'])
return model
which construct a RNN whose number of hidden layers is given by the parameter 'nb_hidden_layers'
in param_grid
and the number of hidden units in each layer is randomly sampled from the list [50, 75, 100, 125, 150]
. At the end, this function compile
s the model and returns it.
During the nested cross-validation (CV), the inner loop (which runs IN
times) compares the performance of the 30 randomly selected model. After this step, I pick the best-performing model in the outer loop and estimate its performance on a hold-out dataset; this scheme is repeated OUT
times. Therefore, I am compile
ing a RNN model OUT
xIN
x30 times, and this takes an extremely long time; for example, when OUT=4
and IN=3
, my method takes between 6 to 7 hours to finish.
I see that the GPU is being used sporadically (but the GPU usage never goes above 40%); however, most of the time, it is the CPU that is being used. My (uneducated) guess is that compile
is being done on the CPU many many times and takes the bulk of the computing time, whereas model fitting and predicting are done on the GPU and takes a short time.
My questions:
- Is there a way to remedy this situation?
- Is
compile
actually done on the CPU? - How do people do nested CV to select the best RNN architecture?
- Is it reasonable for me to perform this scheme on the production server? Do you suggest I do one big nested CV, that might take 24 hours, to select the best performing model and just use that one model afterwards on the production server?
Thank you all.