I am looking to run a very large grid search over different neural network configurations. In its fullness this would be impractical to run on my current hardware. I am aware that there may be superior techniques to a naive grid search (e.g. random search, Bayesian optimization), but my question is about what reasonable assumptions we can make about what to include in the grid in the first place. Specifically, in my case I am looking to run a grid search over the following hyperparameters (a quick sketch of the resulting grid follows the list):
- A: number of hidden layers
- B: size of hidden layer
- C: activation function
- D: L1 regularization strength
- E: L2 regularization strength
- F: dropout rate
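To make the scale concrete, here is a minimal sketch of the full A-F grid using only the standard library. All candidate values below are made up purely for illustration, not a recommendation:

```python
from itertools import product

# Hypothetical candidate values for each axis (placeholders only).
grid = {
    "n_hidden_layers": [1, 2, 3, 4],        # A
    "layer_size":      [32, 64, 128, 256],  # B
    "activation":      ["relu", "tanh"],    # C
    "l1":              [0.0, 1e-5, 1e-4],   # D
    "l2":              [0.0, 1e-5, 1e-4],   # E
    "dropout":         [0.0, 0.2, 0.5],     # F
}

# Every combination of one value per axis.
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(configs))  # 4 * 4 * 2 * 3 * 3 * 3 = 864 training runs
```

Even with these small toy lists the full grid is 864 training runs, which is what motivates pruning it up front.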
One idea I had is to (1) run a grid search over A-C, (2) select the configuration c with the lowest error (e.g. MSE) against the test datasets, and (3) run a separate grid search over D-F with c fixed, to identify the most appropriate regularization strategy. A minimal sketch of this two-stage procedure follows.
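Here is a sketch of the two-stage search, assuming a placeholder evaluate(config) that stands in for "train a network with this configuration and return its error"; the candidate values are again hypothetical:

```python
import random
from itertools import product

def evaluate(config: dict) -> float:
    """Placeholder: train a network with `config` and return its MSE.
    Swap in real training/evaluation code for your framework here."""
    return random.random()  # stand-in score so the sketch runs end to end

# Stage 1: search the architecture axes A-C with regularization off.
arch_scores = {
    (layers, size, act): evaluate({
        "n_hidden_layers": layers, "layer_size": size, "activation": act,
        "l1": 0.0, "l2": 0.0, "dropout": 0.0,
    })
    for layers, size, act in product([1, 2, 3, 4],
                                     [32, 64, 128, 256],
                                     ["relu", "tanh"])
}
best_arch = min(arch_scores, key=arch_scores.get)  # configuration c

# Stage 2: fix c and search the regularization axes D-F.
layers, size, act = best_arch
reg_scores = {
    (l1, l2, drop): evaluate({
        "n_hidden_layers": layers, "layer_size": size, "activation": act,
        "l1": l1, "l2": l2, "dropout": drop,
    })
    for l1, l2, drop in product([0.0, 1e-5, 1e-4],
                                [0.0, 1e-5, 1e-4],
                                [0.0, 0.2, 0.5])
}
best_reg = min(reg_scores, key=reg_scores.get)
```

With the toy grids above this is 32 + 27 = 59 training runs instead of 864 for the full grid, which is the appeal of staging the search.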
Is this a sensible approach to take in this case, or could I, in theory, end up with a lower final error (i.e. after regularization) by using a network configuration that showed a higher error in the first grid search (over A-C)?