I am looking to run a very large grid search over different neural network configurations. In its fullness this would be impractical to run on my current hardware. I am aware that there may be superior techniques to a naive grid search (e.g. random search, Bayesian optimization), but my question is about what reasonable assumptions we can make about what to include in the grid in the first place. Specifically, in my case I am looking to run a grid search over the following hyperparameters (a quick sketch of the resulting grid follows the list):
- A: number of hidden layers
- B: size of hidden layer
- C: activation function
- D: L1 regularization strength
- E: L2 regularization strength
- F: dropout rate
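To make the scale concrete, here is a minimal sketch of the full A-F grid using only the standard library. All candidate values below are made up purely for illustration, not a recommendation:

```python
from itertools import product

# Hypothetical candidate values for each axis (placeholders only).
grid = {
    "n_hidden_layers": [1, 2, 3, 4],        # A
    "layer_size":      [32, 64, 128, 256],  # B
    "activation":      ["relu", "tanh"],    # C
    "l1":              [0.0, 1e-5, 1e-4],   # D
    "l2":              [0.0, 1e-5, 1e-4],   # E
    "dropout":         [0.0, 0.2, 0.5],     # F
}

# Every combination of one value per axis.
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(configs))  # 4 * 4 * 2 * 3 * 3 * 3 = 864 training runs
```

Even with these small toy lists the full grid is 864 training runs, which is what motivates pruning it up front.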
One idea I had is to (1) run a grid search over A-C, (2) select the configuration c with the lowest error (e.g. MSE) against the test datasets, and (3) run a separate grid search over D-F with c fixed, to identify the most appropriate regularization strategy. A minimal sketch of this two-stage procedure follows.
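Here is a sketch of the two-stage search, assuming a placeholder evaluate(config) that stands in for "train a network with this configuration and return its error"; the candidate values are again hypothetical:

```python
import random
from itertools import product

def evaluate(config: dict) -> float:
    """Placeholder: train a network with `config` and return its MSE.
    Swap in real training/evaluation code for your framework here."""
    return random.random()  # stand-in score so the sketch runs end to end

# Stage 1: search the architecture axes A-C with regularization off.
arch_scores = {
    (layers, size, act): evaluate({
        "n_hidden_layers": layers, "layer_size": size, "activation": act,
        "l1": 0.0, "l2": 0.0, "dropout": 0.0,
    })
    for layers, size, act in product([1, 2, 3, 4],
                                     [32, 64, 128, 256],
                                     ["relu", "tanh"])
}
best_arch = min(arch_scores, key=arch_scores.get)  # configuration c

# Stage 2: fix c and search the regularization axes D-F.
layers, size, act = best_arch
reg_scores = {
    (l1, l2, drop): evaluate({
        "n_hidden_layers": layers, "layer_size": size, "activation": act,
        "l1": l1, "l2": l2, "dropout": drop,
    })
    for l1, l2, drop in product([0.0, 1e-5, 1e-4],
                                [0.0, 1e-5, 1e-4],
                                [0.0, 0.2, 0.5])
}
best_reg = min(reg_scores, key=reg_scores.get)
```

With the toy grids above this is 32 + 27 = 59 training runs instead of 864 for the full grid, which is the appeal of staging the search.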
Is this a sensible approach to take in this case, or could I, in theory, end up with a lower final error (i.e. after regularization) by using a network configuration that showed a higher error in the first grid search (over A-C)?