I am in an epic debate with a colleague who claims that reducing the number of hiddens is the best way to deal with overtraining.
While it can be demonstrated that the generalization error of such a small net decreases during training, it ultimately will not reach the level that more hiddens combined with early stopping can achieve.
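For concreteness, here is the kind of comparison I have in mind: a minimal sketch using scikit-learn on synthetic data (the layer sizes and the dataset are made up for illustration, not taken from our project).

    # Sketch: few hiddens trained to convergence vs. many hiddens with
    # early stopping on a held-out validation split. Toy data only.
    from sklearn.datasets import make_regression
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPRegressor

    X, y = make_regression(n_samples=2000, n_features=20, noise=10.0,
                           random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Colleague's position: a small net, trained until convergence.
    small = MLPRegressor(hidden_layer_sizes=(4,), max_iter=5000,
                         random_state=0)

    # My position: many hiddens, stopped when the validation score
    # stops improving for n_iter_no_change consecutive epochs.
    large = MLPRegressor(hidden_layer_sizes=(200,), early_stopping=True,
                         validation_fraction=0.15, n_iter_no_change=10,
                         max_iter=5000, random_state=0)

    for name, net in [("small, converged", small),
                      ("large, early-stopped", large)]:
        net.fit(X_train, y_train)
        print(name, "test R^2:", net.score(X_test, y_test))

On toy problems like this the gap can go either way, which is exactly why I need a principled argument rather than an anecdote.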
I believe our project involves many types of ill-conditioning, of which nonstationarity is just one. I believe large numbers of hiddens are required to handle these issues, which could be likened to distinct classes of inputs.
While this seems intuitive to me, I can't make a convincing argument.