I am in an epic debate with a colleague who claims that reducing the number of hiddens is the best way to deal with overtraining.
While it can be demonstrated that the generalization error of such a small net decreases during training, it ultimately will not reach the level that more hiddens combined with early stopping can achieve.
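For concreteness, here is the kind of comparison I have in mind: a minimal sketch using scikit-learn on synthetic data (the layer sizes and the dataset are made up for illustration, not taken from our project).

    # Sketch: few hiddens trained to convergence vs. many hiddens with
    # early stopping on a held-out validation split. Toy data only.
    from sklearn.datasets import make_regression
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPRegressor

    X, y = make_regression(n_samples=2000, n_features=20, noise=10.0,
                           random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Colleague's position: a small net, trained until convergence.
    small = MLPRegressor(hidden_layer_sizes=(4,), max_iter=5000,
                         random_state=0)

    # My position: many hiddens, stopped when the validation score
    # stops improving for n_iter_no_change consecutive epochs.
    large = MLPRegressor(hidden_layer_sizes=(200,), early_stopping=True,
                         validation_fraction=0.15, n_iter_no_change=10,
                         max_iter=5000, random_state=0)

    for name, net in [("small, converged", small),
                      ("large, early-stopped", large)]:
        net.fit(X_train, y_train)
        print(name, "test R^2:", net.score(X_test, y_test))

On toy problems like this the gap can go either way, which is exactly why I need a principled argument rather than an anecdote.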
I believe our project involves many types of ill-conditioning, of which nonstationarity is just one. I believe large numbers of hiddens are required to handle these issues, which could be likened to distinct classes of inputs.
While this seems intuitive to me, I can't make a convincing argument.