What is going on here (bottom line)?
What you are experiencing is a strong sign of over-fitting. Details follow.
How is this even possible?
Both regularization and random hash-collisions are discounting mechanisms for a subset of model features. They pick some features and make them less important (or discount them completely) in the model.
But that's where the similarity ends. The two discounting mechanisms are very different in:
- The subset of features they discount (specific vs random)
- The discounting method (complete vs partial-blend)
- The discounting direction and magnitude
Specific vs random discounting
L1 regularization (`--l1 ...`) picks very specific weights (the ones nearest zero with respect to the norm, i.e. the smallest in absolute value), whereas random hash collisions "pick" random weights, some of which may be large.
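To make the contrast concrete, here is a small numpy sketch (not vw's actual implementation; the weights and the ~30% collision rate are made up) of the two selection rules: L1's soft-thresholding zeroes only the weights nearest zero, while collisions hit a random subset regardless of magnitude.

```python
# Sketch only: contrast which weights each mechanism touches.
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=10)                 # pretend these are learned feature weights

# L1-style discounting: soft-thresholding, so only the weights nearest zero vanish.
lam = 0.5
w_l1 = np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

# Collision-style discounting: a random subset is affected, large weights included.
collided = rng.random(w.size) < 0.3     # ~30% of features land in shared buckets
print("zeroed by L1:      ", np.flatnonzero(w_l1 == 0))
print("hit by collisions: ", np.flatnonzero(collided))
```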
Complete vs partial-blend discounting
L1 regularization completely trims/removes the weights it selects, whereas random hash-collisions create blends with other features. In one sense (effect-mixing) blends are similar, though not identical, to what `vw` feature-crossing with `-q <XY>` does.
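As an illustration of that blending, here is a toy hashing-trick sketch (the feature names and the 4-bucket table are hypothetical, and the hash is not vw's): when distinct features land in the same bucket they share a single weight, so every update for one of them also moves the others.

```python
# Toy hashing trick: a tiny table forces collisions, i.e. shared weight slots.
import zlib
import numpy as np

n_buckets = 2 ** 2                          # deliberately tiny: 6 features, 4 slots,
                                            # so at least one collision is guaranteed
features = ["user=alice", "ad=shoes", "hour=23", "geo=nyc", "day=mon", "os=ios"]
slot = {f: zlib.crc32(f.encode()) % n_buckets for f in features}
print(slot)                                 # colliding features show the same index

w = np.zeros(n_buckets)
w[slot["user=alice"]] += 0.1                # this update also affects every feature
                                            # hashed to the same slot
```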
Direction & magnitude of discounting
Unlike the specific selection of weights by absolute value, random selection can affect weights of any magnitude and of either sign.
Moreover, a feature might be a bad feature when considered alone, yet actually contribute when considered in combination with another feature. One example would be a blend of two bad features, one with a positive weight and one with a negative weight. By partially canceling each other out, they might produce a plausibly good feature that is somewhat correlated with the target label. IOW: feature blending can sometimes turn a bad effect into a good effect.
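Here is a hypothetical numeric illustration of that cancellation (the `signal`/`noise` split and all constants are invented for the example): each feature alone is dominated by a nuisance component, but because the nuisance enters the two features with opposite signs, their blend correlates far better with the label.

```python
# Made-up example: two noisy features whose opposite-signed noise cancels in a blend.
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
signal = rng.normal(size=n)              # the part that actually predicts y
noise = 5.0 * rng.normal(size=n)         # large nuisance component
y = signal

a = signal + noise                       # "bad" feature: mostly nuisance
b = signal - noise                       # "bad" feature: same nuisance, opposite sign
blend = a + b                            # roughly what a shared weight ends up seeing

for name, x in [("a", a), ("b", b), ("a+b", blend)]:
    print(f"corr({name}, y) = {np.corrcoef(x, y)[0, 1]:.2f}")
```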
How can discounting help (or hurt) a model?
Discounting features like this is very common in machine learning, especially in big-data problems, and whether feature selection, trimming, or blending improves accuracy is data-dependent. If the mechanism happens to discount a "bad" feature (one that agrees with the training data but doesn't help generalization), it makes the model better. However, if it discounts a good, well-generalizing feature, it makes the model appear worse in testing (on out-of-sample data).
Related: the idea of randomly discounting or dropping features, even potentially important ones, has proven to be a powerful and beneficial technique in deep neural nets (NNs), where it is called dropout. Dropout has become the de rigueur method to avoid over-fitting in deep learning.
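For reference, a minimal (inverted) dropout sketch in numpy, assuming a keep-probability `p_keep`; deep-learning frameworks provide this as a built-in layer, so this is only meant to show the idea.

```python
# Minimal inverted dropout: zero random units at train time, scale the survivors.
import numpy as np

def dropout(activations, p_keep=0.8, rng=np.random.default_rng(0)):
    """Keep each unit with probability p_keep and rescale by 1/p_keep,
    so the expected activation is unchanged; at test time, apply nothing."""
    mask = rng.random(activations.shape) < p_keep
    return activations * mask / p_keep

h = np.ones((2, 5))                      # a fake layer of activations
print(dropout(h))                        # ~20% of entries zeroed, the rest scaled to 1.25
```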
Bottom line
Creating good models takes practice. When you have a very large number of features, over-fitting due to random effects (involving both small and large weights) is possible, and it needs to be avoided.
There are many ways to reduce over-fitting; regularization is just one of them. Sometimes a more aggressive, random approach, one that affects all features rather than just those with low weights, turns out to be beneficial overall.
When this happens, it is a hint that the number of features may be too large and that you may be over-fitting.
Generally, I would suspect any model where the number of examples (data-set rows) used to train is not much larger than the number of features (distinct data-set columns). If you have millions (10^6) of features, you should probably have on the order of 10^12 (trillions) of examples to avoid over-fitting.
Another thing I would do with a large number of features is to randomly shuffle the order of examples and blend several models, to ensure that a specific order doesn't cause over-fitting. Online learning tends to over-weight early examples over later ones due to learning-rate decay.
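A sketch of that shuffle-and-blend idea (the tiny SGD learner, the synthetic data, and the five seeds are all stand-ins, not vw internals): train the same one-pass online learner on several random orderings of the data and average the resulting weight vectors.

```python
# Shuffle-and-blend sketch: several random example orders, one averaged model.
import numpy as np

def online_sgd(X, y, lr0=0.02, decay=0.001):
    """One pass of SGD for squared loss with a decaying learning rate."""
    w = np.zeros(X.shape[1])
    for t, (x, target) in enumerate(zip(X, y)):
        lr = lr0 / (1.0 + decay * t)
        w += lr * (target - x @ w) * x   # gradient step on 0.5 * (x.w - target)^2
    return w

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 20))
y = X @ rng.normal(size=20) + 0.1 * rng.normal(size=1000)

models = [online_sgd(X[order], y[order])
          for order in (np.random.default_rng(s).permutation(len(y)) for s in range(5))]
w_blend = np.mean(models, axis=0)        # the blended (averaged) model
print(w_blend[:5])
```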