Overfitting, here, is likely due to the model being oversized for the data/task. It does have enough internal state/complexity to memorize your training set, including nonsense idiosyncratic details about individual examples that help it look up (rather than generalizably deduce) the 'right answer' for those.
(An interesting comparison to apply: if you save the model to disk, is it larger – perhaps much larger – than your training data? In a very real sense, a lot of machine learning is compression, and any time your model is close to, or larger than, the size of the training data, overfitting is likely.)
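As a rough command-line version of that check, with hypothetical file names (a training file `train.txt`, and a model trained with `-output model`, so fastText wrote `model.bin`):

```sh
# Compare on-disk sizes; a model far larger than the data it was trained on
# has plenty of room to memorize rather than generalize.
ls -lh train.txt model.bin
```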
In such cases, two major things to try are to get more data, or shrink the model - so that it's forced to learn rules, not become a big 1:1 lookup table.
The main ways to shrink the model (combined in the example sketch a bit further below):

- shrink the vectors (lower `-dim`)
- discard the bigram features (return to the default `-wordNgrams 1`)
- discard more of the rarer words/bigrams (`-minCount` higher than the default – keeping rare words often weakens word-vector models, and in a classification task, any singleton words always associated in training with a single label are highly likely to overwhelm other influences, if they're not truly reliable signals)
- reduce the number of subword/wordgram buckets (lower `-bucket` value)
Separately, `-lr 1.0` is way, way higher than typical values of 0.025 to 0.1 (the `supervised`-mode default), so that might be worth changing to a more typical range, too (as in the sketch below).
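For instance, here's a minimal sketch of a run combining those changes, using hypothetical file names and illustrative (not tuned) values:

```sh
# Illustrative only: shrink capacity & return to the default learning rate.
# Supervised-mode defaults, for reference: -dim 100, -wordNgrams 1,
# -minCount 1, -bucket 2000000, -lr 0.1.
./fasttext supervised -input train.txt -output model_small \
    -dim 50 -wordNgrams 1 -minCount 5 -bucket 200000 -lr 0.1
```

Then comparing held-out performance (for example via `./fasttext test model_small.bin valid.txt`) against training-set performance will show whether the train/validation gap has actually narrowed.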
With regard to the idea that proper regularization might remedy any amount of model oversizing, as suggested in your comment:
The Fasttext algorithm & its common implementations don't specify any standard or proven regularizations that can fix an oversized model. Choosing one or more approaches, and adding them to the operation of a major Fasttext implementation, & evaluating their success, would involve your own customizations/extensions.
Further, I've not noticed any work demonstrating a regularization that can remedy an oversized shallow-neural-network word-vector model (like word2vec or Fasttext). Though I may have overlooked something – & in such a case would love pointers! – that absence suggests it may not be a preferred approach, compared to the usual "shrink model" or "find more data" tactics.
Looking up the context of Ng's quote, he's talking about the circumstances of "modern deep learning", with the additional caveat of "so long as you can keep getting more data".
Further, word-vector algorithms like word2vec or Fasttext aren't really 'deep' learning – they use only a single 'hidden layer'. While definitions vary a bit, & these algorithms are definitely a stepping-stone to deeper neural-networks, I believe most practitioners would call them "shallow learning" using only a "shallow neural network".
Here's Ng's quote, as attributed to a Coursera lecture here, with more context and my added emphases and paragraph breaks:
So a couple of points to notice. First is that, depending on whether you have high bias or high variance, the set of things you should try could be quite different. So I'll usually use the training dev set to try to diagnose if you have a bias or variance problem, and then use that to select the appropriate subset of things to try.
So for example, if you actually have a high bias problem, getting more training data is actually not going to help. Or at least it's not the most efficient thing to do. So being clear on how much of a bias problem or variance problem or both can help you focus on selecting the most useful things to try.
Second, in the earlier era of machine learning, there used to be a lot of discussion on what is called the bias variance tradeoff. And the reason for that was that, for a lot of the things you could try, you could increase bias and reduce variance, or reduce bias and increase variance.
But back in the pre-deep learning era, we didn't have many tools, we didn't have as many tools that just reduce bias or that just reduce variance without hurting the other one.
But in the modern deep learning, big data era, so long as you can keep training a bigger network, and so long as you can keep getting more data, which isn't always the case for either of these, but if that's the case, then getting a bigger network almost always just reduces your bias without necessarily hurting your variance, so long as you regularize appropriately.
And getting more data pretty much always reduces your variance and doesn't hurt your bias much.
So what's really happened is that, with these two steps, the ability to train, pick a network, or get more data, we now have tools to drive down bias and just drive down bias, or drive down variance and just drive down variance, without really hurting the other thing that much.
And I think this has been one of the big reasons that deep learning has been so useful for supervised learning, that there's much less of this tradeoff where you have to carefully balance bias and variance, but sometimes you just have more options for reducing bias or reducing variance without necessarily increasing the other one.
And, in fact, [inaudible] you have a well regularized network. We'll talk about regularization starting from the next video. Training a bigger network almost never hurts. And the main cost of training a neural network that's too big is just computational time, so long as you're regularizing.
So, it'd likely make for an interesting experiment/article whether the usual regularization techniques can cure shallow word-vector model overfitting, including in extreme cases where the model remains larger than the training data.
But such a hypothesized fix isn't available as an off-the-shelf option.