
I would like to show an example of a model that overfits the test set and does not generalize well to future data.

I split the news dataset into 3 sets:

train set length: 11314
test set length: 5500
future set length: 2031

I am using a text dataset and building a CountVectorizer. I am running a grid search (without cross-validation); each loop tests some parameters of the vectorizer ('min_df', 'max_df') and some parameters of my LogisticRegression model ('C', 'fit_intercept', 'tol', ...). The best result I get is:

({'binary': False, 'max_df': 1.0, 'min_df': 1},
 {'C': 0.1, 'fit_intercept': True, 'tol': 0.0001},
 test set score: 0.64018181818181819,
 training set score: 0.92902598550468451)

But if I now run it on the future set, I get a score similar to the test set score:

clf.score(X_future, y_future): 0.6509108813392418
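
For reference, this is roughly what the selection loop looks like. It is only a sketch: I am assuming the 20 Newsgroups data here (which matches my set sizes), and the parameter grids below are illustrative, not the exact ones I searched.

    from itertools import product

    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression

    train = fetch_20newsgroups(subset='train')
    test = fetch_20newsgroups(subset='test')   # later split into test and future sets

    # illustrative parameter grids
    vect_grid = [{'min_df': m, 'max_df': M, 'binary': b}
                 for m, M, b in product([1, 2], [0.5, 1.0], [False, True])]
    clf_grid = [{'C': c, 'fit_intercept': f, 'tol': t}
                for c, f, t in product([0.1, 1.0, 10.0], [True, False], [1e-4])]

    best = None
    for v_params, c_params in product(vect_grid, clf_grid):
        vect = CountVectorizer(**v_params)
        X_train = vect.fit_transform(train.data)
        X_test = vect.transform(test.data)
        clf = LogisticRegression(**c_params).fit(X_train, train.target)
        test_score = clf.score(X_test, test.target)   # parameters are selected on the test set
        if best is None or test_score > best[0]:
            best = (test_score, v_params, c_params,
                    clf.score(X_train, train.target))

    print(best)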

How can I demonstrate a case where I have overfitted the test set, so that the model does not generalize well to future data?

rolele

2 Answers


You have a model trained on some data, the "train set".

Performing the classification task on these data, you get a score of 92%.

Then you take new data, not seen during training, such as the "test set" or the "future set".

Performing the classification task on either of these unseen datasets, you get a score of 65%.

This is exactly the definition of a model that is overfitting: it has very high variance, i.e. a big gap in performance between seen and unseen data.

By the way, taking your specific case into account, some parameter choices that could cause overfitting are the following:

  • min_df = 0
  • high C value for logistic regression (which means low regularization); see the sketch after this list
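
As a quick illustrative check of that last point (just a sketch, using 20 Newsgroups as a stand-in for your data and arbitrary C values), a large C usually widens the gap between seen and unseen performance:

    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression

    train = fetch_20newsgroups(subset='train')
    test = fetch_20newsgroups(subset='test')

    vect = CountVectorizer()
    X_train = vect.fit_transform(train.data)
    X_test = vect.transform(test.data)

    for C in (0.01, 100.0):   # small C = strong regularization, large C = weak
        clf = LogisticRegression(C=C, max_iter=1000).fit(X_train, train.target)
        print(C,
              clf.score(X_train, train.target),   # accuracy on seen data
              clf.score(X_test, test.target))     # accuracy on unseen data
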
alsora
  • Thank you, it is pretty clear. I know that it is possible to build a model that will overfit the test set instead of the training set. Can you think of a strategy to generate such a model? Because I am using the score on the test set to adjust the parameters, this could happen in theory, but it is pretty difficult to do. – rolele Mar 13 '18 at 23:00
  • There is no way to "over-fit" the test set, because overfitting implies something negative. A theoretical model that fits the test set at 92% but fits the training set at only 65% is a very good model indeed (assuming your sets are balanced). Also, don't use the test set to adjust parameters. EDIT: I think what you are referring to as your "test set" might actually be a validation set, and your "future set" is actually the test set. – Tryer Mar 13 '18 at 23:32
  • Yes, you are right. To give you all the information, I am trying to make sense of this video I saw: https://www.youtube.com/watch?v=Sombn6OSvZU . I am trying to simulate what happened to the Kaggle users who went down on the private dataset. It seems that they were doing very well on the public test dataset but not on the private test dataset, so I guess they used the public test dataset as a validation set (and overfit it) and did not generalize well to future data (the private test dataset). – rolele Mar 13 '18 at 23:42

I wrote a comment on alsora's answer but I think I really should expand on it as an actual answer.

As I said, there is no way to "over-fit" the test set because over-fit implies something negative. A theoretical model that fits the test set at 92% but fits the training set to only 65% is a very good model indeed (assuming your sets are balanced).

I think what you are referring to as your "test set" might actually be a validation set, and your "future set" is actually the test set. Let's clarify.

You have a set of 18,845 examples. You divide them into 3 sets.

Training set: The examples the model gets to look at and learn from. Every time your model makes a guess on an example from this set, you tell it whether it was right or wrong, and it adjusts accordingly.

Validation set: After every epoch (a pass through the training set), you check the model on these examples, which it has never seen before. You compare the training loss and training accuracy to the validation loss and validation accuracy. If training accuracy > validation accuracy or training loss < validation loss, then your model is over-fitting and training should stop. You can either stop it early (early stopping) or add dropout. You should not give feedback to your model based on examples from the validation set. As long as you follow the above rule, and as long as your validation set is well mixed, you can't over-fit this data.

Testing set: Used to assess the accuracy of your model once training has completed. This is the one that matters, because it is based on examples your model has never seen before. Again, you can't over-fit this data.

Of your 18,845 examples, you have 11,314 in the training set, 5,500 in the validation set, and 2,031 in the testing set.
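
As a sketch of such a split (assuming scikit-learn, and using the 20 Newsgroups corpus as a stand-in since its roughly 18,846 documents approximately match your counts), two calls to train_test_split are enough:

    from sklearn.datasets import fetch_20newsgroups
    from sklearn.model_selection import train_test_split

    data = fetch_20newsgroups(subset='all')

    # carve off the testing set first, then split the rest into training/validation
    X_rest, X_test, y_rest, y_test = train_test_split(
        data.data, data.target, test_size=2031, random_state=0, stratify=data.target)
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=5500, random_state=0, stratify=y_rest)

    print(len(X_train), len(X_val), len(X_test))   # roughly 11314, 5500, 2031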

Tryer
  • Thanks Tryer, you are right. I am trying to simulate the mistake that some Kaggle users made in this video https://www.youtube.com/watch?v=Sombn6OSvZU : they basically overfit the public test dataset and failed to generalize to the private dataset. I am trying to reproduce what happened to them using a simple example. – rolele Mar 13 '18 at 23:49
  • If I were really trying to solve the problem, I would perform cross-validation on the training set using k-fold to create many different validation sets, and then use the test set to check that the model generalizes well. But what I am trying to do is simulate the mistake some people made (as you can see in my previous comment). – rolele Mar 13 '18 at 23:52
  • @rolele Ok, I understand. One way a model can do well on the validation set but poorly on the test set is if the validation set is not well mixed compared to the test set. As an example, consider a task with 3 classes. If the test set is divided evenly (33% of examples per class) but the validation set is divided poorly (say 80% class 1, 10% class 2, 10% class 3), then that would lead to the same phenomenon. I'm not sure how to craft such a situation for your problem, but purposely imbalancing the validation set may achieve it. – Tryer Mar 13 '18 at 23:55
  • Yes, I see. Now that I am thinking outside the box, it is possible that those users mixed the public test set into the training set and trained on the whole thing. Doing that, you would do well on the public test set because the model overfits it. – rolele Mar 14 '18 at 00:03