
I have 3.25 years of time-based data and I'm using scikit-learn's RandomForestClassifier to try to classify live data as it comes in. My dataset has roughly 75,000 rows and 1,100 columns. My train/test split is chronological: the first 3 years (66,000 rows) for training, and the last 0.25 years (3 months, or 9,000 rows) for testing.
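Roughly, the split looks like this (just a sketch; the file name and the `timestamp` column are placeholders for my actual data):

```python
import pandas as pd

# Sketch of the chronological split; file name and column name are placeholders.
df = pd.read_csv("data.csv", parse_dates=["timestamp"]).sort_values("timestamp")

cutoff = df["timestamp"].max() - pd.DateOffset(months=3)
train = df[df["timestamp"] <= cutoff]  # first ~3 years, ~66,000 rows
test = df[df["timestamp"] > cutoff]    # last ~3 months, ~9,000 rows
```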

Since there's variability each time you train, I don't always see good precision when classifying the test data... but sometimes I do. So what I've tried is re-training the classifier over and over until I do see good precision on the test data, then saving that version to disk for use in live classification as new data comes in.
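In outline, the retrain-and-save loop looks something like this (a sketch only; `X_train`/`y_train`/`X_test`/`y_test` stand in for my feature/label arrays, and the 0.90 precision threshold is just illustrative):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score
import joblib

# Keep retraining with a different seed until precision on the held-out test
# period looks good, then persist that model for live classification.
for seed in range(100):
    clf = RandomForestClassifier(n_estimators=32, random_state=seed)
    clf.fit(X_train, y_train)
    if precision_score(y_test, clf.predict(X_test)) >= 0.90:  # illustrative threshold
        joblib.dump(clf, "model.pkl")
        break
```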

Some may say this is over-fitting the model to the test data... which is likely true. But I have decided that, due to the randomness in training, finding a good fit on the first iteration versus the 100th makes no difference, because the iteration on which a good fit occurs is entirely a matter of chance. Hence my determination to keep re-training until I find a good fit.

What I've seen is that I can find a fit with good, stable precision across the entire 3-month test period, but when I then use that model to classify live data as it comes in during the 4th month, it's not stable and the precision is drastically worse.

Question 1: how could a model have great/stable precision for 3 months straight but then flounder in the 4th month?

Question 2: how can I change or augment my setup or process to achieve classification precision stability on live data?

asked by brettlyman
  • Re "finding a good fit on the first iteration versus the 100th makes no difference": this heavily depends on the amount of test data and is in general wrong. Please read about multiple hypothesis testing. – Alleo Jun 03 '16 at 10:27
  • You might consider asking this at http://stats.stackexchange.com/ instead. But you are definitely not using test data in the right way. I'd say there is a good chance you are overfitting your data, as this is exactly how you should not be using a test set; use CV on the training data instead to find good fits. And in general, the more things you try (e.g., running training over and over until it suddenly works), the higher the chance you will overfit and be disappointed in real tests. However, there is also some chance your test set is simply not representative of the real data. – Alex A. Jun 03 '16 at 10:32
  • @AlexA. - regarding your statement that the test set might not be representative of the real data, would you suggest doing a random train/test split of the dataset via something like sklearn.cross_validation.train_test_split, rather than holding out a contiguous period of data? – brettlyman Jun 03 '16 at 10:41
  • If there is enough data to go around, I would prefer both: hold out a period of data for later, once you have picked a good model, and use, for example, cross_validation.StratifiedKFold or train_test_split on the training set to form validation sets during training. And if your model has a high variance in accuracy, I would consider tweaking its parameters, like n_estimators, instead of just running N times until it works, as this will very likely give misleading results. – Alex A. Jun 03 '16 at 10:54
  • However, to be clear, if the test set is not representative of real data, that might also be true for the training data, and then you have a much bigger problem. See if you can verify the sources of the training data, test data and live data, and if they differ in some way. – Alex A. Jun 03 '16 at 12:20
  • You could also try ensembling different models (e.g. RF + GBM) or increasing the number of trees in your RF. – Stergios Jun 03 '16 at 14:13
  • @AlexA. - the train/test/live data all come from the same source, only vary by what's actually happening during those times. – brettlyman Jun 03 '16 at 15:02
  • How many trees are you using? If you're using the default, you only have 10, which would be too few and would explain your variability. Also, what's the balance of your classes? – Tchotchke Jun 03 '16 at 15:50
  • @Tchotchke - I'm using 32 trees, and it's binary classification of which 75% are 0, and the other 25% are 1 – brettlyman Jun 03 '16 at 17:50
  • The number of trees is your problem - you'll likely need a few thousand. I'd highly recommend reading An Introduction to Statistical Learning and then asking any follow-up questions on Cross Validated. – Tchotchke Jun 03 '16 at 18:11
  • @Tchotchke - I've read the materials, and don't remember suggestions on the number of trees to use, and usually the examples are pretty basic and use a very small number of trees. Usually they say "it depends on the dataset", etc. Can you point me to a specific article/book/link? – brettlyman Jun 03 '16 at 19:51
  • There's no suggestion because it's a tuning parameter, just like `mtry`. However, you can see from the charts in the RF chapter that the error will level off as you increase the number of trees. That point would be approximately the same each time you run a RF, but will differ by dataset. Since RF almost never overfits, you can use a relatively high number of trees (>500 or even a few thousand) and you won't see the variability you're seeing. I'd also recommend looking at other algorithms - `xgboost` is particularly popular at the moment. – Tchotchke Jun 03 '16 at 21:39
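A rough sketch of the last comment's suggestion, assuming the same `X_train`/`y_train` as in the question (the `n_estimators` values are arbitrary): grow the forest and check where the out-of-bag error levels off.

```python
from sklearn.ensemble import RandomForestClassifier

# Out-of-bag error should stabilize as the number of trees grows.
for n in (32, 100, 500, 2000):
    clf = RandomForestClassifier(n_estimators=n, oob_score=True,
                                 n_jobs=-1, random_state=0)
    clf.fit(X_train, y_train)
    print(n, "OOB error:", 1 - clf.oob_score_)
```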

1 Answer


If you take this approach, you need another test set.

What you are doing is validation. There is indeed a big risk of overfitting to the test set.

Split your data into three parts: 80% training, 10% validation, 10% test.

Train multiple classifiers and keep the one that performs best on the validation set. Then use the test set to verify that you indeed have a working classifier. If the performance on the validation set and the test set differs a lot, that is very bad news (check this for all your classifiers!).
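For example, something like the following sketch, which assumes the data is already in chronological order as a feature matrix `X` and labels `y` (the candidate settings are arbitrary):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score

# Chronological 80/10/10 split so validation and test data come after the training period.
n = len(X)
i_val, i_test = int(0.8 * n), int(0.9 * n)
X_train, y_train = X[:i_val], y[:i_val]
X_val, y_val = X[i_val:i_test], y[i_val:i_test]
X_test, y_test = X[i_test:], y[i_test:]

# Train several candidates; pick the best using the validation set only.
candidates = [RandomForestClassifier(n_estimators=k, random_state=0)
              for k in (100, 500, 1000)]
val_scores = []
for clf in candidates:
    clf.fit(X_train, y_train)
    val_scores.append(precision_score(y_val, clf.predict(X_val)))
best = candidates[int(np.argmax(val_scores))]

# Touch the test set once, only to confirm the chosen model generalizes.
print("validation precision:", max(val_scores))
print("test precision:      ", precision_score(y_test, best.predict(X_test)))
```

If the validation and test precision diverge sharply, the selection has overfit the validation set and the setup needs rethinking.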

answered by Has QUIT--Anony-Mousse