The question is about a wrongly chosen train/test split strategy for a RandomForest model. I know choosing the test set this way gives misleadingly good results, but I would like to understand why.
(The model looks at previous days of data and tries to predict whether the next day's close will be higher or lower than today's, i.e. a classification problem.)
I copied the train/test split code from another example; it simply assigns random rows to either the train set or the test set (sketched below).
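A minimal sketch of that split, with dummy stand-ins for my real feature matrix and labels (the names and shapes here are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy stand-ins for the real feature matrix and labels
X = np.random.rand(200, 5)           # 200 days, 5 lag-based features per row
y = np.random.randint(0, 2, 200)     # 1 = next day's close is higher

# The copied split: shuffle=True (the default), so train and test rows
# end up interleaved in time
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42
)
```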
The raw data is daily Close values of, for example, EURUSD.
I then create features based on that. Each feature looks at some number of previous data points, and together they form one row in X. I then train a random forest model to predict the next day's direction, roughly like this:
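A sketch of the feature construction, assuming lagged daily returns as features (my actual features differ, but the shape of the problem is the same; `make_features` and the lag-return columns are hypothetical):

```python
import pandas as pd

def make_features(close: pd.Series, n_lags: int = 5):
    """Lagged daily returns as features, next-day up/down as the label."""
    returns = close.pct_change()
    X = pd.DataFrame({f"ret_lag_{k}": returns.shift(k) for k in range(n_lags)})
    y = (close.shift(-1) > close).astype(int)  # 1 = tomorrow closes higher
    # Drop rows where a lag or the next-day label is undefined
    valid = X.notna().all(axis=1) & close.shift(-1).notna()
    return X[valid], y[valid]
```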
The accuracy on the test set is very high, and it increases with the number of historical points the features look back over, which seemed to me to suggest overfitting.
When I change the train/test split so that, for example, the train set is data from January-June and the test set is data from August, i.e. completely separate periods with no possible mixing, the accuracy drops to a more realistic 50%.
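The time-based split looks roughly like this (the dates are placeholders; this assumes `X` and `y` carry the DatetimeIndex of the original Close series, as in the feature sketch above):

```python
from sklearn.ensemble import RandomForestClassifier

# Train strictly before test: no interleaving of periods
train_mask = (X.index >= "2020-01-01") & (X.index <= "2020-06-30")
test_mask = (X.index >= "2020-08-01") & (X.index <= "2020-08-31")

model = RandomForestClassifier(random_state=42)
model.fit(X[train_mask], y[train_mask])
print("August accuracy:", model.score(X[test_mask], y[test_mask]))
```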
Again, I know the random train/test split is not correct, but can someone help me understand why?
For every row I validate (i.e. one prediction in the test set), the features look only at previous data to predict the next day's direction. How come there is still overfitting?