
I know the general rule that a trained classifier should only be tested on the testing set.

But now comes the question: once I have a trained and tested classifier ready, can I apply it to the same dataset that the training and testing sets were drawn from? Or do I have to apply it to a new predicting set that is disjoint from the training+testing set?

And what if I predict a label column of a time series? (Edited later: I do not mean a classical time-series analysis here, but a broad selection of columns from a typical database, with weekly, monthly, or randomly stored data that I convert into separate feature columns, one per week / month / year ...) Do I have to shift all of the features of the training+testing set (not just the past columns of the time-series label column, but all of the other regular features as well) back to a point in time where the data has no "knowledge" overlap with the predicting set?

I would then train and test the classifier on features shifted n months into the past, scoring against a label column that is unshifted and most recent, and then predict from the most recent, unshifted features. Shifted and unshifted features have the same number of columns; I align them by assigning the column names of the shifted features to the unshifted features.
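The alignment described above can be sketched in pandas. This is a minimal illustration with hypothetical column names (`feature_jan`, `feature_feb`, `label_feb`), not the actual dataset from the question:

```python
import pandas as pd

# Hypothetical monthly layout: one row per entity, one feature column per month.
df = pd.DataFrame({
    "feature_jan": [1.0, 2.0, 3.0],
    "feature_feb": [1.5, 2.5, 3.5],
    "label_feb":   [0, 1, 1],
})

# Training/testing: features shifted one month into the past (January),
# scored against the unshifted, most recent labels (February).
X_train = df[["feature_jan"]]
y_train = df["label_feb"]

# Predicting: the most recent, unshifted features (February), renamed to
# the training column names so the classifier recognizes them.
X_pred = df[["feature_feb"]].rename(columns={"feature_feb": "feature_jan"})
```

After the rename, `X_pred` has exactly the columns the classifier was fitted on, which is the alignment trick the question describes.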


p.s.1: The general approach, from https://en.wikipedia.org/wiki/Dependent_and_independent_variables:

In data mining tools (for multivariate statistics and machine learning), the dependent variable is assigned a role as target variable (or in some tools as label attribute), while an independent variable may be assigned a role as regular variable.[8] Known values for the target variable are provided for the training data set and test data set, but should be predicted for other data.

p.s.2: In this basic tutorial we can see that the predicting set is kept separate: https://scikit-learn.org/stable/tutorial/basic/tutorial.html

We select the training set with the [:-1] Python syntax, which produces a new array that contains all but the last item from digits.data: […] Now you can predict new values. In this case, you'll predict using the last image from digits.data [-1:]. By predicting, you'll determine the image from the training set that best matches the last image.

questionto42

2 Answers


I think you are mixing up some concepts, so I will try to give a general explanation for Supervised Learning.

  • The training set is what your algorithm LEARNS on. You split it into X (features) and Y (target variable).
  • The test set is a set that you use to SCORE your model, and it must contain data that was not in the training set. This means that a test set also has X and Y (meaning that you know the value of the target). What happens is that you PREDICT f(X) based on X, compare it with the Y you have, and see how good your predictions are.
  • A prediction set is simply new data! This means that usually you DO NOT have a target, since the whole point of supervised learning is predicting it. You will only have your X (features); you will compute f(X) (your estimate of the target Y) and use it for whatever you need.

So, in the end a test set is simply a prediction set for which you have a target to compare your estimation to.
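The three roles above can be sketched with scikit-learn. This is a minimal illustration using the bundled iris dataset as a stand-in; the last few test rows are reused as a mock "prediction set", since genuinely new data is not available in a self-contained example:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Training set (learn) vs. test set (score): the test set keeps its
# labels so the predictions can be compared against known targets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)  # compare f(X_test) with known y_test

# A "prediction set" is just new feature rows with no target at all.
X_new = X_test[:5]            # stand-in for genuinely new data
predictions = clf.predict(X_new)
```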

For time series, it is a bit more complicated, because often the features (X) are transformations on past data of the target variable (Y). For example, if you want to predict today's SP500 price, you might want to use the average of the last 30 days as a feature. This means that for every new day, you need to recompute this feature over the past days.
In general though, I would suggest starting with non-time-series data if you're new to ML, as time series is much harder in terms of feature engineering and data management, and it is easy to make mistakes.
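The rolling-average feature mentioned above can be sketched in pandas. The price series here is synthetic (a random walk standing in for the SP500); the key point is the `shift(1)`, which keeps today's value out of its own feature:

```python
import numpy as np
import pandas as pd

# Hypothetical daily price series standing in for the SP500.
rng = np.random.default_rng(0)
prices = pd.Series(100 + rng.normal(0, 1, 200).cumsum(),
                   index=pd.date_range("2020-01-01", periods=200))

# Feature: 30-day average of PAST prices only. shift(1) excludes today,
# so the feature never "knows" the value it is meant to predict.
feature_30d = prices.shift(1).rolling(30).mean()

frame = pd.DataFrame({"target": prices, "mean_30d": feature_30d}).dropna()
```

As the answer says, this feature has to be recomputed for every new day, because its window moves with the prediction date.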

Davide ND
  • Thank you! What I am doing now is to shift all features to the first database timestamp, month or date *at least 1 month before label date*, which is mostly a shift to month "-1", and then I train and test the classifier with the labels of month "0". With that classifier, I will use features of month "0" to predict labels of month "+1". I must also use the past time-series labels as features, thus I rename the predicting set's feature-labels from Feb-Dec to the training/testing set's feature-labels Jan-Nov so that the classifier recognizes the names. – questionto42 Dec 06 '19 at 19:07
  • It is just strange that all of this time shifting is not mentioned in basic tutorials. It is quite some work if you have a lot of features and some time-series features as well. If I want to have my classifier to be trained on the most recent possible data, I will have to train it on data shifted slightly into the past, if my database is not just a collection of pictures to analyse, but a normal database of columns changing over time. I have not found any basic tutorial or question about this after normal search time. – questionto42 Dec 07 '19 at 18:02
  • You have to look for Time Series analysis tutorials. Usually it is easier just to shift forward the target column rather than shifting back all the others. Good luck! – Davide ND Dec 09 '19 at 09:10
  • That would not solve it, because the target column is not the issue. The features have to be shifted in time to get the training features ready. Example. If I use a monthly available feature like let's say the age of a person at monthly level, and that person is 33 years and 10 months old and has some other monthly attributes at this very moment, while I am predicting her still unknown IQ next month as the target, 2 feature datasets are needed. The most recent predicting features must all be shifted into the past by at least one month to get the training features with the labels of now ready. – questionto42 Dec 09 '19 at 13:07
  • It's the same thing. The only thing you care is that target and features are delayed by 1 month. And by shifting I mean shifting the vector by one position, you don't go and manually remove one month from the columns! – Davide ND Dec 09 '19 at 13:10
  • I shall keep training/testing set and predicting set simply separate, said a colleague, not in the same loop. I shall then train the classifier with training/testing set of whatever months I choose in the past. Then I shall save it and apply it in a new code to whatever more recent predicting features that I want to pick. She said that it is also not needed to train a classifier for every prediction, but having one classifier for a year or quarter trained on one big training set. I will still integrate the shift now and parametrize if you want a new classifier or not, easier now :))) – questionto42 Dec 09 '19 at 13:35
  • Aaah OK, I get you, you meant a shift of the whole dataset. Yes, right, a misunderstanding; I meant this shift here as well, and I will not change my code any more but implement the shifted training/testing set and predicting set in one go. Thank you for everything. Aaah, I see again: you mean not to shift the column names for the time-series columns, but just the array by one column in these cases... OK, got it. – questionto42 Dec 09 '19 at 13:35

The question above, "When I have an already trained and tested classifier ready, can I apply it to the same dataset that was the base of the training and testing set?", has the simple answer: no.

The question above, "Do I have to shift all of the features?", has the simple answer: yes.

In short, if I predict a month's class column: I also have to shift all of the non-class columns back in time, in addition to the previous class months that I converted to features. All of the data must have been known before the month in which the class is predicted.

This also means: the predicting set has to be different from the dataset that contains the testing set. If you kept a testing set, the training set would lose valuable, up-to-date data of the latest month(s) available. The final "predicting set" here means the "most current input, used without a testing set", to get the "most current results" for the prediction.
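As one of the comments points out, shifting every feature column back by one month is equivalent to shifting the target column forward by one month. A minimal sketch with hypothetical monthly data (`age_months`, `score`, `label` are made-up column names):

```python
import pandas as pd

# Hypothetical monthly frame: features observed each month, class known each month.
df = pd.DataFrame({
    "age_months": [400, 401, 402, 403],
    "score":      [10, 12, 11, 13],
    "label":      [0, 1, 0, 1],
}, index=pd.period_range("2019-09", periods=4, freq="M"))

# "Predict next month's label from this month's features" can be framed as
# shifting the target forward one month instead of all features back one month.
supervised = df[["age_months", "score"]].copy()
supervised["label_next"] = df["label"].shift(-1)
supervised = supervised.dropna()   # the last month has no known future label yet
```

Every row of `supervised` now pairs a month's features with the following month's label, which is exactly the "no knowledge of the future" alignment the answer demands.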

This is confirmed by the following overview, offered by a user who seems to have made the image; it uses days instead of months, but the idea is the same:

[Image: time-series train/test/validation split scheme, shown in days]

Source: Answer on "Cross Validated" - Splitting Time Series Data into Train/Test/Validation Sets; the whole Q/A is recommended (!).

See the last line of the image and the valuable comments of that answer on "Cross Validated" to understand this.

Edit 230106:

The image shows that the last step is a training on the whole dataset; this is the "predicting set", which is the newest and does not have a testing set.

On that image, there is one "mistake", which shows that this seemingly easy question of taking former labels as features for upcoming labels seems to be hard to understand. I myself did not see this at first and posted the image without this remark: the "T&V" lies in the past of the "Test". That would be a wrong validation for a model that shall predict the future; the V must be in the "future" test block (unless you have a dataset that does not change dynamically over time, as in physics).

You would have to change it to a "walk-forward" model, with the validation set - if at all - split k-fold from the testing set, not from the training set. That would look like this:

[Image: walk-forward split scheme, with the validation set taken from the test block rather than the training block]
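A walk-forward split of this kind is what scikit-learn's `TimeSeriesSplit` produces: every training window ends strictly before its test window. A minimal sketch on 24 hypothetical monthly observations:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(24).reshape(-1, 1)   # 24 hypothetical monthly observations

# Walk-forward splits: each training window ends strictly before its
# test window, so no future information leaks into training.
splits = list(TimeSeriesSplit(n_splits=4).split(X))
for train_idx, test_idx in splits:
    assert train_idx.max() < test_idx.min()
```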


questionto42