I know how to utilize a basic train_test_split:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

However, what if I want to split my training and test sets by a variable, in this case year? I want all rows where year == 2019 to be my test set and all rows where year < 2019 to be my training set. How can I alter the code above to make that happen?

bismo

1 Answer

Let me explain with an example.
If your corpus has 1000 data points and you want a 700/300 train/test split, find the data points with year == 2019, move them to the end of the corpus, and treat them as test data, as shown below (suppose 200 data points satisfy the year == 2019 condition):

X_test, y_test = X[800:1000], y[800:1000]

Next, suppose you take 300 of the data points with year < 2019 and move them to the top of the corpus as fixed training data:

X_train, y_train = X[0:300], y[0:300]

Now, for the rest of your corpus (positions 300 through 799), redefine X and Y like this:

X = data.iloc[300:800]
Y = label.iloc[300:800]

Then use train_test_split on the new X and Y, and concatenate the resulting X_train, y_train, X_test, and y_test with the fixed training and test sets defined earlier.
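
The procedure above could be sketched roughly as follows. This is only an illustration under stated assumptions: the toy `data` and `label` objects, the `feature` and `target` column names, and the 200/800 year distribution are all made up, and boolean masks are used here in place of physically moving rows to the end of the corpus. If the goal is simply "2019 is test, everything earlier is train", the two masks alone are enough and the later steps can be skipped.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus: 1000 rows, 200 of them from 2019.
rng = np.random.default_rng(123)
years = np.array([2019] * 200 + [2016, 2017, 2018, 2018] * 200)
rng.shuffle(years)
data = pd.DataFrame({"year": years, "feature": rng.normal(size=1000)})
label = pd.Series(rng.integers(0, 2, size=1000), name="target")

# Select rows by year with boolean masks instead of reordering the corpus.
is_2019 = data["year"] == 2019
is_earlier = data["year"] < 2019

# All 2019 rows are fixed test data.
X_test_fixed, y_test_fixed = data[is_2019], label[is_2019]

# All earlier rows; on their own these already form a pure year-based training set.
X_earlier, y_earlier = data[is_earlier], label[is_earlier]

# Following the answer: fix the first 300 earlier rows as training data...
X_train_fixed, y_train_fixed = X_earlier.iloc[:300], y_earlier.iloc[:300]

# ...and split the remaining earlier rows randomly.
X_rest, y_rest = X_earlier.iloc[300:], y_earlier.iloc[300:]
X_rest_train, X_rest_test, y_rest_train, y_rest_test = train_test_split(
    X_rest, y_rest, test_size=0.2, random_state=123
)

# Join the random pieces with the fixed pieces.
X_train = pd.concat([X_train_fixed, X_rest_train])
y_train = pd.concat([y_train_fixed, y_rest_train])
X_test = pd.concat([X_test_fixed, X_rest_test])
y_test = pd.concat([y_test_fixed, y_rest_test])
```

Note that with this procedure the final test set also contains some rows from before 2019 (those drawn by train_test_split), while the training set contains no 2019 rows at all.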