I have a dataset of daily transactions where there are multiple records per day. I need to split it into different cross validation folds to train a ML model, however I can't user TimeSeriesSplit from sklearn as there are multiple transactions per day. Do you know how can I do this in python?
Asked
Active
Viewed 661 times
2

desertnaut
- 57,590
- 26
- 140
- 166

andyolivers
- 43
- 3
-
If the records are ordered by time-stamp, in principle you could still use `TimeSeriesSplit`, after ordering by time-stamp... you would have records from the same day in both training and test, but the important thing is that the records in the training set have a time-stamp lower than those in test set. If you want to avoid to have records from the same day in training and test, then: if the number of records per day is constant you can still use `TimeSeriesSplit`. If it is variable, you can transform days into tuples, pass a list of tuples to `TimeSeriesSplit`, and then convert back? – Jau A Oct 12 '22 at 21:55
-
Thanks for the reply. But then how can I control the test size? I want to implement a rolling window approach. Hence I need to keep the transactions for each day separated from each other – andyolivers Oct 14 '22 at 19:35
-
Could you please provide a small example of input data and desired output? That would help clarify your problem and come up with a solution. – Jau A Oct 14 '22 at 20:44
-
`import numpy as np import pandas as pd data = np.array([['DAY_1','afds',5], ['DAY_1','rtws', 4], ['DAY_1','gtssd', 2], ['DAY_2','ititl', 4], ['DAY_2','uius', 7], ['DAY_3','hyaah', 6], ['DAY_4','apsaj', 9]]) df = pd.DataFrame(data,columns=['DATEDAY','TRANSACTION_ID','PRICE']) df` – andyolivers Oct 18 '22 at 10:45
-
So considering that example, I need to use rows from DAY_1 and DAY_2 to train and then test with DAY_3. Then use DAY_1, DAY_2 and DAY_3 to train, testing with DAY_4. Thank you for your help @JauA – andyolivers Oct 18 '22 at 10:47
-
Thanks @andyolivers, I just posted a solution to this problem. You might want to edit your question to include the example you gave me, in order to make it more clear for other people. – Jau A Oct 18 '22 at 20:51
1 Answers
1
Input data:
import numpy as np
import pandas as pd
data = np.array(
[['DAY_1','afds',5],
['DAY_1','rtws', 4],
['DAY_1','gtssd', 2],
['DAY_2','ititl', 4],
['DAY_2','uius', 7],
['DAY_3','hyaah', 6],
['DAY_4','apsaj', 9]])
df = pd.DataFrame(data,columns=['DATEDAY','TRANSACTION_ID','PRICE'])
Resulting df:
TRANSACTION_ID PRICE
DATEDAY
DAY_1 afds 5
DAY_1 rtws 4
DAY_1 gtssd 2
DAY_2 ititl 4
DAY_2 uius 7
DAY_3 hyaah 6
DAY_4 apsaj 9
Solution:
from sklearn.model_selection import TimeSeriesSplit
df = df.set_index('DATEDAY')
days = np.sort(df.index.unique())
tscv = TimeSeriesSplit(2)
for train_index, test_index in tscv.split(days):
print ('------------------------------')
train_days, test_days = days[train_index], days[test_index]
X_train, X_test = df.loc[train_days], df.loc[test_days]
print ('train:', X_train, '\n')
print ('test:', X_test, '\n')
Output:
------------------------------
train: TRANSACTION_ID PRICE
DATEDAY
DAY_1 afds 5
DAY_1 rtws 4
DAY_1 gtssd 2
DAY_2 ititl 4
DAY_2 uius 7
test: TRANSACTION_ID PRICE
DATEDAY
DAY_3 hyaah 6
------------------------------
train: TRANSACTION_ID PRICE
DATEDAY
DAY_1 afds 5
DAY_1 rtws 4
DAY_1 gtssd 2
DAY_2 ititl 4
DAY_2 uius 7
DAY_3 hyaah 6
test: TRANSACTION_ID PRICE
DATEDAY
DAY_4 apsaj 9
Note 1: we assume that the date column can be sorted. In this example,
DAY_X
doesn't sort well, since DAY_11 would be placed before DAY_2, for instance. If we only know the numberX
of the day, then we need to putX
in the column, instead ofDAY_X
, e.g., we might do something like:
df['DATEDAY'] = [int(x.split('_')[1]) for x in df['DATEDAY']]
Note 2: if we want to avoid having
DATEDAY
as index of the dataframe, we can simply reset the index forX_train
andX_test
:
for train_index, test_index in tscv.split(days):
print ('------------------------------')
train_days, test_days = days[train_index], days[test_index]
X_train, X_test = df.loc[train_days].reset_index(), df.loc[test_days].reset_index()
print ('train:\n', X_train, '\n')
print ('test:\n', X_test, '\n')
Output:
------------------------------
train:
DATEDAY TRANSACTION_ID PRICE
0 DAY_1 afds 5
1 DAY_1 rtws 4
2 DAY_1 gtssd 2
3 DAY_2 ititl 4
4 DAY_2 uius 7
test:
DATEDAY TRANSACTION_ID PRICE
0 DAY_3 hyaah 6
------------------------------
train:
DATEDAY TRANSACTION_ID PRICE
0 DAY_1 afds 5
1 DAY_1 rtws 4
2 DAY_1 gtssd 2
3 DAY_2 ititl 4
4 DAY_2 uius 7
5 DAY_3 hyaah 6
test:
DATEDAY TRANSACTION_ID PRICE
0 DAY_4 apsaj 9

Jau A
- 395
- 1
- 10
-
-
Apologies from my side. I am new here and thought I had to tag it again as unresolved. Anyway, up to you and once again, thanks for your help. – andyolivers Oct 20 '22 at 22:25
-
In order to change the number of days per split, you need to change the parameter `n_splits` passed to `TimeSeriesSplit`, in the statement: `tscv = TimeSeriesSplit(2)` (see the documentation of `TimeSeriesSplit` in sklearn site). Here, I used the value 2 to obtain the output you indicated. In general, if you want to have X% of the total number of days in your test set, you pass n_splits=1/X. For example, if you want 20% of the days in the test set and 80% in the training set, then you pass n_splits=1/0.2=5, and so on. – Jau A Oct 21 '22 at 09:34
-
Thank you do much! It works perfectly, I would upvote this, but can't as I need 15 reputation points btw – andyolivers Oct 21 '22 at 10:20
-
How can this tscv be used with GridSearchCV? Unique index is needed for the search object. – devcloud Jun 27 '23 at 14:14