2

I have a dataset of daily transactions where there are multiple records per day. I need to split it into different cross validation folds to train a ML model, however I can't user TimeSeriesSplit from sklearn as there are multiple transactions per day. Do you know how can I do this in python?

desertnaut
  • 57,590
  • 26
  • 140
  • 166
  • If the records are ordered by time-stamp, in principle you could still use `TimeSeriesSplit`, after ordering by time-stamp... you would have records from the same day in both training and test, but the important thing is that the records in the training set have a time-stamp lower than those in test set. If you want to avoid to have records from the same day in training and test, then: if the number of records per day is constant you can still use `TimeSeriesSplit`. If it is variable, you can transform days into tuples, pass a list of tuples to `TimeSeriesSplit`, and then convert back? – Jau A Oct 12 '22 at 21:55
  • Thanks for the reply. But then how can I control the test size? I want to implement a rolling window approach. Hence I need to keep the transactions for each day separated from each other – andyolivers Oct 14 '22 at 19:35
  • Could you please provide a small example of input data and desired output? That would help clarify your problem and come up with a solution. – Jau A Oct 14 '22 at 20:44
  • `import numpy as np import pandas as pd data = np.array([['DAY_1','afds',5], ['DAY_1','rtws', 4], ['DAY_1','gtssd', 2], ['DAY_2','ititl', 4], ['DAY_2','uius', 7], ['DAY_3','hyaah', 6], ['DAY_4','apsaj', 9]]) df = pd.DataFrame(data,columns=['DATEDAY','TRANSACTION_ID','PRICE']) df` – andyolivers Oct 18 '22 at 10:45
  • So considering that example, I need to use rows from DAY_1 and DAY_2 to train and then test with DAY_3. Then use DAY_1, DAY_2 and DAY_3 to train, testing with DAY_4. Thank you for your help @JauA – andyolivers Oct 18 '22 at 10:47
  • Thanks @andyolivers, I just posted a solution to this problem. You might want to edit your question to include the example you gave me, in order to make it more clear for other people. – Jau A Oct 18 '22 at 20:51

1 Answers1

1

Input data:

import numpy as np
import pandas as pd
data = np.array(
    [['DAY_1','afds',5],
     ['DAY_1','rtws', 4],
     ['DAY_1','gtssd', 2],
     ['DAY_2','ititl', 4],
     ['DAY_2','uius', 7],
     ['DAY_3','hyaah', 6],
     ['DAY_4','apsaj', 9]])
df = pd.DataFrame(data,columns=['DATEDAY','TRANSACTION_ID','PRICE'])

Resulting df:

        TRANSACTION_ID PRICE
DATEDAY
DAY_1             afds     5
DAY_1             rtws     4
DAY_1            gtssd     2
DAY_2            ititl     4
DAY_2             uius     7
DAY_3            hyaah     6
DAY_4            apsaj     9

Solution:

from sklearn.model_selection import TimeSeriesSplit

df = df.set_index('DATEDAY')
days = np.sort(df.index.unique())
tscv = TimeSeriesSplit(2)
for train_index, test_index in tscv.split(days):
    print ('------------------------------')
    train_days, test_days = days[train_index], days[test_index]
    X_train, X_test = df.loc[train_days], df.loc[test_days]
    print ('train:', X_train, '\n')
    print ('test:', X_test, '\n')

Output:

------------------------------
train:         TRANSACTION_ID PRICE
DATEDAY
DAY_1             afds     5
DAY_1             rtws     4
DAY_1            gtssd     2
DAY_2            ititl     4
DAY_2             uius     7

test:         TRANSACTION_ID PRICE
DATEDAY
DAY_3            hyaah     6

------------------------------
train:         TRANSACTION_ID PRICE
DATEDAY
DAY_1             afds     5
DAY_1             rtws     4
DAY_1            gtssd     2
DAY_2            ititl     4
DAY_2             uius     7
DAY_3            hyaah     6

test:         TRANSACTION_ID PRICE
DATEDAY
DAY_4            apsaj     9

Note 1: we assume that the date column can be sorted. In this example, DAY_X doesn't sort well, since DAY_11 would be placed before DAY_2, for instance. If we only know the number X of the day, then we need to put X in the column, instead of DAY_X, e.g., we might do something like:

df['DATEDAY'] = [int(x.split('_')[1]) for x in df['DATEDAY']]

Note 2: if we want to avoid having DATEDAY as index of the dataframe, we can simply reset the index for X_train and X_test:

for train_index, test_index in tscv.split(days):
    print ('------------------------------')
    train_days, test_days = days[train_index], days[test_index]
    X_train, X_test = df.loc[train_days].reset_index(), df.loc[test_days].reset_index()
    print ('train:\n', X_train, '\n')
    print ('test:\n', X_test, '\n')

Output:

------------------------------
train:
   DATEDAY TRANSACTION_ID PRICE
0   DAY_1           afds     5
1   DAY_1           rtws     4
2   DAY_1          gtssd     2
3   DAY_2          ititl     4
4   DAY_2           uius     7

test:
   DATEDAY TRANSACTION_ID PRICE
0   DAY_3          hyaah     6

------------------------------
train:
   DATEDAY TRANSACTION_ID PRICE
0   DAY_1           afds     5
1   DAY_1           rtws     4
2   DAY_1          gtssd     2
3   DAY_2          ititl     4
4   DAY_2           uius     7
5   DAY_3          hyaah     6

test:
   DATEDAY TRANSACTION_ID PRICE
0   DAY_4          apsaj     9
Jau A
  • 395
  • 1
  • 10
  • Thank you, that's exactly it! – andyolivers Oct 18 '22 at 23:01
  • Apologies from my side. I am new here and thought I had to tag it again as unresolved. Anyway, up to you and once again, thanks for your help. – andyolivers Oct 20 '22 at 22:25
  • In order to change the number of days per split, you need to change the parameter `n_splits` passed to `TimeSeriesSplit`, in the statement: `tscv = TimeSeriesSplit(2)` (see the documentation of `TimeSeriesSplit` in sklearn site). Here, I used the value 2 to obtain the output you indicated. In general, if you want to have X% of the total number of days in your test set, you pass n_splits=1/X. For example, if you want 20% of the days in the test set and 80% in the training set, then you pass n_splits=1/0.2=5, and so on. – Jau A Oct 21 '22 at 09:34
  • Thank you do much! It works perfectly, I would upvote this, but can't as I need 15 reputation points btw – andyolivers Oct 21 '22 at 10:20
  • How can this tscv be used with GridSearchCV? Unique index is needed for the search object. – devcloud Jun 27 '23 at 14:14