I have a data set spanning several months (from JAN-15 to SEP-17) that reports each customer's financial situation for each month. My task is to predict the cumulative sales for each customer over the next 12 months.

My dataset looks like this (this is the raw data; for training I will create lagged features):

Month   CustomerID NetSales
JAN-15     A          10
JAN-15     B          10
JAN-15     C          10
FEB-15     A          10
FEB-15     B          10
FEB-15     C          10
...
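To make the lagged features and the 12-month target concrete, here is a minimal sketch for a single customer, in plain Python. The helper `make_rows`, the choice of three lags, and the convention that lags use only months before the current one are all assumptions for illustration, not part of the original setup.

```python
def make_rows(sales, n_lags=3, horizon=12):
    """Build (month_index, lag_features, target) tuples for one customer.

    sales: monthly net sales in chronological order (index 0 = first month).
    lag_features: sales of the n_lags months before month_index (assumption).
    target: cumulative sales over the `horizon` months after month_index.
    """
    rows = []
    # Stop early enough that a full `horizon` of future months exists.
    for i in range(n_lags, len(sales) - horizon):
        lags = [sales[i - k] for k in range(1, n_lags + 1)]
        target = sum(sales[i + 1 : i + 1 + horizon])
        rows.append((i, lags, target))
    return rows

# 33 months of toy data, JAN-15 (index 0) to SEP-17 (index 32).
sales = [10] * 33
rows = make_rows(sales)
print(rows[0])       # first usable month
print(rows[-1][0])   # index of the last month with a complete target
```

With 33 months of data, the last month that still has 12 full months after it is index 20, which corresponds to SEP-16.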

How can I split it into TRAIN / VAL / TEST sets in a way that is consistent with time? Can I do something like this?

  • TRAIN --> all customers / months from JAN-15 to MAR-16 (I take each calendar month at least once so the model can learn seasonal patterns)
  • VAL --> all customers / months from APR-16 to JUN-16
  • TEST --> all customers / months from JUL-16 to SEP-16 (I stop here because I need the following 12 months to create the target variable)
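The split above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the helpers `parse_month` and `split_by_time` are hypothetical names, the rows are toy data in the same shape as the table, and month labels are assumed to look like `JAN-15`.

```python
from datetime import date

MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]

def parse_month(label):
    """Turn a label such as 'JAN-15' into the first day of that month."""
    name, year = label.split("-")
    return date(2000 + int(year), MONTHS.index(name) + 1, 1)

def split_by_time(rows, train_end, val_end, test_end):
    """Assign each (month, customer, sales) row to a set by its month.

    Each boundary label is the last month included in that set.
    """
    t_end, v_end, s_end = (parse_month(m) for m in (train_end, val_end, test_end))
    train, val, test = [], [], []
    for row in rows:
        month = parse_month(row[0])
        if month <= t_end:
            train.append(row)
        elif month <= v_end:
            val.append(row)
        elif month <= s_end:
            test.append(row)
        # Later months are not assigned: they only serve to build targets.
    return train, val, test

# Toy rows: 33 months (JAN-15 to SEP-17) for customers A, B, C.
rows = [(f"{MONTHS[i % 12]}-{15 + i // 12}", c, 10)
        for i in range(33) for c in "ABC"]

train, val, test = split_by_time(rows, "MAR-16", "JUN-16", "SEP-16")
print(len(train), len(val), len(test))
```

Splitting on the month alone keeps all customers of a given month in the same set, so no set contains months earlier than the set before it.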

Is this a consistent split strategy? Alternatively, what would you advise?

Thanks a lot, Andrea

  • Hello @andrea-barral, I do not have very much experience, but one old Kaggle task has, IMHO, a very good strategy for splitting data: `You are provided hourly rental data spanning two years. For this competition, the training set is comprised of the first 19 days of each month, while the test set is the 20th to the end of the month. You must predict the total count of bikes rented during each hour covered by the test set, using only information available prior to the rental period.` – 404pio Oct 01 '19 at 07:39

1 Answer

Is this a consistent split strategy?

Yes. You respect the time order: the validation data never precedes the training data, and the test data never precedes the validation data. This prevents data leakage, so it is the right way to do it.

Alternatively, what would you advise?

The only thing you could change is the proportions of your train, validation, and test sets, which you can experiment with. Because this is a time series, you should also make sure that all seasonal patterns are covered in your training data.
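One way to check seasonal coverage while trying different proportions is to list which calendar months never occur in a candidate training period. The sketch below is an assumption-laden illustration: `month_range` and `missing_calendar_months` are hypothetical helpers, and the candidate end points are arbitrary examples.

```python
def month_range(start, end):
    """List (year, month) pairs from start to end, both inclusive."""
    year, month = start
    out = []
    while (year, month) <= end:
        out.append((year, month))
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return out

def missing_calendar_months(train_months):
    """Calendar months (1-12) that never occur in the training period."""
    seen = {month for _, month in train_months}
    return sorted(set(range(1, 13)) - seen)

# Compare a few candidate end points for a training period starting JAN-15.
for train_end in [(2015, 9), (2015, 12), (2016, 3)]:
    train_months = month_range((2015, 1), train_end)
    print(train_end, len(train_months), missing_calendar_months(train_months))
```

An empty list means every calendar month appears at least once in training; a training period ending before December 2015 would leave some months unseen.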

PV8