
Best Way to Perform TimeSeries Cross Validation with irregular Dates of Observations and uneven Observations per Date?

I have a dataset that I have been trying to utilize for XGBoost Regression. The problem I am encountering is how best to apply TimeSeries Cross Validation (or Group Time Series Cross Validation) for my train and test sets.

My dataset includes the target variable, the date of observation, and the feature values for that date. Each date has 5 target observations on average; some dates have 4 or 10, but most have exactly 5.

I found this question/answer, which I think could work for my use case. However, it would require me to trim the target observations on days that have more than 4, so that each date has exactly 4 target observations.

Split time series with multiple records per day

Is there an appropriate method for deciding which observations to remove, so that every observation date has exactly 4 observations? Or, if possible, is there a way to keep all observations and perform GroupTimeSeries Cross Validation on the entire dataset?

I can't do a random split, so I split my dataset into train/test based on a specific date index (70/30 split).

This is an example of my dataframe:

                     Target      Feature      dayofmonth  weekofyear
Obs. Date                                      ...                        
2008-06-16                 140.2           25  ...          16          
2008-06-16                 140.7           25  ...          16          
2008-06-16                 139.0           25  ...          16          
2008-06-16                 144.5           25  ...          16          

2018-09-04                  64.9           36  ...           4          
2018-09-04                  72.9           36  ...           4          
2018-09-04                  75.6           36  ...           4          
2018-09-04                  71.6           36  ...           4          
2018-09-04                  74.9           36  ...           4          

[618 rows x 46 columns]
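
For reference, the date-based 70/30 split described above could look roughly like the sketch below. It is a minimal illustration, not the exact code used for the real dataset: the helper name `split_by_date`, the `train_frac` parameter, and the toy frame are all assumptions. The point is that the cutoff is taken over unique dates, so days with 4, 5, or 10 rows are never divided between train and test.

```python
import pandas as pd

def split_by_date(df, train_frac=0.7):
    """Hypothetical helper: split rows so each date is entirely in train or test."""
    # Unique observation dates in chronological order.
    dates = df.index.unique().sort_values()
    # First date that belongs to the test set.
    cutoff = dates[int(len(dates) * train_frac)]
    return df[df.index < cutoff], df[df.index >= cutoff]

# Toy frame (made-up values): 10 dates with 4, 5 or 10 rows each.
counts_per_day = [4, 5, 5, 10, 5, 4, 5, 5, 10, 5]
days = pd.date_range("2008-06-16", periods=len(counts_per_day), freq="7D")
idx = days.repeat(counts_per_day).rename("Obs. Date")
df = pd.DataFrame({"Target": range(len(idx))}, index=idx)

train, test = split_by_date(df)
print(len(train), len(test))
```

Because the split fraction applies to dates rather than rows, the row-level ratio will only be approximately 70/30 when some days have more observations than others.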
desertnaut

1 Answer


If you want to remove extra observations, you could group them by date and either keep the four observations closest to that day's mean, or keep the four most "representative" observations (the 0th, 33rd, 66th, and 100th percentile observations of the series sorted by observation time).
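
A minimal sketch of the "closest to the day mean" option, assuming the dates are in the index and the target column is named `Target` as in the question. The function name `keep_closest_to_daily_mean` and the parameter `k` are made up for illustration; the sample values are the two dates shown in the question.

```python
import pandas as pd

def keep_closest_to_daily_mean(df, target="Target", k=4):
    # Distance of every row from the mean target of its own date.
    daily_mean = df.groupby(level=0)[target].transform("mean")
    dist = (df[target] - daily_mean).abs()
    # Rank rows inside each date by distance (1 = closest); ties keep row order.
    rank = dist.groupby(level=0).rank(method="first")
    return df[(rank <= k).to_numpy()]

idx = pd.DatetimeIndex(["2008-06-16"] * 4 + ["2018-09-04"] * 5, name="Obs. Date")
df = pd.DataFrame(
    {"Target": [140.2, 140.7, 139.0, 144.5, 64.9, 72.9, 75.6, 71.6, 74.9]},
    index=idx,
)

trimmed = keep_closest_to_daily_mean(df)
print(trimmed.groupby(level=0).size())
```

Days that already have four or fewer rows pass through unchanged. Note that this rule always discards the values furthest from the mean, so it reduces the within-day spread of the target.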

If you don't want to remove observations, you could try an interpolation method to "fill in the blanks".

However, it seems best to combine both methods: analyze your time series and take the median number of observations per day. For days with fewer observations than the median, apply interpolation; for days with more observations than the median, remove the extra ones.
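
The first step of that combined approach, finding the median count and the days that deviate from it, might be sketched as follows. The toy frame and the names `target_n`, `short_days`, and `long_days` are assumptions for illustration; how to interpolate the short days is left open here.

```python
import pandas as pd

# Toy frame (made-up values): five dates with 4, 5, 5, 10 and 5 rows.
counts_per_day = [4, 5, 5, 10, 5]
days = pd.date_range("2008-06-16", periods=len(counts_per_day), freq="7D")
idx = days.repeat(counts_per_day).rename("Obs. Date")
df = pd.DataFrame({"Target": range(len(idx))}, index=idx)

# Number of observations recorded on each date.
counts = df.groupby(level=0).size()
# Median observations per day: the size every date would be brought to.
target_n = int(counts.median())

short_days = counts[counts < target_n].index  # candidates for interpolation
long_days = counts[counts > target_n].index   # candidates for trimming
print(target_n, len(short_days), len(long_days))
```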

K0mp0t