
I suspect that both h2o's and caret's data partitioning functions may be leaking data somehow. The reason I suspect this is that I get two completely different sets of results when I use either h2o's h2o.splitFrame function or caret's createDataPartition function versus when I partition the data manually myself:

With my data frame of time-series data (3,000-4,000 data points) and 10-fold CV, I obtain very acceptable results on all data sets (training, validation, cross-validation, and test) when using either caret's xgboost or h2o. However, these good results (high r2/low RMSE) occur only when I use caret's createDataPartition function or h2o's h2o.splitFrame function.
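
(For reference, the caret side of the split looks roughly like this. This is only a sketch: the 80/20 ratio below is illustrative, and Target is my response column.)

#Sketch of the caret partition (illustrative split ratio)
library(caret)
set.seed(1234)
idx      <- createDataPartition(df$Target, p = 0.8, list = FALSE)  #random, stratified row indices
df_train <- df[idx, ]   #80% used for training/validation/CV
df_test  <- df[-idx, ]  #20% held out as the test set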

On the other hand, if I manually remove a portion of the data myself to create a completely separate test set data frame (while still using the appropriate partitioning function to split the remaining data into train and validation sets), then the results on the manually created test set are poor, while the training, validation, and cross-validation results remain good (high r2/low RMSE).
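
(Roughly, the manual holdout is created by slicing off the most recent rows before any partitioning function is called. Again only a sketch: the cutoff below is hypothetical, and the rows are assumed to be in chronological order.)

#Sketch of the manual, non-randomized holdout (hypothetical cutoff)
n_test     <- 750                                      #hypothetical number of most recent rows
df_model   <- df[1:(nrow(df) - n_test), ]              #older data, later split into train/valid
df_holdout <- df[(nrow(df) - n_test + 1):nrow(df), ]   #most recent data, used as the manual test set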

To establish a control, I can intentionally mess up the data (use random numbers, remove multiple features/columns, etc.), and the predictive results are then bad on the training, validation, 10-fold CV, and test sets.

These strange, inconsistent results that I see when using package-specific data partitioning functions make me wonder whether all the "amazing accuracy" reported in the media about machine learning could be partially due to data leaking when partitioning functions are used. For example, here is the h2o code I used to partition the data:

#Pre-process
write.csv(df, file = "df_h2o.csv", row.names = FALSE)
df <- h2o.importFile(path = normalizePath("df_h2o.csv"), destination_frame = "df")

## Pre-model
#Split the data into train (60%), validation (20%), and test (20%) sets
splits <- h2o.splitFrame(df, c(0.6, 0.2), seed = 1234)
train <- h2o.assign(splits[[1]], "train")
valid <- h2o.assign(splits[[2]], "valid")
test  <- h2o.assign(splits[[3]], "test")  ### <--- My test results are poor if
                                          ### I manually partition the test data
                                          ### myself (i.e. without using any
                                          ### partitioning function).
                                          ### Otherwise, results are good.

#Identify the response (target) variable and the predictor variables
y <- "Target"
x <- setdiff(colnames(train),y)

#Build model
model <- h2o.deeplearning(x = x,
                          y = y,
                          seed = 1234,
                          training_frame = train,
                          validation_frame = valid,
                          hidden = c(40, 40),
                          nfolds = 10,
                          stopping_rounds = 7,
                          epochs = 500,
                          overwrite_with_best_model = TRUE,
                          standardize = TRUE,
                          activation = "Tanh",
                          loss = "Automatic",
                          distribution = "AUTO",
                          stopping_metric = "MSE",
                          variable_importances = TRUE)

1 Answer


If you mean "data leakage", that term is commonly associated with the process of accidentally including a predictor column that is correlated (in a "bad"/cheating way) with the response column. Splitting a data frame by rows would never cause spurious data leakage.

If your manually created train, valid, and test frames give poor results, that might be because your rows are not randomized and your model has overfit to the training set. Since h2o::h2o.splitFrame() and caret::createDataPartition() use randomization in the splitting process, they produce evenly distributed train, valid, and test sets, and hence a better model. However, if your manual datasets were also randomly created, then what you're saying doesn't make a lot of sense to me.

Hopefully you are using h2o.performance(model, test) to determine the test-set accuracy (although I don't see that in your script). By default, H2O's Deep Learning will use the validation_frame for early stopping, so the validation metrics are not an honest estimate of the generalization error. Hence the need for a test set.
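
A minimal sketch of that check, assuming the model and test objects created in the question:

perf <- h2o.performance(model, newdata = test)  #score the held-out test frame explicitly
h2o.rmse(perf)  #test-set RMSE
h2o.r2(perf)    #test-set R^2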

Erin LeDell
  • +1 thank you Erin for your easy-to-understand answers. 1) Yes, I see that I was using "data leakage" incorrectly. Thank you for the clarification. I am splitting by rows, and I do not have any columns that can leak to the target variable. 2) I only create the test data set manually, without randomization: the test set consists of the most recent 3 years of data from the original data set, and I then use the appropriate partitioning function to split the remaining data into the train and validation sets. It is here, and only here, when I manually create the test set myself that I get poor results. – ogukku Apr 18 '17 at 16:05
  • 3) Yes I use h2o.performance and the results are very acceptable when using partitioning functions; I even plot the histogram of the residuals which turn out to be normally distributed. 4) But it still doesn't make sense to me b/c it seems that the model should predict equally well whether I use the randomizing data partitioning functions or a manually created holdout used as the test set (and not randomized). – ogukku Apr 18 '17 at 16:13
  • One final issue: the reason why I decided to manually create a holdout, test data set myself instead of just relying on a partitioning function is b/c I wanted to rule out some kind of "sham" that may be occurring. With both h2o and caret's xgb, the results are simply TOO good to believe! R-squared >0.9 on test data set and in all 10-fold CV along with minimal SD b/t iterations and low RMSE and normally distributed residuals ...all of it just makes me doubt that machine learning can really be this good. I need to know what to believe: is my model good or bad? – ogukku Apr 18 '17 at 16:35
  • It sounds like you have time-series data, which would explain why your manual partitioning (without randomization) would produce poor results. If the data generating distribution shifts over time, then training a model on "old" data and predicting on "new" data may not work well if you don't have any features that capture recent event data. When you randomize, you get a mix of new and old data in your training set and your test set, which is why you get a model that generalizes better to the test set. – Erin LeDell Apr 18 '17 at 20:08
  • Also, don't worry about R^2 when dealing with tree models. Just look at RMSE. – Erin LeDell Apr 18 '17 at 20:09
  • I am very grateful for your knowledgeable explanations. Thank you! I think I understand what you are saying: that my non-randomized manual test set consists only of recent data, while the model has trained on both old and new data (correct?). However, while it is likely that the distribution of my time-series data does shift over time, I actually believe that I do have good features that capture recent event data. In fact, these features rank highly in importance. But the potential inability to generalize on new data is concerning, though I understand randomization is meant to improve generalization. – ogukku Apr 18 '17 at 22:53
  • Perhaps there is a better way to partition time series data for train, validate, CV, and test sets - such as one that randomizes but weights recent data more heavily than earlier data? Also yes I only use r2 when regressing target variable ~ truth to check how the model performs on the test data set. – ogukku Apr 18 '17 at 23:06
  • There is "rolling CV", which can be used for time-series data. It guarantees that your model is trained on rows earlier in time than the test-set rows, since that's how the model would be used in practice. Explained here: https://stats.stackexchange.com/questions/268613/classification-regression-with-rolling-window-for-time-series-type-data – Erin LeDell Apr 19 '17 at 01:09
  • 1) My sincerest thanks! Exactly what I was looking for! 2) I knew a key part was the time-series data w.r.t. cross-validation, but I only knew enough to mention that my data was a time series. 3) Strangely, however, the link provided references a 2015 article offering a proof of how k-fold CV is valid to perform on time-series data - and even bests OOS testing. That of course doesn't seem to apply to my model, unfortunately. But rather than rely on "the model will generalize well with future data b/c of x and y, let's productionalize", I'd rather work to apply a solution like rolling CV. – ogukku Apr 19 '17 at 03:57
  • I'd even consider some consulting work if any were available b/c it is just me working on this model, but it's important! I do hope that somehow I can return your favor to you in the future. Thank you again! – ogukku Apr 19 '17 at 04:02
  • This response by Arno Candel seems to state that splitFrame() does NOT introduce randomization into the splitting process. https://stats.stackexchange.com/questions/168480/inverse-progression-for-training-validation-data-during-training-with-h2o – ogukku Apr 23 '17 at 13:22
  • Therefore, if randomization is NOT used by the splitFrame function, and instead it simply cuts the data set into consecutive pieces per the user's argument (i.e. c(0.6, 0.2)), as mentioned in the link above by Arno, then that would seem to suggest that the 10-fold CV in my situation should have folds that are temporally related in my time-series data. This means that when I split my original data set manually and without randomization, my test set results should be just as good as when I use the splitFrame function. So now I think I'm back to square one in understanding this mystery. Suggestions? – ogukku Apr 23 '17 at 13:32
  • My current theory: 1) I forgot to mention that even though my data is a daily time series, I have removed all dates and minimized the temporal component as much as possible by choosing features and the target carefully. 2) This domain-specific knowledge might have allowed the splitFrame randomization process to work better and even allowed the model to train on "old" data as well as "new", since I deliberately chose and engineered the features to be "timeless". 3) Since I've used all the data, the only thing left is the best OOS data: fresh, live data coming in. Thanks to all who contribute to h2o. – ogukku Apr 27 '17 at 01:39
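
For reference, here is a minimal sketch of the rolling-window ("rolling origin") splits discussed in the comments above, using caret::createTimeSlices; the window sizes are purely illustrative.

#Sketch of rolling-window CV index generation with caret (illustrative window sizes)
library(caret)
slices <- createTimeSlices(df$Target,
                           initialWindow = 1000,  #rows in each training window
                           horizon       = 100,   #rows in the test window that follows it
                           fixedWindow   = TRUE,  #slide the window instead of growing it
                           skip          = 99)    #gap between successive windows
str(slices$train[[1]])  #row indices of the first training window
str(slices$test[[1]])   #row indices of the rows immediately after it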