I suspect that both h2o's and caret's data partitioning functions may be leaking data somehow. The reason I suspect this is that I get two completely different sets of results when I use either h2o's h2o.splitFrame function or caret's createDataPartition function versus when I partition the data manually:
My data frame holds time-series data with 3000-4000 data points. Using 10-fold CV, I obtain very acceptable results on all data sets - training, validation, cross-validation, and test - with either caret's xgboost or h2o. However, these good results (high R2/low RMSE) occur only when I use caret's createDataPartition function or h2o's h2o.splitFrame function.
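(For caret, the partitioning I mean is of this form - a minimal sketch with illustrative names, assuming df is my original data frame and Target its target column:)
library(caret)
set.seed(1234)
in_train <- createDataPartition(df$Target, p = 0.8, list = FALSE)
df_train <- df[in_train, ]   # caret samples these rows (stratified by outcome)
df_test  <- df[-in_train, ]  # the remainder becomes the test set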
On the other hand, if I manually remove a portion of the data to create a completely separate test-set data frame (while still using a partitioning function to split the remaining data into train and validation sets), then the results on that manually created test set are poor, while the training, validation, and cross-validation results remain good (high R2/low RMSE).
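To make "manually remove a portion" concrete, here is a minimal sketch of that kind of hold-out (assuming df is the original R data frame in time order; the 80/20 cut is illustrative, not my exact code):
n <- nrow(df)
test_idx <- (floor(0.8 * n) + 1):n    # hold back the final 20% of rows
df_test_manual <- df[test_idx, ]      # completely separate test set
df_rest <- df[-test_idx, ]            # still split into train/valid with a
                                      # partitioning function afterwards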
To establish a control, I can intentionally corrupt the data - substitute random numbers, remove multiple features/columns, etc. - and then the predictive results are bad in the training, validation, 10-fold CV, and test sets alike.
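One way to build such a control - a sketch, not my exact code - is to permute the target so that any genuine signal is destroyed:
df_control <- df
set.seed(42)
df_control$Target <- sample(df_control$Target)  # shuffle the target values
# Refitting the same model on df_control should drive R2 toward 0 and RMSE up
# on every set, which is the behavior a working control should show.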
These strange, inconsistent results that I see when using package-specific data partitioning functions make me wonder: could some of the "amazing accuracy" reported in the media about machine learning be partially due to data leaking through partitioning functions? For reference, here is the h2o code I used to partition the data:
# Pre-process: round-trip the data frame through CSV into h2o
library(h2o)
h2o.init()
write.csv(df, file = "df_h2o.csv", row.names = FALSE)  # write.csv returns NULL, so no assignment
df <- h2o.importFile(path = normalizePath("df_h2o.csv"),
                     destination_frame = "df")
## Pre-model
# Splitting the data: 60% train, 20% validation, 20% test
splits <- h2o.splitFrame(df, ratios = c(0.6, 0.2), seed = 1234)
train <- h2o.assign(splits[[1]], "train")
valid <- h2o.assign(splits[[2]], "valid")
test  <- h2o.assign(splits[[3]], "test")  ### <--- Results on this test set are
                                          ### good. If I instead carve the test
                                          ### set out manually (without any
                                          ### partitioning function), the test
                                          ### results are poor.
# Identify the target (dependent) variable and the predictor (independent) variables
y <- "Target"
x <- setdiff(colnames(train), y)
# Build model
model <- h2o.deeplearning(x = x,
                          y = y,
                          seed = 1234,
                          training_frame = train,
                          validation_frame = valid,
                          hidden = c(40, 40),
                          nfolds = 10,
                          stopping_rounds = 7,
                          epochs = 500,
                          overwrite_with_best_model = TRUE,
                          standardize = TRUE,
                          activation = "Tanh",
                          loss = "Automatic",
                          distribution = "AUTO",
                          stopping_metric = "MSE",
                          variable_importances = TRUE)
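For reference, this is how the test metrics can be read in both scenarios (h2o.performance, h2o.rmse, and h2o.r2 are standard h2o accessors; in the manual-split runs the newdata frame would be as.h2o(df_test_manual) from the sketch above):
perf <- h2o.performance(model, newdata = test)  # or newdata = as.h2o(df_test_manual)
h2o.rmse(perf)  # low with h2o.splitFrame's test set, high with the manual one
h2o.r2(perf)    # high with h2o.splitFrame's test set, low with the manual one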