I am performing k-fold cross-validation to evaluate the performance of my SVM model. Because of the nature of the data, I also want to apply feature scaling. Here is a snippet of the data:
# IMPORTING THE DATASET
dataset <- read.csv("imported dataset.csv")
# ENCODING THE DEPENDENT VARIABLE AS A FACTOR
dataset$Purchased <- factor(dataset$Purchased, levels = c(0, 1))
# DATASET
  Age EstimatedSalary Purchased
1  19           19000         0
2  35           20000         0
3  26           43000         0
4  27           57000         0
5  19           76000         0
6  27           58000         0
And here is the rest of the code:
# TRAIN TEST SPLIT
library(caTools)  # provides sample.split()
split <- sample.split(dataset$Purchased, SplitRatio = 0.75)
training_set <- subset(dataset, split == TRUE)
test_set <- subset(dataset, split == FALSE)
# K-FOLD CV WITH FEATURE SCALING
library(caret)  # provides trainControl() and train()
trCtrl <- trainControl(method = "repeatedcv",
                       number = 10,   # 10-fold CV
                       repeats = 10,  # repeated 10 times
                       savePredictions = TRUE)
model <- train(Purchased ~ .,
               data = training_set,  # was train_set, which is never defined
               method = "svmRadial",
               trControl = trCtrl,
               preProcess = c("center", "scale"))
I know that scaling the whole training set first and then running k-fold CV on it causes data leakage: the inner training and validation folds have been scaled together, so the validation folds are no longer truly unseen and the CV performance estimate becomes overly optimistic.
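I would like to know whether the preProcess argument of train() in the caret package avoids this. In other words, does it estimate the centering and scaling parameters on each inner training set only and then apply them to the corresponding validation set, rather than scaling everything together?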
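To make the behaviour I am hoping for concrete, here is a minimal base-R sketch of fold-wise scaling done by hand. It is only an illustration of the idea, not a claim about what caret does internally. The data frame `toy`, the fold count `k = 5` and the names `fold_id` and `scaled_folds` are all made up for this example and are not part of my real code.

```r
# Toy data invented for illustration (NOT the real dataset)
set.seed(42)
toy <- data.frame(
  Age             = round(runif(40, 18, 60)),
  EstimatedSalary = round(runif(40, 15000, 150000))
)

k <- 5  # hypothetical number of folds
# Randomly assign each row to one of k folds
fold_id <- sample(rep(seq_len(k), length.out = nrow(toy)))

scaled_folds <- lapply(seq_len(k), function(i) {
  inner_train <- toy[fold_id != i, ]
  validation  <- toy[fold_id == i, ]

  # Centering/scaling parameters are estimated on the inner training fold only
  mu    <- colMeans(inner_train)
  sigma <- apply(inner_train, 2, sd)

  list(
    train      = scale(inner_train, center = mu, scale = sigma),
    # The validation fold reuses the training parameters, so no information
    # from the validation rows leaks into the scaling step
    validation = scale(validation, center = mu, scale = sigma),
    center     = mu,
    scale      = sigma
  )
})
```

In this sketch each inner training fold ends up with mean 0 and standard deviation 1 per column, while the validation fold generally does not, because it is transformed with parameters it did not contribute to. That is the property I would like caret's resampling to have.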