I am trying to do an assignment on data splitting (training set, validation set, and test set) to find the most suitable classifier parameter -- in this case k, since I am using k-nearest neighbors (the kknn function from the kknn package). However, when I use the code below to randomize the data split and run for loops to determine the most accurate value of k, I do not get consistent values of k each time I run the loops. The numbers are ALL OVER the place. Did I partition my data correctly? I was corrected in a previous post for not producing a minimal reproducible example (MRE), so here is my attempt at one:
#split data set into three groups, using "random" process in order to try to eliminate bias:
#currently an 80-10-10 split
#'data' in the code represents a data.frame with well over 100 data points
idx <- sample(seq(1, 3), size = nrow(data), replace = TRUE, prob = c(.8, .1, .1))
data_train <- data[idx == 1,] #training set
data_test <- data[idx == 2,] #test set
data_valid <- data[idx == 3,] #validation set
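To sanity-check that the split behaves as intended, here is a small self-contained sketch; the dummy data frame stands in for my real data, which I cannot share:

```r
# Quick check of the 80-10-10 split, using a dummy data frame
# in place of the real 'data' (assumed structure: V11 is the response).
data <- data.frame(x = rnorm(200), V11 = sample(0:1, 200, replace = TRUE))
idx <- sample(seq(1, 3), size = nrow(data), replace = TRUE, prob = c(.8, .1, .1))
data_train <- data[idx == 1, ] # training set
data_test  <- data[idx == 2, ] # test set
data_valid <- data[idx == 3, ] # validation set
prop.table(table(idx))         # proportions should be roughly 0.8 / 0.1 / 0.1
```

Every row lands in exactly one of the three sets, so the three subsets always add back up to the full data frame.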
#Here is how I initialize my list to store the accuracy values for each k:
kknn_acc_list = list()
#Here is my for loop to test validation set:
for(i in 1:100){
  # note: the formula response must match the column compared against below;
  # V11 is the response column in my data
  model_KNN <- kknn(V11 ~ ., data_train, data_valid, k = i, scale = TRUE)
  pred <- round(fitted(model_KNN)) == data_valid$V11 #predictions from the fitted function
  x <- sum(pred) / nrow(data_valid) #accuracy -- proportion of predictions returned TRUE
  kknn_acc_list[[i]] <- x
}
# validation set accuracy list:
kknn_acc_list
After applying the unlist() function to the list to get a vector, I use the which() and max() functions to determine the k value with maximum accuracy. With each run of the loop I get a wide range of different k values, each different from previous runs of the loop. When I apply the same type of loop to my test set (data_test), I encounter the same problem. Can anyone help me find a solution in order to home in on a particular consistent k value (or set of values)?
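For reference, this is roughly the selection step I described; a toy accuracy list stands in for the real kknn_acc_list produced by the loop:

```r
# Sketch of the selection step: flatten the accuracy list to a numeric
# vector, then find which k value(s) reach the maximum accuracy.
# (Toy values stand in for the real kknn_acc_list from the loop.)
kknn_acc_list <- list(0.70, 0.85, 0.85, 0.60)
acc <- unlist(kknn_acc_list)      # list -> numeric vector
best_k <- which(acc == max(acc))  # indices (k values) with max accuracy
best_k                            # can return several k values on a tie
```

Note that which() returns every tied index, so on a tie this yields a set of k values rather than a single one.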