
I am trying to do an assignment on data splitting (training set, validation set, and test set) to find the most suitable classifier parameter -- in this case k, since I am using k-nearest neighbors (the kknn function, from the kknn package). However, when I use the code below to randomize the data-splitting process and run for loops to determine the most accurate value of k, I am not getting consistent values of k each time I run the loops. The numbers are ALL OVER the place. Did I partition my data correctly? I was corrected in a previous post for not producing a minimal reproducible example (MRE), so here is my attempt at one:


#split data set into three groups, using "random" process in order to try to eliminate bias:
#currently an 80-10-10 split
#'data' in the code represents a data.frame with well over 100 data points

idx <- sample(seq(1, 3), size = nrow(data), replace = TRUE, prob = c(.8, .1, .1))
data_train <- data[idx == 1,] #training set
data_test <- data[idx == 2,] #test set
data_valid <- data[idx == 3,] #validation set

#Here is how I initialize my list to store the accuracy values for each k:

kknn_acc_list <- list()

#Here is my for loop to test validation set:

for(i in 1:100){
  model_KNN <- kknn(V5 ~ ., data_train, data_valid, k = i, scale = TRUE)
  pred <- round(fitted(model_KNN)) == data_valid$V11  # predictions from fitted(), compared to actuals
  x <- sum(pred) / nrow(data_valid)  # accuracy measurement -- proportion of predictions returned TRUE

  kknn_acc_list[[i]] <- x
}

# validation set accuracy list:

kknn_acc_list


After applying the unlist() function to the list to get a vector, I use the which() and max() functions to determine the k value with maximum accuracy. With each run of the loop I get a wide range of different k values, each different from previous runs. When I apply the same type of loop to my test set (data_test), I encounter the same problem. Can anyone help me find a solution in order to home in on a particular consistent k value, or set of values?
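For reference, the unlist()/which()/max() step described above can be sketched like this (assuming kknn_acc_list has been filled by the loop):

```r
# Collapse the list of per-k accuracies into a numeric vector
acc <- unlist(kknn_acc_list)

# which.max() combines which() and max(): it returns the index
# (here, the value of k) of the first maximum accuracy
best_k <- which.max(acc)
best_k
```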

Deemy

1 Answer


You need to set a seed so that the 'random selection' starts in the same place each time, and then do the same computations inside the loop. It is pretty simple: right before the splitting, call set.seed(42) -- you can use any number you want in there.

That should keep your data consistent throughout multiple runs of your code!
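As a minimal sketch (the seed value 42 is arbitrary), the splitting code from the question becomes:

```r
set.seed(42)  # fix the RNG state so sample() produces the same split on every run
idx <- sample(seq(1, 3), size = nrow(data), replace = TRUE, prob = c(.8, .1, .1))
data_train <- data[idx == 1,]  # training set
data_test  <- data[idx == 2,]  # test set
data_valid <- data[idx == 3,]  # validation set
```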

But also, you are using that equation incorrectly. Accuracy is not the number predicted to be one out of the total -- it is the number of observations correctly predicted to be 1, plus the number correctly predicted to be 0, over the total number of observations:

(true positives + true negatives) / all observations

You might find that some of the problem is simply using the wrong metric. However, you should get used to using seeds now; they are a required component of reproducible work!
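In R terms, with a vector of predicted classes and a vector of actual classes, that formula is just the proportion of matches. A sketch with made-up example vectors:

```r
predicted <- c(1, 0, 1, 1, 0, 0)
actual    <- c(1, 0, 0, 1, 0, 1)

# (true positives + true negatives) / all observations
tp <- sum(predicted == 1 & actual == 1)
tn <- sum(predicted == 0 & actual == 0)
accuracy <- (tp + tn) / length(actual)

# equivalently, the mean of the element-wise logical comparison
all.equal(accuracy, mean(predicted == actual))  # TRUE
```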

sconfluentus
  • A bit confused: the output for the following equation: pred <- round(fitted(model_KNN)) == data_valid$V11 is [1] FALSE TRUE TRUE FALSE TRUE TRUE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE [20] TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE [39] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE [58] TRUE TRUE TRUE --- the sum of the TRUE's divided by total observations is sum(pred) / nrow(data_valid) -- What is wrong??? – Deemy Jan 22 '20 at 21:12
  • sorry, that `== data_valid$V11` does not scroll on my screen, if you had that, you are good! – sconfluentus Jan 22 '20 at 23:09
  • and also, you might consider being more explicit with naming. Typically `pred` is used for the actual predicted values, 0 or 1; I would change that variable to `correct` or something, then pass it into the accuracy variable over the `nrow()` so that it is clearer! – sconfluentus Jan 22 '20 at 23:12