
These are the problem instructions I was given:

  • Build a K-NN classifier and use 5-fold cross-validation to evaluate its performance based on average accuracy.

  • Report the accuracy measure for k = 2, ..., 10.

  • Write your code below (Hint: you need a loop within a loop, with the outer loop going through each value of k and the inner loop going through each fold):

  • You can manually try k = 2, ..., 10, but try to use an outer loop through each value of k (a skeleton of this shape is sketched after this list).
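
In outline, the hint is asking for this shape (a minimal skeleton with dummy folds, illustrative only, not a full solution):

# skeleton of the nested loop the hint describes (illustrative only)
folds <- split(sample(1:150), rep(1:5, each = 30))  # 5 dummy folds of row indices
for (k in 2:10) {        # outer loop: each candidate k
  for (fold in folds) {  # inner loop: each of the 5 folds
    # fit K-NN with this k on the rows outside `fold`,
    # predict the rows inside `fold`, and record that fold's accuracy
  }
  # average the 5 fold accuracies for this value of k
}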

I was given two for loops, one for creating folds and the other for looping over k = 1:10, both listed below.

# Given data
library(datasets)
library(dplyr)
library(caret)  # createFolds(), confusionMatrix()
library(rpart)  # decision tree used in the fold loop below
data(iris)
IrisData = iris  # note: in iris the label column is named Species

# min-max normalization to the [0, 1] range
normalize = function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}
Iris_normalized = IrisData %>% mutate_at(1:4, normalize)
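
As an optional sanity check (my addition, not part of the assignment), each normalized predictor should now span 0 to 1:

# optional check: every normalized column should range from 0 to 1
summary(Iris_normalized[, 1:4])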


# Create folds (createFolds() comes from caret)
cv = createFolds(y = IrisData$Species, k = 5)
accuracy = c()
for (test_rows in cv) {

   IrisData_train = IrisData[-test_rows, ]
   IrisData_test = IrisData[test_rows, ]

   # example model: a decision tree fit on the training folds
   tree = rpart(Species ~ ., data = IrisData_train,
                method = "class", parms = list(split = "information"))
   pred_tree = predict(tree, IrisData_test, type = "class")
   cm = confusionMatrix(pred_tree, IrisData_test[, 5])

   accuracy = c(accuracy, cm$overall[1])
}

print(mean(accuracy))
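
Note that createFolds() draws the folds at random, so the result changes from run to run. For reproducibility you can fix the seed before sampling (a small sketch; the seed value 123 is arbitrary):

# fix the RNG seed so the same folds are drawn on every run (seed is arbitrary)
set.seed(123)
cv = createFolds(y = IrisData$Species, k = 5)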


# Manual K validation (from a separate clustering exercise;
# utility_normalized is a normalized data set not defined here)
SSE_curve <- c()
for (k in 1:10) {
   print(k)
   kcluster = kmeans(utility_normalized, centers = k)
   sse = kcluster$tot.withinss  # total within-cluster sum of squares
   print(sse)
   SSE_curve[k] = sse
}
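
The usual next step for that snippet is to plot the SSE curve and look for an "elbow" (a minimal sketch, assuming utility_normalized and the SSE_curve above are defined):

# elbow plot: total within-cluster SSE versus number of clusters
plot(1:10, SSE_curve, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster SSE")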

So, if I am understanding the instructions correctly, I need to:

  • Create 5 folds using normalized data with a for loop and set.seed.
  • Use a for loop to find the accuracy for k = 1:10 within each fold.

I am not sure how these two for loops combine to produce the result described in the instructions.

1 Answer

I imagine the code you provide is just an example, and this question sounds a lot like a student homework problem; you should at least show your effort so far. However, here are two possible solutions.

1) Two nested for loops:

library(class)  # knn()
library(caret)  # createFolds(), confusionMatrix()
library(dplyr)

data("iris")
normalize = function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}
# normalize the four numeric predictors
Iris_normalized = iris %>% mutate_at(1:4, normalize)
av_accuracy <- as.list(2:10)  # one slot per value of k

for (k in 2:10) {  # outer loop: each candidate k
  set.seed(4)      # reset the seed so every k sees the same folds
  cv <- createFolds(y = Iris_normalized$Species, k = 5)
  accuracy <- c()

  for (i in cv) {  # inner loop: each fold
    IrisData_train = Iris_normalized[-i, ]
    IrisData_test = Iris_normalized[i, ]

    # knn() fits and predicts in one call; it returns the predicted classes
    pred <- knn(IrisData_train[, -5], IrisData_test[, -5],
                cl = IrisData_train$Species, k = k)
    cm <- confusionMatrix(pred, IrisData_test[, 5])

    accuracy <- c(accuracy, cm$overall[1])
  }
  av_accuracy[[k - 1]] <- mean(accuracy)
}
results <- data.frame(k = 2:10, mean.accuracy = unlist(av_accuracy))
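
From results you can then read off the best-performing k, for example:

results                                      # mean accuracy for each k
results[which.max(results$mean.accuracy), ]  # the k with the highest mean accuracy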
2) Using the caret framework, which is built exactly for this kind of task:

    library(caret)  # train(), trainControl()

    control <- trainControl(method = "cv", number = 5, returnResamp = "all")
    grid <- expand.grid(k = 2:10)
    fit <- train(
      Species ~ .,
      data = Iris_normalized,
      trControl = control,
      tuneGrid = grid,
      method = "knn"
    )
    fit$results
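
caret also records the winning tuning value and can plot the cross-validated accuracy against k:

    fit$bestTune  # the value of k caret selected
    plot(fit)     # accuracy as a function of k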
    
Elia
  • Thanks for the response. It is indeed a homework assignment. I had the caret framework done to check my answer; I just couldn't wrap my mind around how to apply k = 2:10 in the for loops, which you did very nicely in your results variable. I didn't think about doing it that way, and I was not getting much help from the professor. Thank you for your help. – FannyPackFanatic Sep 26 '21 at 19:50
  • This is how I was approaching it:

        for (test_rows in cv) {
          normalized_train = wine_normalized[,]
          normalized_test = wine_normalized[,]
          i = 1       # declaration to initiate for loop
          k.optm = 1  # declaration to initiate for loop
          for (i in 1:10)
            knn.mod <- knn(train = normalized_train[, 2:14, drop = FALSE],
                           test = normalized_test[, 2:14, drop = FALSE],
                           cl = wine_normalized$Type, k = i)
          k.optm[i] <- 100 * sum(normalized_test == knn.mod) / NROW(normalized_test)
          k = i
          cat(k, '=', k.optm[i], '\n')
        }
        dim(test_rows)

    – FannyPackFanatic Sep 26 '21 at 19:51