I'm trying to use knn on my dataset, which has 65,499 rows and 6 columns.
My dataset:
> dput(head(sampleknn))
structure(list(RequestorSeniority = c(1L, 2L, 2L, 4L, 1L, 4L),
ITOwner = c(50L, 15L, 15L, 22L, 22L, 38L), Severity = c(2L,
1L, 2L, 2L, 2L, 2L), Priority = c(0L, 1L, 0L, 0L, 1L, 3L),
daysOpen = c(3L, 5L, 0L, 20L, 1L, 0L), Satisfaction = structure(c(4L,
4L, 3L, 3L, 4L, 3L), .Label = c("Amazing", "Satisfied", "Unknown",
"Unsatisfied"), class = "factor")), .Names = c("RequestorSeniority",
"ITOwner", "Severity", "Priority", "daysOpen", "Satisfaction"
), row.names = c(NA, 6L), class = "data.frame")
> str(sampleknn)
'data.frame': 65499 obs. of 6 variables:
$ RequestorSeniority: int 1 2 2 4 1 4 3 4 2 3 ...
$ ITOwner : int 50 15 15 22 22 38 10 1 14 46 ...
$ Severity : int 2 1 2 2 2 2 2 2 2 2 ...
$ Priority : int 0 1 0 0 1 3 3 0 2 1 ...
$ daysOpen : int 3 5 0 20 1 0 9 15 6 1 ...
$ Satisfaction : Factor w/ 4 levels "Amazing","Satisfied",..: 4 4 3 3 4 3 3 3 4 4 ...
Now I'm trying to use knn on this dataset (code below) and it gives me the following error:
Error in knn(train = sampleknn_train, test = sampleknn_test, cl = sampleknn_test_target, : 'train' and 'class' have different lengths
Code:
library(class)  # provides knn()
sampleknn <- read.csv(file="HelpDesk.csv", header=TRUE, sep=",")
str(sampleknn)
#---scaling
normalize <- function(x) {
return((x-min(x))/(max(x)-min(x)))
}
sampleknn_n <- as.data.frame(lapply(sampleknn[ ,c(1,2,3,4,5)], normalize))
str(sampleknn_n)
#create a training dataset from sampleknn_n
sampleknn_train <- sampleknn_n[1:65000, ]
#create a test dataset
sampleknn_test <- sampleknn_n[65001:65499, ]
#isolate test and train satisfaction levels
sampleknn_train_target <- sampleknn[1:65000, 6]
sampleknn_test_target <- sampleknn[65001:65499, 6]
#-----knn model
sqrt(65499)  # rule of thumb for k: about 255.9, hence k=255 below
m1 <- knn(train=sampleknn_train, test=sampleknn_test, cl=sampleknn_test_target,k=255)
Now, when I run the last line (m1 <- ...), it gives me the error 'train' and 'class' have different lengths. I have looked at other answers about the same error, but nothing seems to work for me. What is the fix for this issue? Please let me know if you need more information.
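For reference, here is a minimal sketch of the sizes involved in that call. It uses placeholder data with the same number of rows as my real set (the `features` and `labels` objects are made up for illustration, not read from HelpDesk.csv) and the same index ranges as the code above:

```r
# Placeholder data with the same row count as the real dataset (values are made up)
n <- 65499
features <- data.frame(x = seq_len(n))
labels <- factor(rep(c("Amazing", "Satisfied", "Unknown", "Unsatisfied"),
                     length.out = n))

# Same index ranges as in the question
train        <- features[1:65000, , drop = FALSE]
test         <- features[65001:65499, , drop = FALSE]
train_target <- labels[1:65000]
test_target  <- labels[65001:65499]

# Sizes of the objects that end up in the knn() call
sizes <- c(train_rows   = nrow(train),
           train_labels = length(train_target),
           test_rows    = nrow(test),
           test_labels  = length(test_target))
print(sizes)
```

As far as I understand, knn() in the class package compares the number of rows in `train` with the length of `cl`, so these are the numbers the error message refers to.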
Edit:
Before Normalization:
RequestorSeniority ITOwner Severity Priority daysOpen Satisfaction
                 1      50        2        0        3  Unsatisfied
                 2      15        1        1        5  Unsatisfied
                 2      15        2        0        0      Unknown
                 4      22        2        0       20      Unknown
                 1      22        2        1        1  Unsatisfied
                 4      38        2        3        0      Unknown
After Normalization:
RequestorSeniority      ITOwner Severity     Priority      daysOpen
      0.0000000000 1.0000000000     0.50 0.0000000000 0.05555555556
      0.3333333333 0.2857142857     0.25 0.3333333333 0.09259259259
      0.3333333333 0.2857142857     0.50 0.0000000000 0.00000000000
      1.0000000000 0.4285714286     0.50 0.0000000000 0.37037037037
      0.0000000000 0.4285714286     0.50 0.3333333333 0.01851851852
      1.0000000000 0.7551020408     0.50 1.0000000000 0.00000000000
> dput(head(sampleknn_n))
structure(list(RequestorSeniority = c(0, 0.333333333333333, 0.333333333333333,
1, 0, 1), ITOwner = c(1, 0.285714285714286, 0.285714285714286,
0.428571428571429, 0.428571428571429, 0.755102040816326), Severity = c(0.5,
0.25, 0.5, 0.5, 0.5, 0.5), Priority = c(0, 0.333333333333333,
0, 0, 0.333333333333333, 1), daysOpen = c(0.0555555555555556,
0.0925925925925926, 0, 0.37037037037037, 0.0185185185185185,
0)), .Names = c("RequestorSeniority", "ITOwner", "Severity",
"Priority", "daysOpen"), row.names = c(NA, 6L), class = "data.frame")
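As a sanity check on the scaling step, the sketch below reproduces the daysOpen column from the tables above. The extra value 54 is an assumption: it is the column maximum implied by 3 mapping to 0.0556 with a minimum of 0, not something I read from the full data.

```r
# Same min-max scaling as in the question
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# First six daysOpen values, plus an ASSUMED column maximum of 54
days <- c(3, 5, 0, 20, 1, 0, 54)
scaled <- normalize(days)[1:6]

# Values printed in the "After Normalization" table
printed <- c(0.05555555556, 0.09259259259, 0, 0.37037037037, 0.01851851852, 0)

# Stops with an error if the scaled values do not match the printed ones
stopifnot(isTRUE(all.equal(scaled, printed, tolerance = 1e-9)))
```

If the real column has a different range the numbers would differ, so this only shows that the printed values are consistent with min-max scaling over the whole column.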