I'm trying to use knn on my dataset, which has 65499 rows and 6 columns.

My dataset:

    > dput(head(sampleknn))
structure(list(RequestorSeniority = c(1L, 2L, 2L, 4L, 1L, 4L), 
    ITOwner = c(50L, 15L, 15L, 22L, 22L, 38L), Severity = c(2L, 
    1L, 2L, 2L, 2L, 2L), Priority = c(0L, 1L, 0L, 0L, 1L, 3L), 
    daysOpen = c(3L, 5L, 0L, 20L, 1L, 0L), Satisfaction = structure(c(4L, 
    4L, 3L, 3L, 4L, 3L), .Label = c("Amazing", "Satisfied", "Unknown", 
    "Unsatisfied"), class = "factor")), .Names = c("RequestorSeniority", 
"ITOwner", "Severity", "Priority", "daysOpen", "Satisfaction"
), row.names = c(NA, 6L), class = "data.frame")

> str(sampleknn)
    'data.frame':   65499 obs. of  6 variables:
     $ RequestorSeniority: int  1 2 2 4 1 4 3 4 2 3 ...
     $ ITOwner           : int  50 15 15 22 22 38 10 1 14 46 ...
     $ Severity          : int  2 1 2 2 2 2 2 2 2 2 ...
     $ Priority          : int  0 1 0 0 1 3 3 0 2 1 ...
     $ daysOpen          : int  3 5 0 20 1 0 9 15 6 1 ...
     $ Satisfaction      : Factor w/ 4 levels "Amazing","Satisfied",..: 4 4 3 3 4 3 3 3 4 4 ...

Now I'm trying to use knn on this dataset (code below) and it gives me the following error:

Error in knn(train = sampleknn_train, test = sampleknn_test, cl = sampleknn_test_target, : 'train' and 'class' have different lengths

Code:

sampleknn <- read.csv(file="HelpDesk.csv",head=TRUE,sep=",")
str(sampleknn)
#---scaling
normalize <- function(x) {
  return((x-min(x))/(max(x)-min(x)))
}

sampleknn_n <- as.data.frame(lapply(sampleknn[ ,c(1,2,3,4,5)], normalize))
str(sampleknn_n)

#train the data from sampleknn_n
sampleknn_train <- sampleknn_n[1:65000, ]
#create a test dataset
sampleknn_test <- sampleknn_n[65001:65499, ]
#isolate test and train satisfaction levels
sampleknn_train_target <- sampleknn[1:65000, 6]
sampleknn_test_target <- sampleknn[65001:65499, 6]

#-----knn model
sqrt(65499)
m1 <- knn(train=sampleknn_train, test=sampleknn_test, cl=sampleknn_test_target,k=255)

When I run the last line (m1 <- ...), I get the error 'train' and 'class' have different lengths. I looked at other answers about the same error, but none of the suggestions worked for me. What is the fix for this issue? Let me know if you need more information.

Edit:

Before Normalization:

RequestorSeniority ITOwner Severity Priority daysOpen Satisfaction
                  1      50        2        0        3  Unsatisfied
                  2      15        1        1        5  Unsatisfied
                  2      15        2        0        0      Unknown
                  4      22        2        0       20      Unknown
                  1      22        2        1        1  Unsatisfied
                  4      38        2        3        0      Unknown

After Normalization:

RequestorSeniority      ITOwner Severity     Priority      daysOpen
       0.0000000000 1.0000000000     0.50 0.0000000000 0.05555555556
       0.3333333333 0.2857142857     0.25 0.3333333333 0.09259259259
       0.3333333333 0.2857142857     0.50 0.0000000000 0.00000000000
       1.0000000000 0.4285714286     0.50 0.0000000000 0.37037037037
       0.0000000000 0.4285714286     0.50 0.3333333333 0.01851851852
       1.0000000000 0.7551020408     0.50 1.0000000000 0.00000000000

> dput(head(sampleknn_n))
structure(list(RequestorSeniority = c(0, 0.333333333333333, 0.333333333333333, 
1, 0, 1), ITOwner = c(1, 0.285714285714286, 0.285714285714286, 
0.428571428571429, 0.428571428571429, 0.755102040816326), Severity = c(0.5, 
0.25, 0.5, 0.5, 0.5, 0.5), Priority = c(0, 0.333333333333333, 
0, 0, 0.333333333333333, 1), daysOpen = c(0.0555555555555556, 
0.0925925925925926, 0, 0.37037037037037, 0.0185185185185185, 
0)), .Names = c("RequestorSeniority", "ITOwner", "Severity", 
"Priority", "daysOpen"), row.names = c(NA, 6L), class = "data.frame")
demongolem
  • Can you give us a *reproducible* example? http://stackoverflow.com/help/mcve – Hack-R Oct 12 '16 at 21:52
  • @Hack-R Here is the example I'm trying to replicate (in the video, however, he uses the iris dataset): https://www.youtube.com/watch?v=DkLNb0CXw84 –  Oct 12 '16 at 21:55
  • Thanks but you need to provide your reproducible example in the question so that we can copy and paste it to reproduce your result. BTW did you look at http://stackoverflow.com/questions/16276388/knn-in-r-train-and-class-have-different-lengths?rq=1 ? – Hack-R Oct 12 '16 at 22:14
  • @Hack-R Yes, I did, but that doesn't solve it. By the way, I edited the question with the dataset I'm using. –  Oct 12 '16 at 22:24
  • Thanks, but `head()` isn't the same as providing a reproducible dataset. You should use `dput()` or a builtin data set. Hover your mouse over the R tag for more info on this. – Hack-R Oct 12 '16 at 22:33
  • @Hack-R edited the Q –  Oct 12 '16 at 22:45
  • Thanks. We are getting closer now. I tried to reproduce your problem with the data you gave, but with just those 6 lines it's not enough to make train and test. Also I just want to confirm which column is `sampleknn_test_target`? Also, I'm not sure if that data was from before or after `normalize`? When you edit it just keep in mind we should be able to copy and paste from your question and get the same error. – Hack-R Oct 12 '16 at 22:56
  • @Hack-R sampleknn_test_target is 'Satisfaction' column. Updated Q with normalised values –  Oct 12 '16 at 23:05

1 Answer

From ?knn:

cl        factor of true classifications of training set

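So cl must hold the labels of the training rows, one per row of train, not the test labels. Your call passes sampleknn_test_target (499 values) alongside a training set of 65000 rows, which is why the lengths differ. Pass sampleknn_train_target instead: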

m1 <- knn(train=sampleknn_train, test=sampleknn_test, cl=sampleknn_train_target,k=255)
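
Since HelpDesk.csv is not available here, the same pattern can be sketched on the built-in iris data. This is only an illustration under assumptions: the random 120/30 split, the seed, and k = 11 are arbitrary choices, not values from the question. It shows that cl needs one label per row of train, and that the test labels are only used afterwards to check the predictions.

```r
library(class)  # provides knn()

# Min-max scaling, same idea as normalize() in the question
normalize <- function(x) (x - min(x)) / (max(x) - min(x))

set.seed(42)                     # assumption: fixed seed for a repeatable split
idx <- sample(nrow(iris), 120)   # assumption: random 120/30 train/test split

features <- as.data.frame(lapply(iris[, 1:4], normalize))
train <- features[idx, ]
test <- features[-idx, ]
train_target <- iris$Species[idx]    # labels for the training rows
test_target <- iris$Species[-idx]    # labels for the test rows

# knn() requires length(cl) == nrow(train)
stopifnot(length(train_target) == nrow(train))

# k = 11 is an arbitrary illustrative choice
pred <- knn(train = train, test = test, cl = train_target, k = 11)

# The test labels are used only here, to evaluate the predictions
print(table(predicted = pred, actual = test_target))
```

Passing test_target as cl in this sketch triggers the same 'train' and 'class' have different lengths error, because it has 30 values while train has 120 rows.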
HubertL