I'm using the dataset found here: http://archive.ics.uci.edu/ml/datasets/Qualitative_Bankruptcy

When running this code:

library(caret)

bank <- read.csv("Qualitative_Bankruptcy.data.txt", header=FALSE, na.strings = "?", 
             strip.white = TRUE)

x=bank[1:6]
y=bank[7]

bank.knn <- train(x, y, method= "knn", trControl = trainControl(method = "cv"))

I get the following error: Error: nrow(x) == n is not TRUE

The only similar example I've found is the question "Error: nrow(x) == n is not TRUE when using Train in Caret". My y is already a factor vector with two classes, and all the x features are factors as well. I've tried using as.matrix and as.data.frame on both x and y without success.

nrow(x) is equal to 250, but I'm not sure what the n is referring to in the package.

Matt Inwood

2 Answers

y is not actually a vector but a data.frame with one column: bank[7] keeps the data.frame structure instead of extracting the 7th column as a vector, so length(y) is 1 (the number of columns) rather than 250. Use bank[, 7] instead. For x the data.frame is fine, although it could just as well be created with bank[, 1:6].
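The difference between the extraction styles can be checked on a small example. This is a minimal sketch on a hypothetical three-column data frame (`toy` and its values are made up for illustration, with the class in column 3 rather than column 7); it assumes that the failing check compares nrow(x) against length(y), as the explanation above suggests.

```r
# Hypothetical toy data: two factor predictors and one class column.
toy <- data.frame(
  V1 = factor(c("P", "A", "N", "P", "A")),
  V2 = factor(c("N", "N", "A", "P", "P")),
  V7 = factor(c("NB", "B", "B", "NB", "NB"))
)

y_df   <- toy[3]    # single bracket: still a data.frame with one column
y_vec  <- toy[, 3]  # matrix-style indexing: drops to a factor vector
y_vec2 <- toy[[3]]  # double bracket: the same factor vector

x <- toy[1:2]       # keeping x as a data.frame is fine

# length() of a data.frame counts columns, so it cannot match nrow(x)
print(length(y_df))
print(length(y_vec))
print(nrow(x) == length(y_df))
print(nrow(x) == length(y_vec))
```

Only the vector forms have one entry per row, which is what a row-count comparison needs.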

Additionally, to make KNN work you will probably have to convert the x data.frame, which consists of factor variables, to numeric dummy variables.

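library(caret)

# Dummy-code the factor predictors; "- 1" drops the intercept column
x <- model.matrix(~ . - 1, data = bank[, 1:6])
# Extract the class column as a vector and make sure it is a factor
y <- factor(bank[, 7])
bank.knn <- train(x, y, method = "knn",
                  trControl = trainControl(method = "cv"))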
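To see what the dummy coding produces, here is a small sketch on a hypothetical data frame of two factors (`toy_x` is made up for illustration). It compares the design matrix with and without the intercept term.

```r
# Hypothetical toy predictors: two factors with levels A, N, P.
toy_x <- data.frame(
  V1 = factor(c("P", "A", "N", "P", "A")),
  V2 = factor(c("N", "N", "A", "P", "P"))
)

# Default formula: includes an "(Intercept)" column of ones
with_intercept <- model.matrix(~ ., data = toy_x)

# "- 1" removes the intercept from the formula
without_intercept <- model.matrix(~ . - 1, data = toy_x)

print(colnames(with_intercept))
print(colnames(without_intercept))
print(dim(without_intercept))
```

Both results are numeric matrices with one row per observation. Without the intercept, the first factor keeps an indicator column for each of its levels, while later factors still drop their reference level.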
thie1e
  • So "-1" inside model.matrix takes out the last column? And if I put "-2", does that take out the last two columns or the second-to-last column? Can I make it more dynamic by calling a variable name instead of an integer? – alwaysaskingquestions Jun 12 '16 at 17:38
  • @alwaysaskingquestions The `-1` is special formula syntax for excluding the intercept, which would be a column of only ones. You can of course substitute `[, 1:6]` with your choice of variables. There are lots of [tutorials](http://www.statmethods.net/management/subset.html) on how to subset data frames and matrices in R. – thie1e Jun 15 '16 at 20:44
I'm not a caret user, but I think you have two problems. The extraction method you used did not deliver an atomic vector but rather a list that contains a vector. If you ask for length(y) you get 1 rather than 250. The first error is easily solved by changing to this definition of y:

 y <- bank[[7]]  # extract a vector rather than a sublist

Then things get messy. The KNN method expects continuous data (the error messages you get indicate that caret's author considers it a "regression method"), and you are passing factor data, so you need to choose a classification method instead.
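Before choosing a method, it can help to check what kind of data is actually being passed. The following is a minimal base-R sketch on a hypothetical toy data frame (the names and values are made up); it only inspects column types and does not call caret.

```r
# Hypothetical toy data: one factor predictor, one numeric predictor, one class column.
toy <- data.frame(
  V1 = factor(c("P", "A", "N", "P")),
  V2 = c(0.1, 0.4, 0.3, 0.9),
  V7 = factor(c("NB", "B", "B", "NB"))
)
x <- toy[, 1:2]
y <- toy[[3]]

# Which predictors are already numeric?
predictor_is_numeric <- vapply(x, is.numeric, logical(1))
print(predictor_is_numeric)

# Is the outcome a factor, and how many classes does it have?
print(is.factor(y))
print(nlevels(y))
```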

IRTFM