Hello, I am trying to classify text. Here is the code:

library(tm)     # Corpus, VectorSource, DocumentTermMatrix, weightTfIdf
library(class)  # knn

df <- read.csv("D:/AS/tokpedprepro.csv")

#sampling
set.seed(123)
df <- df[sample(nrow(df)),]
df <- df[sample(nrow(df)),]

#Convert to corpus
dfCorpus <- Corpus(VectorSource(df$text))
inspect(dfCorpus[1:20])

#convert DTM
dtm <- DocumentTermMatrix(dfCorpus)
inspect(dtm[1:4, 3:7])

#Data Partition
df.train <- df[1:20,]
df.test <- df[21:37,]

dtm.train <- dtm[1:20,]
dtm.test <- dtm[21:37,]

df.Corpus.train <- dfCorpus[1:20]
df.corpus.test <- dfCorpus[21:37]

train.class <- df$data.class

#TFIDF
dtm.train.knn <- DocumentTermMatrix(df.Corpus.train, control = list(weighting = 
function(x) weightTfIdf(x, normalize = FALSE)))
dim(dtm.train.knn)

The dimension is

[1]  20 194

dtm.test.knn <- DocumentTermMatrix(df.corpus.test, control = list(weighting = 
function(x) weightTfIdf(x, normalize = FALSE)))
dim(dtm.test.knn)

The dimension is

[1]  17 211

Then

knn.pred <- knn(dtm.train.knn, dtm.test.knn, train.class, k=1 )

But I get the error: 'train' and 'class' have different lengths

What should I do? Thanks.

dikfaj
  • Please create a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). – phiver Jan 12 '20 at 09:08

1 Answer

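Your train.class is defined as train.class <- df$data.class, which holds a label for every row of df, but dtm.train.knn is built from dfCorpus[1:20], which has only 20 documents. knn() needs exactly one class label per training row, so you need to subset train.class to match, probably as train.class <- df$data.class[1:20].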
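
As a rough illustration of the length requirement, here is a minimal sketch using only the class package. The feature matrix and labels are made-up stand-ins (the names features, labels, train.idx and test.idx are hypothetical and not from the question); the point is only that the labels are subset with the same index as the training rows.

```r
library(class)  # provides knn()

# Toy stand-ins for the real data (hypothetical values, not from the question)
set.seed(123)
n <- 37
features <- matrix(runif(n * 5), nrow = n)   # 37 "documents", 5 term weights
labels   <- factor(sample(c("pos", "neg"), n, replace = TRUE))

# Same 20 / 17 split as in the question
train.idx <- 1:20
test.idx  <- 21:37

train.x <- features[train.idx, ]
test.x  <- features[test.idx, ]

# Subset the labels with the same index used for the training rows
train.class <- labels[train.idx]
stopifnot(length(train.class) == nrow(train.x))

# One predicted class per test row
knn.pred <- knn(train.x, test.x, train.class, k = 1)
length(knn.pred)
```

Note that knn() also expects train and test to have the same number of columns. The two matrices in the question have 194 and 211 terms, so once the label length is fixed, the test matrix will probably need to be built over the training vocabulary as well, for example by passing the training terms through the dictionary control option of DocumentTermMatrix.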

jyr