I have built a naive Bayes classifier (using the bnlearn package, since I want a multinomial naive Bayes model) on a dataset containing text messages.

My training set looks like this; each message has to be classified into a particular CLASS.

CLASS  message
1      Worth reading mums;;;hope we too could
2      Musical bonding classes for a 9 month old- Yay or Nay? Should we start or wait for a few more months?
3      Girls...what plans for valentine...?.

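library(tm)       # Corpus, tm_map, DocumentTermMatrix
library(bnlearn)  # naive.bayes, predict

dataset <- read.csv("Traindataset.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE)
df <- Corpus(VectorSource(dataset$message))
# Clean the text: whitespace, case, punctuation, digits, stop words.
df1 <- tm_map(df, stripWhitespace)
# content_transformer() keeps the corpus structure intact (tm >= 0.6).
df1 <- tm_map(df1, content_transformer(tolower))
df1 <- tm_map(df1, removePunctuation)
df1 <- tm_map(df1, removeNumbers)
df1 <- tm_map(df1, removeWords, stopwords("english"))
# One row per message, one column per term, plus the class label.
dtm <- DocumentTermMatrix(df1)
dtm1 <- as.matrix(dtm)
dtm1 <- as.data.frame(cbind(dtm1, CLASS = dataset$CLASS))
# Every column, term counts included, is converted to a factor.
dtm1 <- as.data.frame(lapply(dtm1, as.factor))
bn <- naive.bayes(dtm1, "CLASS")
pred <- predict(bn, dtm1)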

Predicting on the training data works without any error. The problem arises when I apply the model bn to the unseen data tst: predict fails with an error saying that the network and the data have different numbers of variables. How can I fix this?

tst <- read.csv("TestDataset.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE)
df <- Corpus(VectorSource(tst$message))
# Same cleaning steps as for the training set.
df1 <- tm_map(df, stripWhitespace)
df1 <- tm_map(df1, content_transformer(tolower))
df1 <- tm_map(df1, removePunctuation)
df1 <- tm_map(df1, removeNumbers)
df1 <- tm_map(df1, removeWords, stopwords("english"))
# The vocabulary here is built from the test messages alone.
dtmtest <- DocumentTermMatrix(df1)
dtmtest1 <- as.matrix(dtmtest)
dtmtest1 <- as.data.frame(cbind(dtmtest1, CLASS = tst$CLASS))
dtmtest1 <- as.data.frame(lapply(dtmtest1, as.factor))

> pred = predict(bn, dtmtest1)
Error in check.bn.vs.data(x, data) : 
  the network and the data have different numbers of variables.

EDIT:

> names(bn$tables) %in% names(dtmtest1)
logical(0)
> s <- names(bn$nodes) %in% names(dtmtest1)
> length(s)
[1] 6077
> sum(names(bn$nodes) %in% names(dtmtest1))
[1] 6057
> length(bn$nodes)
[1] 6077

> length(names(dtmtest1))
[1] 12509
> dtmtest1


> dtmtest
A document-term matrix (2309 documents, 12508 terms)

Non-/sparse entries: 51826/28829146
Sparsity           : 100%
Maximal term length: 123 
Weighting          : term frequency (tf)

> dtm
A document-term matrix (872 documents, 6076 terms)

Non-/sparse entries: 17041/5281231
Sparsity           : 100%
Maximal term length: 123 
Weighting          : term frequency (tf)
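The two summaries above show different term counts (6076 vs 12508) because each document-term matrix builds its vocabulary from its own corpus. Going by the figures in the EDIT, 6077 - 6057 = 20 network nodes have no matching column in the test data, and 12509 - 6057 = 6452 test columns are unknown to the network (assuming the column names are unique). Below is a minimal base-R sketch of how such a mismatch can be listed; the messages and the `vocab` helper are hypothetical stand-ins, not the real data or the tm tokenizer.

```r
# Toy stand-ins for the training and test messages (hypothetical data).
train_msgs <- c("worth reading mums", "musical bonding classes")
test_msgs  <- c("worth reading dads", "swimming classes start")

# Hypothetical helper: split on whitespace and keep the unique terms.
vocab <- function(msgs) sort(unique(unlist(strsplit(msgs, "\\s+"))))

train_terms <- vocab(train_msgs)
test_terms  <- vocab(test_msgs)

# Terms the model was trained on that never occur in the test set.
missing_in_test <- setdiff(train_terms, test_terms)
# Terms in the test set that the model has never seen.
unseen_in_train <- setdiff(test_terms, train_terms)

print(missing_in_test)
print(unseen_in_train)
```

On the real objects the same comparison would be `setdiff(names(bn$nodes), names(dtmtest1))` and `setdiff(names(dtmtest1), names(bn$nodes))`.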
user1946217
  • What's the result of `names(bn$tables) %in% names(dtmtest1)`? – jbaums Jun 11 '14 at 07:27
  • Edited the post to include results for your query. – user1946217 Jun 11 '14 at 07:54
  • Sorry, I misread your post and thought you were using `naiveBayes` from `e1071` (you can remove that info about `bn$tables`). Anyway, I think the error message is pretty clear - `dtmtest1` does not include exactly the same set of variables as the training data did. I believe `sort(names(dtmtest1))` should match `sort(names(dtm1))`. – jbaums Jun 11 '14 at 08:17
  • Only 6057 terms match between the training and test sets, so I still do not have the same variables in my test set. How can I handle this? – user1946217 Jun 11 '14 at 08:18
  • Predicting with bnlearn requires complete data (non-missing & same variables). As an alternative you can make the predictions on the bnlearn network on available data using `gRain`. [This](http://stackoverflow.com/questions/23816994/multinomial-naive-bayes-classifier-in-r/23859814#23859814) may get you started but you will need to wrap the steps in a function to apply it over your full validation dataset. – user20650 Jun 11 '14 at 11:52
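
Following the comments above, one possible way to give the test data the same variables as the training data is to add a zero count for every training term the test set lacks, drop the terms the model has never seen, and reuse the training factor levels. This is only a sketch in base R on hypothetical miniature data frames; `align_to_training` is an illustrative helper, not a bnlearn or tm function, and it has not been checked against the real datasets.

```r
# Hypothetical miniature versions of dtm1 (training) and dtmtest1 (test),
# shown before the conversion to factors.
train <- data.frame(bonding = c(0, 1), mums = c(1, 0), CLASS = c(1, 2))
test  <- data.frame(mums = c(2, 0), dads = c(1, 1), CLASS = c(1, 2))

# Illustrative helper (not part of any package).
align_to_training <- function(test, train_cols) {
  # Add a zero count for every training term absent from the test set.
  for (term in setdiff(train_cols, names(test))) {
    test[[term]] <- 0
  }
  # Drop unseen terms and restore the training column order.
  test[, train_cols, drop = FALSE]
}

aligned <- align_to_training(test, names(train))

# Convert to factors, reusing the training levels column by column.
train_f <- as.data.frame(lapply(train, as.factor))
aligned_f <- as.data.frame(Map(
  function(col, ref) factor(col, levels = levels(ref)),
  aligned, train_f
))

print(names(aligned_f))
# Counts never observed in training (here mums = 2) turn into NA.
print(any(is.na(aligned_f)))
```

The alignment fixes the number and names of the variables, but any count value that never occurred in training becomes NA, and bnlearn's predict is said above to need complete data, so those cells still have to be handled somehow. tm's `control` list also has a `dictionary` option that may achieve the same column restriction when the test matrix is built; that is not shown here.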
