I'm trying to implement a Naive Bayes classifier on a data set which contains text data in the form of complaints from customers (Complaint) and Reddit comments (General_Text). The whole set has 250'000 Texts for each category. However, I use only 1000 Texts per category in the example postet here. I get the same result with the whole data set. I have done the text preprocessing with the "tm" package previously and it should not be an issue!
The data frame is structured as follows with 1000 entries for Complaint and General_Text:
type text
"General_Text" "random words"
"Complaint" "other random words"
For the Classification Task i split the data into a Training set on which the algorithm should learn and a test set to measure the accuracy. The naive Bayes algorithm is from the "e1071" library.
library(plyr)
library(e1071)
library(caret)
library(MLmetrics)
#Import data and rename columns into $type and $text`
General_Text<- read.csv("General_Text.csv", sep=";", head=T, stringsAsFactors = F)
Complaints<- read.csv("Complaints.csv", sep=";", head=T, stringsAsFactors = F)
Data <- rbind(General_Text, Complaints)
colnames(Data) <- c("type", "text")
# $type as factor and $text as string
Data$text <- iconv(Data$text, encoding = "UTF-8")
Data$type <- factor(Data$type)
# Split the data into training set (1400 texts) and test set (600 texts)
set.seed(1234)
trainIndex <- createDataPartition(Data$type, p = 0.7, list = FALSE, times = 1)
trainData <- Data[trainIndex,]
testData <- Data[-trainIndex,]
# Create corpus for training data
corpus<- Corpus(VectorSource(trainData$text))
# Create Document Term Matrix for training data
docs_dtm <- DocumentTermMatrix(corpus, control = list(global = c(2, Inf)))
# Remove Sparse Terms in DTM
docs_dtm_train <- removeSparseTerms(docs_dtm , 0.97)
# Convert counts into "Yes" or "No"
convert_counts <- function(x){
x <- ifelse(x > 0, 1, 0)
x <- factor(x, levels = c(0,1), labels = c("No", "Yes"))
return (x)
}
# Apply convert_counts function to the training data
docs_dtm_train <- apply(docs_dtm_train, MARGIN = 2, convert_counts)
# Create Corpus for test set
corpus_2 <- Corpus(VectorSource(testData$text))
# Create Document Term Matrix for test data
docs_dtm_2 <- DocumentTermMatrix(corpus_2, list(global = c(2, Inf)))
# Remove Sparse Terms in DTM
docs_dtm_test <- removeSparseTerms(docs_dtm_2, 0.97)
# Apply convert_ counts function to the test data
docs_dtm_test <- apply(docs_dtm_test, MARGIN = 2, convert_counts)
# Naive Bayes Classification
nb_classifier <- naiveBayes(docs_dtm_train, trainData$type)
nb_test_pred <- predict(nb_classifier, newdata = docs_dtm_test)
# Output as Confusion Matrix
ConfusionMatrix(nb_test_pred, testData$type)
I'm sorry that I cannot deliver the data and thus a reproducible example. The result which the code delivers is pretty demoralizing: It identifies all the texts as Complaints and none as General Texts.
> ConfusionMatrix(nb_test_pred, testData$type)
y_pred
y_true Complaint General_Text
Complaint 300 0
General_Text 300 0
I also get the following error message: In data.matrix(newdata) : NAs introduced by coercion
Could anyone clarify if I made any mistakes in my code or give me a heads up if someone had a similar issue?