
I want to predict the class of a new document using historical data consisting of a text "Description" and a "Class" label.

I am using the script below, but for the new documents I want to classify I am not getting good accuracy. Can anyone advise which algorithm, or which changes, could increase accuracy?

library(plyr)
library(tm)
library(e1071)

setwd("C:/Data")

past <- read.csv("Past - Copy.csv",header=T,na.strings=c(""))
future <- read.csv("Future - Copy.csv",header=T,na.strings=c(""))

training <- rbind.fill(past,future)

Res_Desc_Train <- subset(training,select=c("Class","Description"))

##Step 1: Create a document-term matrix of the ticket descriptions from the past data

docs <- Corpus(VectorSource(Res_Desc_Train$Description))
docs <- tm_map(docs, content_transformer(tolower))

#remove potentially problematic symbols
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))  #maps a pattern to a space, e.g. tm_map(docs, toSpace, "/")
removeSpecialChars <- function(x) gsub("[^a-zA-Z0-9 ]", "", x)
docs <- tm_map(docs, content_transformer(removeSpecialChars))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, removeWords, stopwords('english'))


#inspect(docs[440])
#flatten the cleaned corpus back into a data frame (useful for inspection)
dataframe <- data.frame(text = unlist(sapply(docs, `[`, "content")), stringsAsFactors = FALSE)

dtm <- DocumentTermMatrix(docs, control=list(stopwords=FALSE, wordLengths=c(2,Inf)))

##Remove terms that are 95% or more sparse (i.e. keep terms appearing in at least ~5% of documents).
dtm <- removeSparseTerms(dtm,sparse = 0.95)
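
As a quick sanity check (an inspection aid, not part of the original pipeline), you can compare dimensions before and after the sparsity filter; dtm_unfiltered is just an illustrative name:

dtm_unfiltered <- DocumentTermMatrix(docs, control=list(stopwords=FALSE, wordLengths=c(2,Inf)))
dim(dtm_unfiltered)  #documents x all terms
dim(dtm)             #documents x terms kept by removeSparseTerms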

Weighteddtm <- weightTfIdf(dtm,normalize=TRUE)
mat.df <- as.data.frame(data.matrix(Weighteddtm), stringsAsFactors = FALSE)
mat.df <- cbind(mat.df, Res_Desc_Train$Class)
colnames(mat.df)[ncol(mat.df)] <- "Class"
Assignment.Distribution <- table(mat.df$Class)   #class distribution of the training labels

Res_Desc_Train_Assign <- mat.df$Class

### Features have different ranges; min-max normalization rescales each to [0, 1].
### An alternative is to standardize using z-scores (see the sketch after the function below).

normalize <- function(x) {
  #min-max scaling; note: a constant column (max == min) yields NaN
  (x - min(x)) / (max(x) - min(x))
}
#normalize(c(1,2,3,4,5))
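
A minimal sketch of the z-score alternative mentioned above, using base R's scale(); z_normalize is an illustrative helper, not part of the original script:

#z-score standardization: (x - mean(x)) / sd(x)
z_normalize <- function(x) as.numeric(scale(x))
#mat.df_z <- as.data.frame(lapply(mat.df[, 1:num_col], z_normalize))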

num_col <- ncol(mat.df)-1
mat.df_normalize <- as.data.frame(lapply(mat.df[,1:num_col], normalize))
mat.df_normalize <- cbind(mat.df_normalize, Res_Desc_Train_Assign)
colnames(mat.df_normalize)[ncol(mat.df_normalize)] <- "Class"

#names(mat.df)
outcomeName <- "Class"

train <- mat.df_normalize[1:nrow(past), ]
test  <- mat.df_normalize[(nrow(past) + 1):nrow(training), ]


train$Class <- as.factor(train$Class) 

###SVM Model
x <- subset(train, select = -Class)
y <- train$Class
model <- svm(x, y, probability = TRUE) 
test1 <- subset(test, select = -Class)
svm.pred <- predict(model, test1, decision.values = TRUE, probability = TRUE)
svm_prob <- attr(svm.pred, "probabilities")

finalresult <- cbind(test,svm.pred,svm_prob)
user3734568

1 Answer


Let's try tuning your SVM model.

You are running the model with default parameters, which is why you are not getting better accuracy. Model building is an iterative process: change the parameters, run the model, check the accuracy, and then repeat the whole process.

#grid-search over cost and gamma with cross-validation (tune's default is 10-fold)
model <- tune(svm, train.x=x, train.y=y, kernel="radial", ranges=list(cost=10^(-1:2), gamma=c(.5,1,2)))
print(model)
#select the values of cost & gamma from the output (also available as model$best.parameters) and pass them to tuned_model

tuned_model <- svm(x, y, kernel="radial", cost=<cost_from_tune_model_output>, gamma=<gamma_from_tune_model_output>)
#now check the accuracy of this model on a test dataset, adjust the tuning ranges accordingly, and repeat the whole process.
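
A rough sketch of that accuracy check (assuming you hold out a labelled slice of the past data as a validation set, since the future file has no Class labels; x_val and y_val are hypothetical names for that split):

val_pred <- predict(tuned_model, x_val)
conf_mat <- table(Predicted = val_pred, Actual = y_val)   #confusion matrix
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)           #proportion of correct predictions
print(accuracy)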

Hope this helps!

Prem
  • Thanks for your help. I will use the solution you shared and check whether accuracy improves. Currently I am getting very low accuracy, around 52%. – user3734568 Jul 19 '17 at 12:57
  • In that case you may also need to increase your training dataset so that the model learns properly. – Prem Jul 19 '17 at 13:02
  • Thanks for your suggestion; I will check whether I can get more data to train the model. Currently I have 13383 documents in my training dataset. – user3734568 Jul 19 '17 at 13:08