Introduction
At my school I have to take part in a challenge to show that I understand how text mining works in R.
For this, we have 1050 files of different types (shopping, home, account, etc.).
The goal of the exercise is to develop a script that finds the type of an HTML page with a classifier; both speed and accuracy are very important.
My team and I started with a kppv (k-nearest-neighbours) classifier, but it gave us a 40% error rate, so we decided to use an SVM classifier instead.
Research
With the help of several docs and a lot of patience, we wrote a script that builds an SVM model from all the documents. When we check whether a file that was put into the model is recognized, it works.
But when we pass in a new HTML page, the output changes, and we don't know what to do with it.
Code
main.r
library("e1071")
library("tm")

# Concatenate the elements of x into a single string
splash = function(x) {
  res = NULL
  for (i in x) res = paste(res, i)
  res
}
# Remove scripts (<script ... </script>)
removeScript = function(t) {
  sp = strsplit(t, "<script")
  vec = sapply(sp[[1]], gsub, pattern = ".*</script>", replacement = " ")
  PlainTextDocument(splash(vec))
}

# Remove all tags
removeBalises = function(x) {
  t1 = gsub("<[^>]*>", " ", x)
  PlainTextDocument(gsub("[ \t]+", " ", t1))
}
clean_corpus = function(corp) {
  corp <- tm_map(corp, content_transformer(tolower))
  corp <- tm_map(corp, content_transformer(splash))
  corp <- tm_map(corp, content_transformer(removeScript))
  corp <- tm_map(corp, content_transformer(removeBalises))
  corp <- tm_map(corp, removeNumbers)
  corp <- tm_map(corp, removeWords, words = stopwords('en'))
  corp <- tm_map(corp, stemDocument)
  corp <- tm_map(corp, removePunctuation)
  corp
}
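
# Load the training matrix (term counts plus a final "classes" column) and the vocabulary
training_set = readRDS(file = "training_set.rds")
term20 = readRDS(file = "term20.rds")
# 7 classes with 150 documents each
classes = c(rep(1,150), rep(2,150), rep(3,150), rep(4,150), rep(5,150), rep(6,150), rep(7,150))
# x is taken from the last column of training_set
model <- svm(x = training_set[, ncol(training_set)], y = classes, type = 'C', kernel = 'linear', cost = 1, gamma = 1)
summary(model)
# Check the model against the known classes
pred = predict(model, classes)
pred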

# Read and clean every file in the "testing" directory
testingFile = function() {
  src = DirSource("testing")
  corp = VCorpus(src)
  clean_corpus(corp)
}
testCorpus = testingFile()
testCorpus
testdtm = DocumentTermMatrix(testCorpus, control = list(weighting = weightTf))
testmat = as.matrix(testdtm)
# Predict for the first test document only, keeping the terms listed in term20
testpreds = sapply(1, function(i) {
  v = testmat[i, ][term20]
  #v[is.na(v)] = 0
  predict(model, v)
})
testpreds
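
The line `v = testmat[i, ][term20]` selects the training vocabulary from a test document by name, so any term of `term20` that does not occur in the test page comes back as `NA`. Here is a minimal base-R sketch of that alignment step, with missing terms counted as 0. The helper `align_to_vocab` and the toy data are hypothetical and are not part of tm or e1071.

```r
# Hypothetical helper: align one document's term counts with a fixed
# training vocabulary, counting absent terms as 0.
align_to_vocab <- function(counts, vocab) {
  v <- counts[vocab]      # terms missing from the document come back as NA
  v[is.na(v)] <- 0        # an absent term has a count of 0
  names(v) <- vocab       # restore the names lost on the NA entries
  v
}

# Toy data, invented for illustration only
vocab  <- c("account", "airport", "audio")
counts <- c(airport = 4, also = 1, account = 1)

aligned <- align_to_vocab(counts, vocab)
print(aligned)
```

The result has one entry per vocabulary term, in the vocabulary's order; terms that only occur in the test page (here `also`) are dropped.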
Script for extracting the text
library("tm")
library("magrittr")
library("SnowballC")
library("nnet")
acc<-VCorpus(DirSource("training2016/", recursive=TRUE))
#acc<-VCorpus(DirSource("trainingLight/", recursive=TRUE))
[...]
dtm = DocumentTermMatrix(clean_corpus(acc))
dtm
term20 = findFreqTerms(dtm, lowfreq = 20)
freqs = sapply(1:50, function(i) length(findFreqTerms(dtm, lowfreq = i)))
plot(freqs)
dtm20 = dtm[, term20]
dim(dtm20)
m = as.matrix(dtm20)
classes = c(rep(1,150), rep(2,150), rep(3,150), rep(4,150), rep(5,150), rep(6,150), rep(7,150))
#classes = c(rep(1,150), rep(2,150), rep(3,150))
training_set = cbind(m, classes)
saveRDS(training_set, file = "training_set.rds")
saveRDS(term20, file = "term20.rds")
Result
When we pass in only one file, the script outputs a list of words, each with a value (which is the class).
This output may be useful, but we don't know how.
We would like to know how to use this output.
The output
accessori "5"
account "1"
ahead "1"
airport "4"
also "1"
amp "1"
anyon "1"
appl "7"
around "1"
audio "1"
australia "1"
avail "1"
...
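
One possible reading of this output, offered as an assumption rather than a confirmed diagnosis: the word counts reach `predict` as a plain named vector, and a plain vector may be treated as a column, that is, as many observations with one feature each, so that every word receives its own class. The base-R sketch below only illustrates the two shapes involved; it does not call e1071, and the counts are invented.

```r
# Toy word counts for one document (invented for illustration)
v <- c(accessori = 5, account = 1, airport = 4)

# A plain vector turned into a column: 3 rows (one per word), 1 column
as_column <- t(t(v))
print(dim(as_column))
print(rownames(as_column))

# The same counts as a single row: 1 document, 3 features
as_row <- matrix(v, nrow = 1, dimnames = list(NULL, names(v)))
print(dim(as_row))
```

If that reading is right, a document would need to be passed as a one-row matrix whose columns match the features the model was trained on in order to get a single class for the whole page.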