This is the first time I am using RTextTools. I have to implement SVM classification on a collection of text documents, and I am following this tutorial:
http://journal.r-project.org/archive/2013-1/collingwood-jurka-boydstun-etal.pdf
Here is my code, step by step.
First, I read my data and supplied an index file. The index file lists every text document to be classified along with its tag. For example, if a file abc.txt belongs to genre X, the index file stores it as abc.txt,X, and so on.
data = read_data('C:/Users/dell/Dropbox/Bundeli/corpus/wob/sklearn/folder', type=c('folder'), index = 'C:/Users/dell/Dropbox/Bundeli/corpus/wob/sklearn/index.txt')
Second, I create a document-term matrix.
doc_matrix <- create_matrix(data, language="english", removeNumbers=TRUE, stemWords=TRUE, removeSparseTerms=.8)
Third, I create a container that holds the document-term matrix and the labels.
container <- create_container(doc_matrix, data$genre, trainSize=1:93, testSize=94:116, virgin=FALSE)
Here, data$genre is the label vector: it gives each document's genre, in the same order as the documents.
Now I try to train the SVM on the container with the following code:
SVM <- train_model(container, "SVM")
It gives me this error:
Error in svm.default(x = container@training_matrix, y = container@training_codes, : x and y don't match.
When I inspect the structure of the "container" object, the training codes are empty, like this:
Slot "training_codes":
factor(0)
Levels:
Slot "testing_codes":
factor(0)
Levels:
I can show you the full structure of the "container" object if you like, but this should not be happening. Can somebody please help? I have been searching for an answer without success. Could there be something wrong with the index file passed to read_data, or is the problem with the data$genre variable? Those are the parts that are new to me, so I may have gotten them wrong. I would be most grateful for any help.
*** SOLVED ***
As suggested by @Theja, I checked str(data), then changed the code as follows:
doc_matrix <- create_matrix(data, language="english", removeNumbers=TRUE, stemWords=TRUE, removeSparseTerms=.8)
This was also changed:
container <- create_container(doc_matrix, data$genre, trainSize=1:93, testSize=94:116, virgin=FALSE)
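For anyone hitting the same error, here is a minimal sketch of the kind of check that str(data) makes possible. It uses a toy data frame in place of the real read_data() output, and the column names text and label are made up for illustration; I am not claiming these are the names RTextTools produces. It only shows one way an empty factor(0) can arise: asking a data frame for a column that does not exist returns NULL without any error, and NULL turns into an empty factor.

```r
# Toy data frame standing in for the output of read_data().
# The column names "text" and "label" are hypothetical.
data <- data.frame(
  text  = c("doc one", "doc two", "doc three"),
  label = c("X", "Y", "X"),
  stringsAsFactors = FALSE
)

# Inspect the real column names and types before using them.
str(data)
names(data)

# A column that does not exist comes back as NULL, silently.
labels <- data$genre
is.null(labels)

# Subsetting NULL and converting it to a factor gives an empty factor,
# which prints as factor(0) with no levels.
codes <- as.factor(labels[1:2])
length(codes)

# A guard worth running before building the container: the label column
# must exist and must have one entry per document.
stopifnot(!is.null(data$label), length(data$label) == nrow(data))
```

If the label vector is NULL or its length differs from the number of rows in the document-term matrix, it is worth fixing that before calling train_model.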