SVM implementation using RTextTools

Question

This is the first time I am using RTextTools. I have to implement SVM classification on a collection of text documents. I am following this tutorial:

http://journal.r-project.org/archive/2013-1/collingwood-jurka-boydstun-etal.pdf

Here is my code, step by step.

First, I read my data using an index file. The index file lists every text document to be classified along with its tag. For example, if the file abc.txt belongs to genre X, the index file stores it as abc.txt,X and so on.

    data = read_data('C:/Users/dell/Dropbox/Bundeli/corpus/wob/sklearn/folder', type=c('folder'), index = 'C:/Users/dell/Dropbox/Bundeli/corpus/wob/sklearn/index.txt')

Second, I create a document-term matrix.

    doc_matrix <- create_matrix(data, language="english", removeNumbers=TRUE, stemWords=TRUE, removeSparseTerms=.8)

Third, I create a container that holds the matrix and the labels.

    container <- create_container(doc_matrix, data$genre, trainSize=1:93, testSize=94:116, virgin=FALSE)

Here, data$genre holds the labels: one genre label per document, in the same order as the documents.

But now, when I try to train the SVM on the container using the following code,

    SVM <- train_model(container, "SVM")

it gives me this error:

    Error in svm.default(x = container@training_matrix, y = container@training_codes,  :   x and y don't match.

When I inspect the structure of "container", the training and testing codes are empty, like this:

    Slot "training_codes":
    factor(0)
    Levels: 

    Slot "testing_codes":
    factor(0)
    Levels:

I can show the full structure of the "container" object if needed, but this should not be happening. Can somebody please help? Could something be wrong with the index file passed to read_data, or is the problem with the data$genre variable? Those are the parts that are new to me, so I may have gotten them wrong.

*** SOLVED ***

As suggested by @Theja, I checked the output of str(data). Then I changed this call:

    doc_matrix <- create_matrix(data, language="english", removeNumbers=TRUE, stemWords=TRUE, removeSparseTerms=.8)

This was also changed:

    container <- create_container(doc_matrix, data$genre, trainSize=1:93, testSize=94:116, virgin=FALSE)

score 0 · Answer 1 · answered Jul 25 '14 at 12:08


You are on the right track with checking the structure of container to debug the problem.

Try passing data$text (or a similarly named element) in the create_matrix() step instead of data itself. The create_container() step uses data$genre, which suggests that data is a list with genre as one of its elements, so the text is probably stored in another element.

Check the structure of data with str(data) and pass the element that holds the text to create_matrix().
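A minimal sketch of that check, using base R only. The data frame below is a hypothetical stand-in for whatever read_data() returns, and the column names `text` and `genre` are assumptions for illustration; the real names are whatever str(data) reports.

```r
# Hypothetical stand-in for the object returned by read_data().
# The column names `text` and `genre` are assumptions, not RTextTools output.
data <- data.frame(
  text  = c("first document", "second document", "third document"),
  genre = c("X", "Y", "X"),
  stringsAsFactors = FALSE
)

# str() lists every element with its type and first few values,
# which shows which element actually holds the document text.
str(data)

# The text element should be a character vector with one entry per label.
text_col <- data$text
is_text  <- is.character(text_col)
same_len <- length(text_col) == length(data$genre)
```

Whichever element turns out to hold the text is the one to pass as the first argument of create_matrix(), in place of the whole object.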

answered Jul 25 '14 at 12:08

Theja Tulabandhula


Thank you @Theja. Analysing the structure really helped. – user3116297 Jul 30 '14 at 18:16
However, another problem occurs with the actual SVM classification: it sorts the data into only 2 labels instead of the 3 specified. I have checked the structure of each $ variable in "data" and everything seems to be in order: the data is read correctly and the labels are specified. Does anyone have a solution? – user3116297 Jul 30 '14 at 18:35
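The thread does not record a resolution for the two-label result. One possible cause worth ruling out, offered here only as an assumption, is that the documents are ordered by genre, so the training rows 1:93 never include the third class. The sketch below uses made-up labels (the class sizes are invented) to show how table() exposes that situation.

```r
# Hypothetical labels, ordered by genre the way a folder listing might be.
# The class sizes are invented for illustration.
genre <- factor(rep(c("X", "Y", "Z"), times = c(50, 43, 23)))

train_idx <- 1:93
test_idx  <- 94:116

# Count how many documents of each class fall in each split.
train_counts <- table(genre[train_idx])
test_counts  <- table(genre[test_idx])
print(train_counts)   # class Z has a count of 0 in this made-up example

# Classes the model would never see during training.
missing_in_train <- names(train_counts)[train_counts == 0]

# One remedy: shuffle the row order once, then reorder the documents and the
# labels with the same index before building the matrix and the container.
set.seed(42)
shuffled_idx <- sample(length(genre))
```

If a class has a zero count in the training split, the classifier cannot predict it, whatever the labels in the test split are.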

score 0 · Answer 2 · answered Feb 07 '15 at 19:45

I faced exactly the same problem and solved it like this. The problem is in this call:

    doc_matrix <- create_matrix(data, language="english", removeNumbers=TRUE, stemWords=TRUE, removeSparseTerms=.8)

Here the data needs to be a data frame built from vectors, and the text column of that data frame is what gets passed to create_matrix().

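    library(RTextTools)

    # v1: vector of document texts; v2: labels, one per document
    m <- data.frame(v1,v2)
    # Build the document-term matrix from the text column, not the whole data frame
    doc_matrix <- create_matrix(m$v1, language="english", removeNumbers=TRUE,
                               stemWords=TRUE, removeSparseTerms=.998)
    # Pair the matrix with the labels and define the train/test split
    container <- create_container(doc_matrix, m$v2, trainSize=1:2500,
                                 testSize=2501:2676, virgin=FALSE)

    SVM <- train_model(container,"SVM")
    SVM_CLASSIFY <- classify_model(container, SVM)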

So if you build your doc_matrix from vectors like this, it should solve the problem.

Can this SVM produce classification results by labelling test data based on the training data? If yes, see http://stackoverflow.com/questions/29692571/svm-for-text-classification-in-r – KRU, Apr 20 '15 at 08:02

score 0 · Answer 3 · answered Mar 29 '16 at 12:32

I faced the same issue today. In my case it happened because the length of the labels did not match the number of documents. Every document needs to be assigned a class/label.

In your case, you should have your text data and the corresponding labels as two separate columns, say

    trainData$data   ## contains your text
    trainData$label  ## has your genre

Make sure that length(trainData$data) == length(trainData$label).
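That check can be made explicit before anything is handed to create_container(). A small sketch follows; `check_labels` is a hypothetical helper written for this example (it is not part of RTextTools), and the sample vectors are made up.

```r
# Hypothetical helper, not part of RTextTools: fail early with a clear
# message when documents and labels do not line up one to one.
check_labels <- function(texts, labels) {
  if (length(texts) != length(labels)) {
    stop(sprintf("%d documents but %d labels",
                 length(texts), length(labels)))
  }
  if (anyNA(labels)) {
    stop("some documents have a missing (NA) label")
  }
  invisible(TRUE)
}

# Made-up example data: the labels vector is one entry short on purpose.
texts  <- c("doc one", "doc two", "doc three")
labels <- c("X", "Y")

# Capture the error message instead of aborting the script.
result <- tryCatch(check_labels(texts, labels),
                   error = function(e) conditionMessage(e))
print(result)
```

Running such a check right after reading the data gives a more readable failure than the "x and y don't match" error raised later inside the SVM training call.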
