This is the first time I am using RTextTools. I have to implement SVM classification on a collection of text documents, and I am following this tutorial:

http://journal.r-project.org/archive/2013-1/collingwood-jurka-boydstun-etal.pdf

Here is my code, step by step.

First, I read my data and supplied an index file. The index file lists every text document to be classified along with its tag. For example, a file abc.txt belonging to genre X is stored in the index file as abc.txt,X, and so on.

    data = read_data('C:/Users/dell/Dropbox/Bundeli/corpus/wob/sklearn/folder', type=c('folder'), index = 'C:/Users/dell/Dropbox/Bundeli/corpus/wob/sklearn/index.txt')

Second, I create a document-term matrix.

    doc_matrix <- create_matrix(data, language="english", removeNumbers=TRUE, stemWords=TRUE, removeSparseTerms=.8)

Third, I create a container which houses the document-term matrix and the labels.

    container <- create_container(doc_matrix, data$genre, trainSize=1:93, testSize=94:116, virgin=FALSE)

Here, data$genre holds the labels: each document's genre, given in the same order as the documents, aligned like an index.

But now, when I try to train the SVM on the container using the following code,

    SVM <- train_model(container, "SVM")

it gives me this error:

    Error in svm.default(x = container@training_matrix, y = container@training_codes,  :   x and y don't match. 

When I look at the structure of the container, the training codes are empty, like this:

    Slot "training_codes":
    factor(0)
    Levels: 

    Slot "testing_codes":
    factor(0)
    Levels: 

I can show the full structure of the container object if needed, but this should not be happening. Can somebody please help? I have been searching hard for an answer. Could there be something wrong with the index file passed to read_data, or is it a problem with the data$genre variable? Those are the parts that are new to me, so I may have gotten them wrong. I would be most grateful for any help.

*** SOLVED ***

As suggested by @Theja, I checked str(data), then changed this call as follows:

    doc_matrix <- create_matrix(data, language="english", removeNumbers=TRUE, stemWords=TRUE, removeSparseTerms=.8)

This was also changed:

    container <- create_container(doc_matrix, data$genre, trainSize=1:93, testSize=94:116, virgin=FALSE)
JCollerton

3 Answers

You are on the right track in checking the structure of the container to debug the problem.

Try passing data$text (or whichever element holds the text) in the create_matrix() step, since data appears to be a list with genre as one of its elements, as seen in the create_container() step.

Check the structure of data using str(data) and pass the right arguments to create_matrix().
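As a rough illustration of that check, here is a minimal base-R sketch. The toy data frame and its column names (text, genre) are assumptions standing in for whatever read_data() actually returned; your object may use different names, which is exactly what str() will tell you.

```r
# Toy stand-in for the object returned by read_data(); the column names
# "text" and "genre" are assumptions for illustration only.
data <- data.frame(
  text  = c("first document", "second document", "third document"),
  genre = c("X", "Y", "X"),
  stringsAsFactors = FALSE
)

# Show the type and length of every column.
str(data)

# Keep the text and the labels as separate, equally long vectors.
texts  <- data$text
labels <- as.factor(data$genre)

is.character(texts)              # should be TRUE
length(texts) == length(labels)  # should be TRUE
```

If both checks hold, the text vector is what would go to create_matrix() and the label vector to create_container().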

  • Thank you @Theja. Analysing the structure really helped. – user3116297 Jul 30 '14 at 18:16
  • Another problem occurs, however, with the actual SVM implementation. It sorts the whole data set into only 2 labels instead of the 3 specified. I have checked the structure of each $ variable in "data". Everything seems to be perfectly in order: the data is read correctly and the labels are being specified. Does anyone have a solution? – user3116297 Jul 30 '14 at 18:35
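
One thing that might be worth checking for the 2-instead-of-3 labels symptom (this is only a guess, not a confirmed cause) is whether every genre actually occurs in the training rows. If the index file is sorted by genre, a split such as 1:93 / 94:116 could leave one genre entirely in the test range. A small base-R sketch with made-up labels and hypothetical index ranges:

```r
# Hypothetical labels, sorted by genre as they might appear in an index file.
labels <- factor(c(rep("X", 5), rep("Y", 5), rep("Z", 2)))

train_idx <- 1:10   # plays the role of trainSize
test_idx  <- 11:12  # plays the role of testSize

# Count how often each genre occurs in each split.
train_counts <- table(labels[train_idx])
test_counts  <- table(labels[test_idx])
print(train_counts)
print(test_counts)

# Genres with a zero count in the training split cannot be learned.
missing_in_train <- names(train_counts)[as.vector(train_counts) == 0]
print(missing_in_train)
```

If a genre shows up here, shuffling the rows before splitting is one possible way to get all classes into the training set.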

I faced exactly the same problem and solved it like this. Basically, the problem is in

    doc_matrix <- create_matrix(data, language="english", removeNumbers=TRUE, stemWords=TRUE, removeSparseTerms=.8)

Here the data needs to be a data frame built from vectors.

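    library(RTextTools)

    # v1: vector of document texts; v2: vector of labels, in the same order
    m <- data.frame(v1, v2)
    doc_matrix <- create_matrix(m$v1, language="english", removeNumbers=TRUE,
                                stemWords=TRUE, removeSparseTerms=.998)
    container <- create_container(doc_matrix, m$v2, trainSize=1:2500,
                                  testSize=2501:2676, virgin=FALSE)

    # Train the SVM on the training rows, then classify the test rows
    SVM <- train_model(container, "SVM")
    SVM_CLASSIFY <- classify_model(container, SVM)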

So if you build your doc_matrix from vectors like this, it should solve the problem.

  • Can this SVM give classification results by labelling test data based on training data? If yes, see http://stackoverflow.com/questions/29692571/svm-for-text-classification-in-r – KRU Apr 20 '15 at 08:02

I faced the same issue today. In my case it happened because the number of labels did not match the number of documents. Every document needs to be assigned a class/label.

In your case, you should have your text data and the corresponding labels as two separate columns, say

    trainData$data   # contains your text
    trainData$label  # has your genre

Make sure that length(trainData$data) == length(trainData$label).
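
A minimal sketch of that check in base R, written so it stops with an error before any container is built. The toy trainData and the train_idx / test_idx ranges are made up for illustration; substitute your own data and the ranges you pass as trainSize and testSize.

```r
# Toy data; in practice trainData comes from your own files.
trainData <- data.frame(
  data  = c("doc one", "doc two", "doc three", "doc four"),
  label = c("X", "Y", "X", "Y"),
  stringsAsFactors = FALSE
)

# Hypothetical split, analogous to trainSize and testSize.
train_idx <- 1:3
test_idx  <- 4:4

n_docs <- length(trainData$data)

# Stop early if labels and documents are out of step.
stopifnot(
  n_docs == length(trainData$label),        # one label per document
  !anyNA(trainData$label),                  # no missing labels
  max(c(train_idx, test_idx)) <= n_docs     # split stays inside the data
)
```

The last condition is an extra I added: index ranges that run past the number of documents are another way for labels and rows to end up mismatched.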

Indi