
I asked this question on the R mailing list, but I think this is a better place to look for answers and tips.

I'm currently working on text classification of students' essays, trying to identify texts that do or do not fit a certain class. I use texts from one semester (A) for training and texts from another semester (B) for testing the classifier. My workflow is like this (see the sketch after the list):

  • read all texts from A, build a DTM(A) with about 1387 terms (package tm)
  • read all texts from B, build a DTM(B) with about 626 terms
  • train the classifier with DTM(A), using an SVM (package e1071)
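
For illustration, a minimal sketch of that workflow (file paths, the label vector and all object names are assumptions, not taken from the original post):

library(tm)
library(e1071)

# Semester A: build the training DTM (about 1387 terms) and train the SVM
corpus.A <- VCorpus(DirSource("essays_semester_A/"))   # assumed directory of essay files
dtm.A <- DocumentTermMatrix(corpus.A)
train.df <- as.data.frame(as.matrix(dtm.A))
train.df$class <- labels.A                             # assumed factor of class labels
model <- svm(class ~ ., data = train.df)

# Semester B: build the test DTM (about 626 terms, different vocabulary)
corpus.B <- VCorpus(DirSource("essays_semester_B/"))   # assumed directory of essay files
dtm.B <- DocumentTermMatrix(corpus.B)
test.df <- as.data.frame(as.matrix(dtm.B))

# this is where the error described below appears: terms used for training are missing in test.df
pred <- predict(model, newdata = test.df)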

Now I want to classify all texts in DTM(B) using the classifier. But when I try to use predict(), I always get the error message: Error in eval(expr, envir, enclos) : object 'XY' not found. As I found out, the reason for this is that DTM(A) and DTM(B) have a different number of terms, and consequently not every term used for training the model is available in DTM(B).

Sure, it's problematic to do a classification with two different feature spaces, but I want to find a solution for this real-world problem. The idea is to identify whether or not a text turned in by a student fits the other texts. So my naive idea is to develop a prediction model with texts from one semester [DTM(A)] and then use this model to evaluate a new text from another semester [DTM(B)]. As the new text isn't in the original DTM, the feature spaces differ. So far I have only found code that builds a single DTM from all texts, but this would require creating a new DTM(A) and re-training the SVM each and every time.

My question is: how should I deal with this? Should I match the terms used in DTM(A) and DTM(B) in order to get an identical feature space? This could be achieved either by reducing the number of terms in DTM(A) or by adding several empty/NA columns to DTM(B). Or is there another solution to my problem?

Kind regards

Björn

PsyR
  • I guess http://stackoverflow.com/questions/39721737/how-to-handle-errors-in-predict-function-of-r can help you. – abhiieor Feb 20 '17 at 09:41
  • Thanks, but the posting you linked to is not about different feature spaces, which means different columns in the DTM, but about different levels of categorical variables. – PsyR Feb 20 '17 at 10:23
  • You can generalize. No machine learning method can deal with new predictors, which is essentially what you get when you create dummy variables for a categorical variable with new levels. So, as said in my answer there, you need to maintain a list of the variables which are part of the training data and hence the model. Filter down your test/prediction data based on this list and then go ahead with scoring on the trained object (see the sketch after these comments). – abhiieor Feb 20 '17 at 12:17
  • You may want to do the train/test split in a smart way, so that the training data contains as many words as possible. – abhiieor Feb 20 '17 at 12:18
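
A minimal sketch of what these comments suggest (object names are assumptions; note that for predict() on an e1071 SVM the training-only terms still have to be added back as empty columns, which is what the answer below does):

# keep a list of the variables that are part of the training data / model
train.vocab <- colnames(train.df)
# filter the test data down to the terms the model has already seen
common <- intersect(colnames(test.df), train.vocab)
test.df.filtered <- test.df[, common, drop = FALSE]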

2 Answers


After some more experiments and some research, I came across the RTextTools package and its create_matrix() function. It creates a new DTM, and you can also pass an originalMatrix (the matrix used to train the model) to which the new matrix is adjusted. This was exactly what I was looking for. So I looked at the original code (https://github.com/timjurka/RTextTools/blob/master/RTextTools/R/create_matrix.R) and came up with this:

# get all the terms that are in the training df but not in the test df
terms <- setdiff(colnames(train.df), colnames(test.df))
# small non-zero weight, in case weightTfIdf was used; with plain term frequencies 0 is fine
weight <- 0.000000001
# now create a new matrix holding the missing terms as (practically) empty columns
amat <- matrix(weight, nrow = nrow(test.df), ncol = length(terms))
colnames(amat) <- terms
rownames(amat) <- rownames(test.df)

# combine the original test values with the matrix of missing terms,
# keeping only the test columns that also occur in the training df
test.df.fixed <- cbind(test.df[, which(colnames(test.df) %in% colnames(train.df)), drop = FALSE], amat)
test.df.fixed <- test.df.fixed[, sort(colnames(test.df.fixed))]

The result is a test data frame that has all the features (columns) of the data frame used for training. So it's basically an "up-filtering" instead of a down-filtering. A quick test showed it works quite well (accuracy: .91, kappa: .88).
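
For completeness, a hedged usage sketch (the svm object 'model' and its name come from the question's setup, not from code in this answer):

# 'model' is assumed to be the SVM trained on train.df with e1071::svm
# the column names now match, so predict() no longer aborts with "object 'XY' not found"
pred <- predict(model, newdata = test.df.fixed)

Alternatively, create_matrix(..., originalMatrix = ...) from RTextTools performs this alignment internally, as mentioned above.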

PsyR

In a real-world setting, your training and test data are completely independent. This means that you know nothing about your test documents up front. With this in mind, the best way to solve your problem is to base the DTM for dataset B on the vocabulary used in dataset A (i.e. only count words that occurred in A).
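
A minimal sketch of this idea with the tm package (the corpus and DTM object names are assumptions; the dictionary control option restricts counting to the training vocabulary):

library(tm)
# count only the terms that already occur in the training DTM from semester A
dtm.B <- DocumentTermMatrix(corpus.B,
                            control = list(dictionary = Terms(dtm.A)))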

PinkFluffyUnicorn