R: Recreate same document term matrix with new data

Question

I have exactly the same question as in the post below, except that I used quanteda to generate a dfm for an SVM model (because I need identical dfms for cross-validation prediction): How to recreate same DocumentTermMatrix with new (test) data

However, my training set (trainingtfidf, corresponding to crude1.dtm in the post) has 170,000+ documents and my test set (testtfidf, corresponding to crude2.dtm in the post) has 670,000+, so I couldn't convert my new test set to either a matrix or a data frame:

>testtfidf <- as.data.frame(testtfidf)
Error in asMethod(object) : 
      Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105

So I tried to do it as a dfm directly:

# Keep the columns in the test set that are shared with the training set
testtfidf1 <- testtfidf[, intersect(colnames(testtfidf), colnames(trainingtfidf))]
# Extract the column names that are in the training set but not in the test set
namevactor <- colnames(protocoltfidf)[which(!colnames(protocoltfidf) %in% colnames(testtfidf1)==TRUE)]
# Add those columns back to the test set and set the elements to NA, since the terms do not exist in the test set
testtfidf1[,namevactor] <- NA

But the last line gave me this error:

Error in intI(i, n = di[margin], dn = dn[[margin]], give.dn = FALSE) : 
invalid character indexing

Could anyone help me with this? I've been struggling for two days and I'm so close to getting this done! Thanks!

score 0 · Answer 1 · answered Jun 30 '16 at 15:45

This answer is still a bit rough, but here is what I think the problem might be. The error appears to come from the function intI(), which lives in Tsparse.R, a source file of the Matrix package. The function is reproduced at the bottom of this post. You may be able to avoid triggering that error altogether. Consider the following:

It seems that protocoltfidf is the original data set. The second line of your code snippet extracts the column names from protocoltfidf that are not in the test data set. So namevactor is a vector of strings, none of which are column names in testtfidf1.

This may be oversimplifying the problem, but your issue might just be that you are trying to assign NA values to columns of testtfidf1 that do not exist. Remember that namevactor contains only column names that are absent from testtfidf1. So testtfidf1[,namevactor] references columns that are not there, and the character lookup in intI() fails with "invalid character indexing" because it cannot match those names.
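To see why the lookup fails, the character branch of intI() (reproduced at the bottom of this post) can be mimicked in base R. This is only a minimal sketch: the term names below are made up for illustration and are not from your data.

```r
# Made-up stand-ins: 'dn' plays the role of colnames(testtfidf1),
# 'i' plays the role of namevactor (names requested in the assignment).
dn <- c("oil", "price")
i  <- c("price", "barrel")   # "barrel" is not an existing column name

# Same lookup as the character branch of intI()
i0 <- match(i, dn)
i0          # 2 NA
anyNA(i0)   # TRUE
```

Any NA in i0 is exactly the condition under which intI() calls stop("invalid character indexing").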

Maybe try creating new columns in testtfidf1 instead, using the strings in namevactor as their column names, and set the values in those columns to NA.
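One way to create the missing columns without converting to a dense matrix is to bind an empty sparse block onto the test matrix and then reorder the columns. The following is a rough sketch under several assumptions: it works on a plain sparse matrix from the Matrix package rather than on a quanteda dfm, align_columns() is a hypothetical helper name, the toy terms are invented, and it fills the new columns with 0 instead of NA (a term that never occurs has a count of zero, and zeros keep the matrix sparse).

```r
library(Matrix)

# Hypothetical helper: give 'test' exactly the columns in 'train_terms',
# in that order. Terms absent from 'test' become all-zero columns.
align_columns <- function(test, train_terms) {
  shared  <- intersect(train_terms, colnames(test))
  missing <- setdiff(train_terms, colnames(test))

  # Keep only the columns the two sets share
  out <- test[, shared, drop = FALSE]

  if (length(missing) > 0) {
    # Empty sparse block (no stored entries) named after the missing terms
    pad <- sparseMatrix(i = integer(0), j = integer(0), x = numeric(0),
                        dims = c(nrow(test), length(missing)),
                        dimnames = list(rownames(test), missing))
    out <- cbind(out, pad)
  }

  # Every training term now exists, so character indexing is safe
  out[, train_terms, drop = FALSE]
}

# Toy data for illustration only
train_terms <- c("oil", "price", "barrel")
test <- sparseMatrix(i = c(1, 2), j = c(1, 2), x = c(1, 2),
                     dims = c(2, 2),
                     dimnames = list(c("d1", "d2"), c("price", "opec")))
aligned <- align_columns(test, train_terms)
```

Here aligned has the columns oil, price, barrel in the training order, and the test-only term opec is dropped. For an actual dfm you may need to coerce to a plain sparse matrix first, or use quanteda's own feature-matching helpers; recent quanteda versions, as far as I know, provide dfm_match() for this purpose.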

intI <- function(i, n, dn, give.dn = TRUE)
{
    ## Purpose: translate numeric | logical | character index
    ##      into 0-based integer
    ## ----------------------------------------------------------------------
    ## Arguments: i: index vector (numeric | logical | character)
    ##        n: array extent           { ==  dim(.) [margin] }
    ##       dn: character col/rownames or NULL { == dimnames(.)[[margin]] }
    ## ----------------------------------------------------------------------
    ## Author: Martin Maechler, Date: 23 Apr 2007

    has.dn <- !is.null.DN(dn)
    DN <- has.dn && give.dn
    if(is(i, "numeric")) {
        storage.mode(i) <- "integer"
        if(anyNA(i))
            stop("'NA' indices are not (yet?) supported for sparse Matrices")
        if(any(i < 0L)) {
            if(any(i > 0L))
                stop("you cannot mix negative and positive indices")
            i0 <- (0:(n - 1L))[i]
        } else {
            if(length(i) && max(i, na.rm=TRUE) > n)
                stop(gettextf("index larger than maximal %d", n), domain=NA)
            if(any(z <- i == 0)) i <- i[!z]
            i0 <- i - 1L        # transform to 0-indexing
        }
        if(DN) dn <- dn[i]
    }
    else if (is(i, "logical")) {
        if(length(i) > n)
            stop(gettextf("logical subscript too long (%d, should be %d)",
                          length(i), n), domain=NA)
        i0 <- (0:(n - 1L))[i]
        if(DN) dn <- dn[i]
    } else { ## character
        if(!has.dn)
            stop("no 'dimnames[[.]]': cannot use character indexing")
        i0 <- match(i, dn)
        if(anyNA(i0)) stop("invalid character indexing")
        if(DN) dn <- dn[i0]
        i0 <- i0 - 1L
    }
    if(!give.dn) i0 else list(i0 = i0, dn = dn)
} ## {intI}
