I have exactly the same question as in this post, except that I used quanteda to generate a dfm for an SVM model (because I need the training and test dfms to have exactly the same features for cross-validation prediction): How to recreate same DocumentTermMatrix with new (test) data
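For context, I build the training and test tf-idf dfms roughly like this (a simplified sketch; trainingcorpus and testcorpus are placeholder names, and the real preprocessing has more steps):
library(quanteda)
# tokenize each corpus, build a dfm, then apply tf-idf weighting
trainingtfidf <- dfm_tfidf(dfm(tokens(trainingcorpus)))
testtfidf <- dfm_tfidf(dfm(tokens(testcorpus)))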
However, my training set (trainingtfidf, corresponding to crude1.dtm in that post) has 170,000+ documents and my test set (testtfidf, corresponding to crude2.dtm) has 670,000+, so I couldn't convert the test set to either a matrix or a data frame:
> testtfidf <- as.data.frame(testtfidf)
Error in asMethod(object) :
  Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105
So I tried to do the alignment directly on the sparse dfm instead:
# Keep only the columns of the test set that are shared with the training set
testtfidf1 <- testtfidf[, intersect(colnames(testtfidf), colnames(trainingtfidf))]
# Extract the column names that are in the training set but not in the test set
namevector <- setdiff(colnames(trainingtfidf), colnames(testtfidf1))
# Add those columns to the test set and fill them with NA, since these terms do not occur in the test set
testtfidf1[, namevector] <- NA
But the last line gave me this error:
Error in intI(i, n = di[margin], dn = dn[[margin]], give.dn = FALSE) :
invalid character indexing
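I also wondered whether quanteda itself can align the test dfm to the training features while keeping everything sparse, something like dfm_match() (just an idea; I'm not sure whether my quanteda version supports it or whether it can handle dfms of this size):
# keep shared features and add the missing training features as all-zero columns
testtfidf1 <- dfm_match(testtfidf, features = featnames(trainingtfidf))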
Could anyone help me with this? I've been struggling for two days and I'm so close to getting this done! Thanks!