I need to produce term-document matrices with exactly the same columns as the dtm I run the modelling on; otherwise I cannot use the random forest model on new documents.
In quanteda you can set the features of a test set to be identical to those of a training set using dfm_select(). For example, to make dfm1 below have identical features to dfm2:
txts <- c("a b c d", "a a b b", "b c c d e f")
(dfm1 <- dfm(txts[1:2]))
## Document-feature matrix of: 2 documents, 4 features (25% sparse).
## 2 x 4 sparse Matrix of class "dfmSparse"
## features
## docs a b c d
## text1 1 1 1 1
## text2 2 2 0 0
(dfm2 <- dfm(txts[2:3]))
## Document-feature matrix of: 2 documents, 6 features (41.7% sparse).
## 2 x 6 sparse Matrix of class "dfmSparse"
## features
## docs a b c d e f
## text1 2 2 0 0 0 0
## text2 0 1 2 1 1 1
dfm_select(dfm1, dfm2, valuetype = "fixed", verbose = TRUE)
## kept 4 features, padded 2 features
## Document-feature matrix of: 2 documents, 6 features (50% sparse).
## 2 x 6 sparse Matrix of class "dfmSparse"
## features
## docs a b c d e f
## text1 1 1 1 1 0 0
## text2 2 2 0 0 0 0
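To make the "kept 4 features, padded 2 features" step concrete, here is a minimal base R sketch of the same idea on plain count matrices. The helper match_features() is a hypothetical name of my own, not part of quanteda, and the two matrices simply re-enter the counts shown above by hand. As far as I know, recent quanteda versions also provide dfm_match(x, features) for this purpose, so check the documentation for your installed version.

```r
# Hypothetical helper (not a quanteda function): align the columns of a
# count matrix `x` to the feature set `features`. Shared features are kept,
# missing ones are padded with zeros, and extra ones are dropped.
match_features <- function(x, features) {
  out <- matrix(0, nrow = nrow(x), ncol = length(features),
                dimnames = list(rownames(x), features))
  shared <- intersect(colnames(x), features)
  out[, shared] <- x[, shared, drop = FALSE]
  out
}

# Counts copied by hand from dfm2 (the "training" features) ...
train_counts <- matrix(c(2, 0, 2, 1, 0, 2, 0, 1, 0, 1, 0, 1), nrow = 2,
                       dimnames = list(c("text1", "text2"),
                                       c("a", "b", "c", "d", "e", "f")))

# ... and from dfm1 (the "new" documents).
new_counts <- matrix(c(1, 2, 1, 2, 1, 0, 1, 0), nrow = 2,
                     dimnames = list(c("text1", "text2"),
                                     c("a", "b", "c", "d")))

# Result has columns a..f; e and f are zero-padded.
aligned <- match_features(new_counts, colnames(train_counts))
print(aligned)
```

The aligned matrix has the same columns, in the same order, as the training matrix, which is what a model fitted on the training features expects at prediction time.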
For feature-context matrices (which text2vec needs as input), however, this will not work, because the co-occurrences (at least those computed with a window rather than a document context) are interdependent across features, so you cannot simply add and remove features in the same way.
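One way to see that interdependence is a toy example in base R. The function window_cooc() below is a hypothetical, illustrative counter of my own, not how quanteda or text2vec actually compute co-occurrences. It shows that dropping a feature from the finished matrix gives a different answer than dropping the token before counting, because removing a token changes which neighbours fall inside the window.

```r
# Illustrative only: symmetric co-occurrence counts within `window` tokens.
window_cooc <- function(tokens, window = 1) {
  feats <- sort(unique(tokens))
  m <- matrix(0, length(feats), length(feats), dimnames = list(feats, feats))
  n <- length(tokens)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      if (i != j && abs(i - j) <= window) {
        m[tokens[i], tokens[j]] <- m[tokens[i], tokens[j]] + 1
      }
    }
  }
  m
}

toks <- c("a", "x", "b")

# Drop feature "x" AFTER counting: "a" and "b" were never adjacent.
full <- window_cooc(toks)
dropped_after <- full[c("a", "b"), c("a", "b")]

# Drop token "x" BEFORE counting: "a" and "b" become neighbours.
dropped_before <- window_cooc(toks[toks != "x"])

dropped_after["a", "b"]   # 0
dropped_before["a", "b"]  # 1
```

With a document-level context this problem does not arise in the same way, since whether two features share a document does not depend on which other features are present.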