There is documentation on creating a DTM (document-term matrix) with the text2vec package, for example the following, where a TF-IDF weighting is applied after the matrix is built:
data("movie_review")
N <- 1000
it <- itoken(movie_review$review[1:N], preprocess_function = tolower,
             tokenizer = word_tokenizer)
v <- create_vocabulary(it)
vectorizer <- vocab_vectorizer(v)
it <- itoken(movie_review$review[1:N], preprocess_function = tolower,
             tokenizer = word_tokenizer)
dtm <- create_dtm(it, vectorizer)
# get tf-idf matrix from bag-of-words matrix
dtm_tfidf <- transformer_tfidf(dtm)
It is common practice to build a DTM from a training dataset and use it as input to a model. Then, when new data is encountered (a test set), one needs to create the same DTM on the new data, i.e. with exactly the same terms (columns) that were used for the training set. Is there any way in the package to transform a new dataset in this manner? (In scikit-learn we have a transform method for just this type of situation.)
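To make the desired workflow concrete, here is a sketch of what I am hoping is possible: reuse the vectorizer fitted on the training documents when building the DTM for held-out documents, so the test matrix gets the training vocabulary's columns. The `(N+1):(N+500)` test slice is just an illustrative choice, and I am assuming (this is the question) that the vectorizer can be applied to a fresh iterator like this:

```r
library(text2vec)
data("movie_review")

# --- training side: fit vocabulary on the first N reviews ---
N <- 1000
it_train <- itoken(movie_review$review[1:N], preprocess_function = tolower,
                   tokenizer = word_tokenizer)
v <- create_vocabulary(it_train)
vectorizer <- vocab_vectorizer(v)

it_train <- itoken(movie_review$review[1:N], preprocess_function = tolower,
                   tokenizer = word_tokenizer)
dtm_train <- create_dtm(it_train, vectorizer)

# --- test side: same vectorizer, new documents ---
# Desired behaviour: dtm_test has identical columns to dtm_train,
# with terms unseen in training simply dropped.
it_test <- itoken(movie_review$review[(N + 1):(N + 500)],
                  preprocess_function = tolower,
                  tokenizer = word_tokenizer)
dtm_test <- create_dtm(it_test, vectorizer)
```

In scikit-learn terms, `create_vocabulary` plus `vocab_vectorizer` would play the role of `fit`, and the second `create_dtm` call would play the role of `transform`; what I want to confirm is whether text2vec guarantees the column alignment between `dtm_train` and `dtm_test` here.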