I am using R text2vec package for creating document-term-matrix. Here is my code:
library(lime)
library(text2vec)
# load data
data(train_sentences, package = "lime")
#
tokens <- train_sentences$text %>%
word_tokenizer
it <- itoken(tokens, progressbar = FALSE)
stop_words <- c("in","the","a","at","for","is","am") # stopwords
vocab <- create_vocabulary(it, c(1L, 2L), stopwords = stop_words) %>%
prune_vocabulary(term_count_min = 10, doc_proportion_max = 0.2)
vectorizer <- vocab_vectorizer(vocab )
dtm <- create_dtm(it , vectorizer, type = "dgTMatrix")
Another method is hash_vectorizer() instead of vocab_vectorizer() as:
h_vectorizer <- hash_vectorizer(hash_size = 2 ^ 10, ngram = c(1L, 2L))
dtm <- create_dtm(it,h_vectorizer)
But when I am using hash_vectorizer, there is no option for stopwords removal and pruning vocabulary. In a study case, hash_vectorizer works better than vocab_vectorizer for me. I know one can remove stopwords after creating dtm or even when creating tokens. Is there any other options, similar to the vocab_vectorizer and how it is created. Particularly I am interested in a method that also supports pruning vocabulary similar to prune_vocabulary().
I appreciate your responses. Thanks, Sam