
I am using the R text2vec package to create a document-term matrix. Here is my code:

library(lime)
library(text2vec) 

# load data
data(train_sentences, package = "lime")  

# tokenize the sentences
tokens <- train_sentences$text %>%
  word_tokenizer()

it <- itoken(tokens, progressbar = FALSE)

stop_words <- c("in","the","a","at","for","is","am") # stopwords
vocab <- create_vocabulary(it, ngram = c(1L, 2L), stopwords = stop_words) %>%
  prune_vocabulary(term_count_min = 10, doc_proportion_max = 0.2)
vectorizer <- vocab_vectorizer(vocab)

dtm <- create_dtm(it, vectorizer, type = "dgTMatrix")

Another method is to use hash_vectorizer() instead of vocab_vectorizer():

h_vectorizer <- hash_vectorizer(hash_size = 2 ^ 10, ngram = c(1L, 2L))
dtm <- create_dtm(it, h_vectorizer)

But when I use hash_vectorizer, there is no option for stopword removal or vocabulary pruning. In a case study, hash_vectorizer works better than vocab_vectorizer for me. I know one can remove stopwords after creating the dtm, or even when creating the tokens (a minimal sketch of the latter follows below). Are there any other options, similar to what vocab_vectorizer offers? In particular, I am interested in a method that also supports pruning the vocabulary, similar to prune_vocabulary().
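For example, a sketch of the token-level approach, assuming the tokens, stop_words, and h_vectorizer objects defined above (this only drops exact, case-sensitive matches):

# filter stopwords out of each document's token vector before hashing
tokens_clean <- lapply(tokens, function(x) x[!x %in% stop_words])

it_clean <- itoken(tokens_clean, progressbar = FALSE)
dtm_hashed <- create_dtm(it_clean, h_vectorizer)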

I appreciate your responses. Thanks, Sam


1 Answer


This is not possible. The whole point of using hash_vectorizer and feature hashing is to avoid hashmap lookups (getting the index of a given word). Removing stop-words requires exactly that kind of lookup: checking whether a word is in the set of stop-words. It is usually recommended to use hash_vectorizer only if your dataset is very big and building a vocabulary would take a lot of time/memory. Otherwise, in my experience, vocab_vectorizer with prune_vocabulary performs at least as well.
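As a rough illustration (a conceptual sketch only, using the digest package rather than text2vec's internal hash function), a token's column index can be computed directly from a hash, with no vocabulary lookup:

library(digest)  # illustration only; not what text2vec uses internally

hash_size <- 2 ^ 10
token_to_column <- function(token) {
  # map a token straight to a column index in 1..hash_size
  (digest2int(token) %% hash_size) + 1
}
token_to_column("model")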

Also, if you use hash_vectorizer with a small hash_size, it acts as a dimensionality reduction step and hence can reduce variance for your dataset. So if your dataset is not very big, I suggest using vocab_vectorizer and playing with the prune_vocabulary parameters to reduce the vocabulary and document-term matrix size.
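For instance, a minimal sketch of tightening the pruning (the thresholds are arbitrary here, just to show the available knobs):

vocab_small <- create_vocabulary(it, ngram = c(1L, 2L), stopwords = stop_words) %>%
  prune_vocabulary(term_count_min = 20,       # drop rare terms
                   doc_proportion_max = 0.1,  # drop terms appearing in >10% of documents
                   vocab_term_max = 5000)     # keep at most the 5000 most frequent terms
dtm_small <- create_dtm(it, vocab_vectorizer(vocab_small))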

Dmitriy Selivanov
  • Thanks so much, Dmitriy, for your response. In the link below, they used hash_vectorizer and the corresponding dtm in xgboost for classification. When I use vocab_vectorizer instead of hash_vectorizer, the prediction accuracy drops from ~86 to ~74 (which is a bit strange?); that is the reason I would like to use hash_vectorizer. As you said, it is not possible to use them with hash_vectorizer, so the solution will probably be removing stopwords after creating the dtm. https://cran.r-project.org/web/packages/lime/vignettes/Understanding_lime.html – Sam S. Nov 04 '18 at 08:13