I have a massive tibble of tokens that I'm trying to filter and then cast into a document-term matrix.
My problem is that the grouped filtering step runs really slowly.
Does anyone have a good suggestion for speeding up the process, or another way to remove words that occur in more or fewer than n% of documents? (I don't like the tm package, and I'm a beginner.)
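For reference, token looks roughly like this (a simplified sketch; the sample values are made up, but the document and word columns match my real data, which is much larger):

library(tibble)

token <- tribble(
  ~document, ~word,
  "doc1",    "apple",
  "doc1",    "banana",
  "doc2",    "apple",
  "doc2",    "cherry"
)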
The code:
library(dplyr)

dtm <-
  token %>%
  count(document, word) %>%
  filter(nchar(word) > 2,
         nchar(word) < 30) %>%  # keep words of 3-29 characters
  group_by(word) %>%
  filter(n() / length(unique(token$document)) < 0.8,        # remove words that occur in more than 80% of documents
         n() / length(unique(token$document)) > 0.00001) %>%  # remove words that occur in fewer than 0.001% of documents
  tidytext::cast_dtm(document = document, term = word, value = n)
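One idea I had (an untested sketch, so I may be off base): precompute the total document count once with dplyr's n_distinct() instead of recomputing length(unique(...)) inside filter() for every word group. The n_docs name is just a helper I made up. Would something like this be faster?

library(dplyr)

n_docs <- n_distinct(token$document)  # total number of documents, computed once

dtm <-
  token %>%
  count(document, word) %>%
  filter(nchar(word) > 2,
         nchar(word) < 30) %>%
  group_by(word) %>%
  filter(n() / n_docs < 0.8,         # after count(), n() = number of documents containing the word
         n() / n_docs > 0.00001) %>%
  ungroup() %>%
  tidytext::cast_dtm(document = document, term = word, value = n)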