So I have this massive tibble with tokens that I'm trying to do some filtering on and then transform into a document term matrix.

My problem is that the grouped filtering process runs really slow.

Does anyone have a good suggestion on how I can speed up the process or remove words that occur in more/less than n% documents? (I do not like the TM package, and I'm a beginner).

The code:

dtm <-
  token %>% 
  count(document, word) %>%
  filter(nchar(word) > 2,
         nchar(word) < 30) %>% # Keep words of 3 to 29 characters
  group_by(word) %>%
  filter((n() / length(unique(token$document))) < 0.8,       # Remove words that occur in more than n% of documents
         (n() / length(unique(token$document))) > 0.00001) %>%   # Remove words that occur in less than n% of documents
  tidytext::cast_dtm(document = document, term = word, value = n)
MariusJ

1 Answer

Looks to me like the filtering does not need to happen within the grouping, and neither does computing how many documents you have:

library(tidyverse)
library(tidytext)
data(tate_text, package = "modeldata")

tidy_tate <- tate_text %>% unnest_tokens(word, title)

tate_ids <- n_distinct(tate_text$id)

bench::mark(
    original = tidy_tate %>% 
        count(id, word) %>%
        filter(nchar(word) > 2,
               nchar(word) < 30) %>%
        group_by(word) %>%
        filter((n()/length(unique(tate_text$id))) < 0.8,       
               (n()/length(unique(tate_text$id))) > 0.00001) %>%  
        tidytext::cast_dtm(document = id, term = word, value = n),
    new = tidy_tate %>% 
        count(id, word) %>%
        filter(nchar(word) > 2,
               nchar(word) < 30) %>%
        group_by(word) %>%
        mutate(uses = n()) %>%
        ungroup() %>%
        filter(uses/tate_ids < 0.8, uses/tate_ids > 0.00001) %>%  
        tidytext::cast_dtm(document = id, term = word, value = n)
)

#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 original      669ms    669ms      1.49  976.87MB     61.3
#> 2 new           110ms    112ms      8.70    8.54MB     13.9

Created on 2021-11-15 by the reprex package (v2.0.1)

The new way is about 6 times faster, allocates far less memory, and gives the same result.
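If the `group_by() %>% mutate() %>% ungroup()` sequence feels noisy, dplyr's `add_count()` collapses those three steps into one call. A sketch of the same pipeline, assuming the `tidy_tate` and `tate_ids` objects from the benchmark above:

```r
library(dplyr)
library(tidytext)

dtm <- tidy_tate %>%
  count(id, word) %>%
  filter(nchar(word) > 2, nchar(word) < 30) %>%
  # add_count(word) appends one column with the row count per word,
  # equivalent to group_by(word) %>% mutate(uses = n()) %>% ungroup()
  add_count(word, name = "uses") %>%
  filter(uses / tate_ids < 0.8, uses / tate_ids > 0.00001) %>%
  cast_dtm(document = id, term = word, value = n)
```

Since the data is ungrouped throughout, `cast_dtm()` also no longer needs a preceding `ungroup()`.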

Julia Silge