So I have this massive tibble with tokens that I'm trying to do some filtering on and then transform into a document term matrix.

My problem is that the grouped filtering process runs really slow.

Does anyone have a good suggestion on how I can speed up the process or remove words that occur in more/less than n% documents? (I do not like the TM package, and I'm a beginner).

The code:

dtm <-
  token %>% 
  count(document, word) %>%
  filter(nchar(word) > 2,
         nchar(word) < 30) %>% # Keep words of 3 to 29 characters
  group_by(word) %>%
  filter((n() / length(unique(token$document))) < 0.8,       # Remove words that occur in more than n% of documents
         (n() / length(unique(token$document))) > 0.00001) %>%   # Remove words that occur in less than n% of documents
  tidytext::cast_dtm(document = document, term = word, value = n)
MariusJ

1 Answer

Looks to me like the filtering does not need to happen within the grouping, and neither does computing how many documents you have:

library(tidyverse)
library(tidytext)
data(tate_text, package = "modeldata")

tidy_tate <- tate_text %>% unnest_tokens(word, title)

tate_ids <- n_distinct(tate_text$id)

bench::mark(
    original = tidy_tate %>% 
        count(id, word) %>%
        filter(nchar(word) > 2,
               nchar(word) < 30) %>%
        group_by(word) %>%
        filter((n()/length(unique(tate_text$id))) < 0.8,       
               (n()/length(unique(tate_text$id))) > 0.00001) %>%  
        tidytext::cast_dtm(document = id, term = word, value = n),
    new = tidy_tate %>% 
        count(id, word) %>%
        filter(nchar(word) > 2,
               nchar(word) < 30) %>%
        group_by(word) %>%
        mutate(uses = n()) %>%
        ungroup() %>%
        filter(uses/tate_ids < 0.8, uses/tate_ids > 0.00001) %>%  
        tidytext::cast_dtm(document = id, term = word, value = n)
)

#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 original      669ms    669ms      1.49  976.87MB     61.3
#> 2 new           110ms    112ms      8.70    8.54MB     13.9

Created on 2021-11-15 by the reprex package (v2.0.1)

The new way is about 6 times faster, allocates far less memory, and gives the same result.
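If the `group_by() %>% mutate() %>% ungroup()` sequence feels noisy, dplyr's `add_count()` collapses those three steps into one call. A sketch of the same pipeline, assuming the `tidy_tate` and `tate_ids` objects from the benchmark above:

```r
library(dplyr)
library(tidytext)

dtm <- tidy_tate %>%
  count(id, word) %>%
  filter(nchar(word) > 2, nchar(word) < 30) %>%
  # add_count(word) appends one column with the row count per word,
  # equivalent to group_by(word) %>% mutate(uses = n()) %>% ungroup()
  add_count(word, name = "uses") %>%
  filter(uses / tate_ids < 0.8, uses / tate_ids > 0.00001) %>%
  cast_dtm(document = id, term = word, value = n)
```

Since the data is ungrouped throughout, `cast_dtm()` also no longer needs a preceding `ungroup()`.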

Julia Silge