
I am using quanteda to build document-feature matrices (dfms) from different data sources. Building a dfm from parliamentary speech data or from Facebook data takes just a few minutes, but compiling a dfm from a Twitter dataset takes more than 7 hours. The three datasets are approximately equal in size (about 60 MB each).

R (version 3.5.3), RStudio (version 1.3.923), and quanteda (version 2.0.1) are all up to date, and I am working on a 2018 MacBook Pro (macOS 10.14.5).

Running the exact same code on another machine with an older version of quanteda (version 1.5.2) takes just a few minutes instead of several hours.

Unfortunately, I cannot provide a reproducible example since the data cannot be shared.

Do you have any ideas about what the problem might be and how I can work around it?

Here are the sessionInfo() output and the code plus output from the problematic machine that needs more than 7 hours to create the dfm:

> sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.5

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK:   /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] quanteda_2.0.1  forcats_0.5.0   stringr_1.4.0   dplyr_0.8.5     purrr_0.3.3     readr_1.3.1     tidyr_1.0.2    
[8] tibble_3.0.0    ggplot2_3.3.0   tidyverse_1.3.0

loaded via a namespace (and not attached):
[1] tinytex_0.20       tidyselect_1.0.0   xfun_0.12          haven_2.2.0        lattice_0.20-40    colorspace_1.4-1  
[7] vctrs_0.2.4        generics_0.0.2     yaml_2.2.1         rlang_0.4.5        pillar_1.4.3       glue_1.3.2        
[13] withr_2.1.2        DBI_1.1.0          dbplyr_1.4.2       modelr_0.1.6       readxl_1.3.1       lifecycle_0.2.0   
[19] munsell_0.5.0      gtable_0.3.0       cellranger_1.1.0   rvest_0.3.5        fansi_0.4.1        broom_0.5.5       
[25] Rcpp_1.0.4         scales_1.1.0       backports_1.1.5    RcppParallel_5.0.0 jsonlite_1.6.1     fs_1.3.2          
[31] fastmatch_1.1-0    stopwords_1.0      hms_0.5.3          stringi_1.4.6      grid_3.5.3         cli_2.0.2         
[37] tools_3.5.3        magrittr_1.5       crayon_1.3.4       pkgconfig_2.0.3    ellipsis_0.3.0     Matrix_1.2-18     
[43] data.table_1.12.8  xml2_1.3.0         reprex_0.3.0       lubridate_1.7.4    assertthat_0.2.1   httr_1.4.1        
[49] rstudioapi_0.11    R6_2.4.1           nlme_3.1-145       compiler_3.5.3    

> dtmTW <- dfm(corpTW, groups = "user.id",
+              remove = stopwords("de"), 
+              tolower = TRUE,
+              remove_punct = TRUE,
+              remove_numbers = TRUE,
+              remove_twitter = TRUE, 
+              remove_url = TRUE,
+              dictionary = myDict,
+              verbose = TRUE)
Creating a dfm from a corpus input...
  ...lowercasing
  ...found 886,166 documents, 543,035 features
  ...grouping texts
  ...applying a dictionary consisting of 1 key
  ...removed 0 features
  ...complete, elapsed time:  25338 seconds.
  Finished constructing a 408 x 1 sparse dfm.
  Warning message:
 'remove_twitter' is deprecated; for FALSE, use 'what = "word"' instead. 
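For reference, the deprecation warning suggests the call can be rewritten against the quanteda v2 tokens API instead of passing everything to dfm(). The following is only a minimal sketch of such a pipeline, assuming corpTW, myDict, and the user.id docvar as above; it restructures the call but is not claimed to fix the slowdown:

# sketch: v2-style pipeline using tokens() instead of dfm() arguments
toks <- tokens(corpTW,
               remove_punct = TRUE,
               remove_numbers = TRUE,
               remove_url = TRUE)
toks <- tokens_tolower(toks)
toks <- tokens_remove(toks, stopwords("de"))   # drop German stopwords
dtmTW <- dfm(toks, verbose = TRUE)
dtmTW <- dfm_group(dtmTW, groups = "user.id")  # aggregate documents by user
dtmTW <- dfm_lookup(dtmTW, dictionary = myDict)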

Here are the sessionInfo() output and the code plus output from the machine that creates the same dfm in under two minutes:

R version 3.6.1 (2019-07-05)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

Random number generation:
 RNG:     Mersenne-Twister 
 Normal:  Inversion 
 Sample:  Rounding 

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] quanteda_1.5.2  forcats_0.4.0   stringr_1.4.0   dplyr_0.8.4     purrr_0.3.3    
 [6] readr_1.3.1     tidyr_1.0.0     tibble_2.1.3    ggplot2_3.2.1   tidyverse_1.3.0

> dtmTW <- dfm(corpTW, groups = "user.id",
+              remove = stopwords("de"), 
+              tolower = TRUE,
+              remove_punct = TRUE,
+              remove_numbers = TRUE,
+              remove_twitter = TRUE, 
+              remove_url = TRUE,
+              dictionary = myDict, 
+              verbose = TRUE)
Creating a dfm from a corpus input...
   ... lowercasing
   ... found 886,166 documents, 471,981 features
   ... grouping texts
   ... applying a dictionary consisting of 1 key
   ... removed 0 features
   ... created a 408 x 1 sparse dfm
   ... complete. 
Elapsed time: 108 seconds.
    Can you supply `sessionInfo()` for each machine? And perhaps show the code and output, with `dfm(..., verbose = TRUE)`? And: You should be comparing the same code on different machines, not different tasks on different machines. – Ken Benoit Apr 02 '20 at 22:02
  • The additional information has now been added to the post. – wldmstr Apr 03 '20 at 20:15
  • I could replicate the issue and patched via https://github.com/quanteda/quanteda/pull/1920. Can you test the Github version? – Kohei Watanabe Apr 04 '20 at 10:08
  • The Github version (2.0.2) is faster than version 2.0.1, but only slightly. I have tested it with a Facebook corpus, which took 1004 seconds on version 2.0.1 on the problematic machine described above; on version 2.0.2 the same machine takes 865 seconds. However, the second machine described in the post needs just 51.8 seconds for the same task. – wldmstr Apr 04 '20 at 12:42
  • I have downgraded the quanteda version on the problematic machine to 1.5.0 and all the code runs smoothly now. The Twitter dfm that took 25338 seconds to compile on 2.0.1 needs just 83.2 seconds on 1.5.0. – wldmstr Apr 04 '20 at 13:00
  • The patch changes how tokens() handles URLs, so the speed is more or less the same for Facebook posts (unless they contain many links). You can call tokens(x, what = "word1") to use the older tokenizer (see the sketch after these comments). – Kohei Watanabe Apr 04 '20 at 16:00
  • This is fixed now in the master branch (on GitHub). In the future please file such items as GitHub issues, not SO questions. – Ken Benoit Apr 04 '20 at 17:55
  • I'm voting to close this question as off-topic because it was a bug report, not a How To or programming question. Fixed on GitHub (where it should have been filed as an issue). – Ken Benoit Apr 04 '20 at 17:56
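Based on Kohei Watanabe's comment, here is a minimal sketch of the "word1" workaround, which tokenizes with the pre-2.0 tokenizer before building the dfm; it assumes corpTW, myDict, and the user.id docvar as in the question:

# workaround sketch: use the legacy "word1" tokenizer, then build the dfm as before
toks <- tokens(corpTW, what = "word1",
               remove_punct = TRUE,
               remove_numbers = TRUE,
               remove_url = TRUE)
dtmTW <- dfm(toks,
             tolower = TRUE,
             remove = stopwords("de"),
             groups = "user.id",
             dictionary = myDict,
             verbose = TRUE)

Alternatively, the downgrade route mentioned in the comments can be done with, for example, remotes::install_version("quanteda", version = "1.5.2"), assuming the remotes package is installed.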

0 Answers