I am trying to create a document-feature matrix (DFM) of character-level bigrams in R with quanteda. The last line of my code (the `dfm()` call) runs for hours and never finishes, while every other line completes in under a minute. I am not sure what to do; any advice would be appreciated.
Code:
library(quanteda)
#Tokenise corpus by characters
character_level_tokens = quanteda::tokens(corpus,
                                          what = "character",
                                          remove_punct = TRUE,
                                          remove_symbols = TRUE,
                                          remove_numbers = TRUE,
                                          remove_url = TRUE,
                                          remove_separators = TRUE,
                                          split_hyphens = TRUE)
#Flatten the tokens object into a single character vector (one element per character token)
character_level_tokens = as.character(character_level_tokens)
#Keep only the letters A-Z and a-z; strip everything else
character_level_tokens = gsub("[^A-Za-z]","",character_level_tokens)
#Extract character-level bigrams
final_data_char_bigram = char_ngrams(character_level_tokens, n = 2L, concatenator = "")
#Create document-feature matrix (DFM)
dfm.final_data_char_bigram = dfm(final_data_char_bigram)
> length(final_data_char_bigram)
[1] 37115571
> head(final_data_char_bigram)
[1] "lo" "ov" "ve" "el" "ly" "yt"
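For reference, here is a sketch of how I think the same pipeline could stay entirely inside quanteda's `tokens` class instead of flattening to a 37-million-element character vector, using `tokens_ngrams()` and `tokens_keep()` (these function names are from the quanteda documentation as I understand it; I have not confirmed this produces identical bigram counts to my code above):

```r
library(quanteda)

# Tokenise by character as before; the result stays grouped by document
char_toks = tokens(corpus,
                   what = "character",
                   remove_punct = TRUE,
                   remove_symbols = TRUE,
                   remove_numbers = TRUE,
                   remove_url = TRUE,
                   remove_separators = TRUE,
                   split_hyphens = TRUE)

# Keep only letter tokens (A-Z, a-z), still one tokens object per document
char_toks = tokens_keep(char_toks, pattern = "^[A-Za-z]$", valuetype = "regex")

# Character-level bigrams, still grouped by document
char_bigrams = tokens_ngrams(char_toks, n = 2L, concatenator = "")

# DFM: one row per document, one column per bigram feature
dfm_char_bigrams = dfm(char_bigrams)
```

My guess is that calling `dfm()` directly on the flat character vector treats every one of the ~37M bigrams as its own document, which would explain the run time, but I would appreciate confirmation.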