I am trying to create a document-feature matrix (DFM) of character-level bigrams in R with quanteda. The last line of my code (the `dfm()` call) runs for hours and never finishes, while every other line completes in under a minute. I am not sure what to do; any advice would be appreciated.
Code:
library(quanteda)
#Tokenise corpus by characters
character_level_tokens = quanteda::tokens(corpus,
                                          what = "character",
                                          remove_punct = TRUE,
                                          remove_symbols = TRUE,
                                          remove_numbers = TRUE,
                                          remove_url = TRUE,
                                          remove_separators = TRUE,
                                          split_hyphens = TRUE)
#Flatten the tokens object into a single character vector (one element per character token)
character_level_tokens = as.character(character_level_tokens)
#Keep only the letters A-Z and a-z; strip everything else
character_level_tokens = gsub("[^A-Za-z]","",character_level_tokens)
#Extract character-level bigrams
final_data_char_bigram = char_ngrams(character_level_tokens, n = 2L, concatenator = "")
#Create document-feature matrix (DFM)
dfm.final_data_char_bigram = dfm(final_data_char_bigram)
> length(final_data_char_bigram)
[1] 37115571
> head(final_data_char_bigram)
[1] "lo" "ov" "ve" "el" "ly" "yt"
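For reference, here is a sketch of how I think the same pipeline could stay entirely inside quanteda's `tokens` class instead of flattening to a 37-million-element character vector, using `tokens_ngrams()` and `tokens_keep()` (these function names are from the quanteda documentation as I understand it; I have not confirmed this produces identical bigram counts to my code above):

```r
library(quanteda)

# Tokenise by character as before; the result stays grouped by document
char_toks = tokens(corpus,
                   what = "character",
                   remove_punct = TRUE,
                   remove_symbols = TRUE,
                   remove_numbers = TRUE,
                   remove_url = TRUE,
                   remove_separators = TRUE,
                   split_hyphens = TRUE)

# Keep only letter tokens (A-Z, a-z), still one tokens object per document
char_toks = tokens_keep(char_toks, pattern = "^[A-Za-z]$", valuetype = "regex")

# Character-level bigrams, still grouped by document
char_bigrams = tokens_ngrams(char_toks, n = 2L, concatenator = "")

# DFM: one row per document, one column per bigram feature
dfm_char_bigrams = dfm(char_bigrams)
```

My guess is that calling `dfm()` directly on the flat character vector treats every one of the ~37M bigrams as its own document, which would explain the run time, but I would appreciate confirmation.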