I am trying to create trigrams and bigrams from a large (1GB) text file using the 'quanteda' package in the R programming environment. If I try and run my code in one go (as below) R just hangs (on the 3rd line - myCorpus<-toLower(...)). I used the code successfully on a small dataset <1mb, so I guess the file is too large. I can see I perhaps need to load the text in 'chunks' and combine the resulting frequencies of bigrams and trigrams afterwards. But I cannot work out how to load and process the text in manageable 'chunks'. Any advice on an approach to this problem would be very welcome. My code is pasted below. Any suggestions for other approaches for improving my code would be also welcome.
folder.dataset.english <- 'final/corpus'
myCorpus <- corpus(x=textfile(list.files(path = folder.dataset.english, pattern = "\\.txt$", full.names = TRUE, recursive = FALSE))) # build the corpus
myCorpus<-toLower(myCorpus, keepAcronyms = TRUE)
#bigrams
bigrams<-dfm(myCorpus, ngrams = 2,verbose = TRUE, toLower = TRUE,
removeNumbers = TRUE, removePunct = TRUE, removeSeparators = TRUE,removeTwitter = TRUE, stem = FALSE)
bigrams_freq<-sort(colSums(bigrams),decreasing=T)
bigrams<-data.frame(names=names(bigrams_freq),freq=bigrams_freq,stringsAsFactors =FALSE)
bigrams$first<- sapply(strsplit(bigrams$names, "_"), "[[", 1)
bigrams$last<- sapply(strsplit(bigrams$names, "_"), "[[", 2)
rownames(bigrams)<-NULL
bigrams.freq.freq<-table(bigrams$freq)
saveRDS(bigrams,"dictionaries/bigrams.rds")
#trigrams
trigrams<-dfm(myCorpus, ngrams = 3,verbose = TRUE, toLower = TRUE,
removeNumbers = TRUE, removePunct = TRUE, removeSeparators = TRUE,
removeTwitter = TRUE, stem = FALSE)
trigrams_freq<-sort(colSums(trigrams),decreasing=T)
trigrams<-data.frame(names=names(trigrams_freq),freq=trigrams_freq,stringsAsFactors =FALSE)
trigrams$first<-paste(sapply(strsplit(trigrams$names, "_"), "[[", 1),sapply(strsplit(trigrams$names, "_"), "[[", 2),sep="_")
trigrams$last<-sapply(strsplit(trigrams$names, "_"), "[[", 3)
rownames(trigrams)<-NULL
saveRDS(trigrams,"dictionaries/trigrams.rds")