I'm working with a corpus of about 1M documents and applying several transformations while building a document-feature matrix (dfm) from it:
library(quanteda)

corpus_dfm <- dfm(tokens(corpus1M),  # corpus1M is already a corpus via quanteda::corpus()
                  remove = stopwords("english"),
                  # what = "word",  # experimented with whether adding this made a difference
                  remove_punct = TRUE,
                  remove_numbers = TRUE,
                  remove_symbols = TRUE,
                  ngrams = 1:2,
                  dictionary = lut_dict,
                  stem = TRUE)
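For context, this is what I assumed that single wrapped call was doing, written out with the step-by-step tokens_*/dfm_* verbs (a sketch of my mental model, not verified; I'm not certain dfm() applies the steps in exactly this order):

# my assumed expansion of the wrapped dfm() call above -- a sketch, not verified
toks <- tokens(corpus1M,
               remove_punct   = TRUE,
               remove_numbers = TRUE,
               remove_symbols = TRUE)
toks <- tokens_remove(toks, stopwords("english"))   # remove = stopwords("english")
toks <- tokens_wordstem(toks)                       # stem = TRUE
toks <- tokens_ngrams(toks, n = 1:2)                # ngrams = 1:2
corpus_dfm <- dfm(toks)
corpus_dfm <- dfm_lookup(corpus_dfm, dictionary = lut_dict)  # dictionary = lut_dict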
Then to look at the resulting features:
dimnames(corpus_dfm)$features
[1] "abandon"
[2] "abandoned auto"
[3] "abandoned vehicl"
...
[8] "accident hit and run"
...
[60] "assault no weapon aggravated injuri"
Why are these features longer than the 1:2 ngrams I requested? Stemming appears to have been applied successfully, but some of the features look like whole phrases rather than single words or bigrams.
I tried adjusting the tokens() call to tokens(corpus1M, what = "word"), keeping the rest of the dfm() arguments the same, but there was no change in the output.
I tried to make a tiny reproducible example:
library(tidyverse)  # just for the pipe here

example_text <- c("the quick brown fox",
                  "I like carrots",
                  "the there that etc cats dogs") %>%
  corpus()
Then I applied the same dfm() call as above to example_text:
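Written out in full (the same arguments as before, with only the input swapped; lut_dict is the same dictionary):

corpus_dfm <- dfm(tokens(example_text),
                  remove = stopwords("english"),
                  remove_punct = TRUE,
                  remove_numbers = TRUE,
                  remove_symbols = TRUE,
                  ngrams = 1:2,
                  dictionary = lut_dict,
                  stem = TRUE)

Inspecting the features: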
> dimnames(corpus_dfm)$features
[1] "etc."
This was surprising: nearly every word has been removed, even the stopwords this time (unlike before, where stopwords survived inside features like "accident hit and run"), so I'm more confused! I'm also now unable to create a reproducible example, despite having just tried. Maybe I've misunderstood how this function works?
How can I create a dfm in quanteda whose features are only 1:2-word tokens (unigrams and bigrams), with stopwords removed?
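To show the shape of output I'm after: dropping the dictionary step and using the step-by-step verbs on the toy corpus gives the kind of features I want (a sketch based on my reading of the docs; the commented output is hand-written, not pasted from a session):

toks <- tokens(example_text, remove_punct = TRUE)
toks <- tokens_remove(toks, stopwords("english"))  # drop English stopwords
toks <- tokens_wordstem(toks)                      # "carrots" -> "carrot", etc.
toks <- tokens_ngrams(toks, n = 1:2)               # unigrams + bigrams, "_" concatenator
featnames(dfm(toks))
# roughly: "quick" "brown" "fox" "like" "carrot" "etc" "cat" "dog"
#          "quick_brown" "brown_fox" "like_carrot" "etc_cat" "cat_dog"

But I still need the lut_dict lookup applied as well, so this isn't a full solution.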