I'm working with a corpus of about 1M documents and applying several transformations while building a document-feature matrix (dfm) from it:
library(quanteda)

corpus_dfm <- dfm(tokens(corpus1M),  # corpus1M is already a corpus via quanteda::corpus()
                  remove = stopwords("english"),
                  # what = "word",  # experimented with whether adding this made a difference
                  remove_punct = TRUE,
                  remove_numbers = TRUE,
                  remove_symbols = TRUE,
                  ngrams = 1:2,
                  dictionary = lut_dict,
                  stem = TRUE)
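For context, this is what I assumed that single wrapped call was doing, written out with the step-by-step tokens_*/dfm_* verbs (a sketch of my mental model, not verified; I'm not certain dfm() applies the steps in exactly this order):

# my assumed expansion of the wrapped dfm() call above -- a sketch, not verified
toks <- tokens(corpus1M,
               remove_punct   = TRUE,
               remove_numbers = TRUE,
               remove_symbols = TRUE)
toks <- tokens_remove(toks, stopwords("english"))   # remove = stopwords("english")
toks <- tokens_wordstem(toks)                       # stem = TRUE
toks <- tokens_ngrams(toks, n = 1:2)                # ngrams = 1:2
corpus_dfm <- dfm(toks)
corpus_dfm <- dfm_lookup(corpus_dfm, dictionary = lut_dict)  # dictionary = lut_dict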
Then to look at the resulting features:
dimnames(corpus_dfm)$features
[1] "abandon"
[2] "abandoned auto"
[3] "abandoned vehicl"
...
[8] "accident hit and run"
...
[60] "assault no weapon aggravated injuri"
Why are these features longer than the 1:2 ngrams I requested? Stemming appears to have been applied successfully, but some of the features look like whole phrases rather than single words or bigrams.
I tried adjusting the tokens() call to tokens(corpus1M, what = "word"), keeping the rest of the dfm() arguments the same, but there was no change in the output.
I tried to make a tiny reproducible example:
library(tidyverse)  # just for the pipe here

example_text <- c("the quick brown fox",
                  "I like carrots",
                  "the there that etc cats dogs") %>%
  corpus()
Then I applied the same dfm() call as above to example_text:
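Written out in full (the same arguments as before, with only the input swapped; lut_dict is the same dictionary):

corpus_dfm <- dfm(tokens(example_text),
                  remove = stopwords("english"),
                  remove_punct = TRUE,
                  remove_numbers = TRUE,
                  remove_symbols = TRUE,
                  ngrams = 1:2,
                  dictionary = lut_dict,
                  stem = TRUE)

Inspecting the features: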
> dimnames(corpus_dfm)$features
[1] "etc."
This was surprising: nearly every word has been removed, even the stopwords this time (unlike before, where stopwords survived inside features like "accident hit and run"), so I'm more confused! I'm also now unable to create a reproducible example, despite having just tried. Maybe I've misunderstood how this function works?
How can I create a dfm in quanteda whose features are only 1:2-word tokens (unigrams and bigrams), with stopwords removed?
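To show the shape of output I'm after: dropping the dictionary step and using the step-by-step verbs on the toy corpus gives the kind of features I want (a sketch based on my reading of the docs; the commented output is hand-written, not pasted from a session):

toks <- tokens(example_text, remove_punct = TRUE)
toks <- tokens_remove(toks, stopwords("english"))  # drop English stopwords
toks <- tokens_wordstem(toks)                      # "carrots" -> "carrot", etc.
toks <- tokens_ngrams(toks, n = 1:2)               # unigrams + bigrams, "_" concatenator
featnames(dfm(toks))
# roughly: "quick" "brown" "fox" "like" "carrot" "etc" "cat" "dog"
#          "quick_brown" "brown_fox" "like_carrot" "etc_cat" "cat_dog"

But I still need the lut_dict lookup applied as well, so this isn't a full solution.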