
I'm looking at the different weighting options in dfm_weight(). If I select scheme = "prop" and then group textstat_frequency() by location, what is the proper interpretation of the value for a word in each group?

Say in New York the term "career" is 0.6 and in Boston the word "team" is 4.0; how can I interpret these numbers?

# df, toRemove, and lemma are defined earlier in my script
corp <- corpus(df, text_field = "What are the areas that need the most improvement at our company?") %>%
  dfm(remove_numbers = TRUE, remove_punct = TRUE,
      remove = c(toRemove, stopwords("english")), ngrams = 1:2) %>%
  dfm_weight(scheme = "prop") %>%
  dfm_replace(pattern = as.character(lemma$first),
              replacement = as.character(lemma$X1)) %>%
  dfm_remove(pattern = c(paste0("^", stopwords("english"), "_"),
                         paste0("_", stopwords("english"), "$")),
             valuetype = "regex")

freq_weight <- textstat_frequency(corp, n = 10, groups = "location")


ggplot(data = freq_weight, aes(x = nrow(freq_weight):1, y = frequency)) +
  geom_bar(stat = "identity") +
  facet_wrap(~ group, scales = "free") +
  coord_flip() +
  scale_x_continuous(breaks = nrow(freq_weight):1,
                     labels = freq_weight$feature) +
  labs(x = NULL, y = "Relative frequency")
Ted Mosby

1 Answer


The proper interpretation is that each value is the sum of the original within-document term proportions, summed by group. This is not a very natural interpretation, because it adds up proportions without telling you how many tokens (in absolute frequency) each proportion was based on before the summing.

quanteda < 1.4 disallowed this, but following a discussion we enabled it (let the user beware).

library("quanteda")
#> Package version: 1.4.3
corp <- corpus(c("a b b c c", 
                 "a a b", 
                 "b b c",
                 "c c c d"),
               docvars = data.frame(grp = c(1, 1, 2, 2)))
dfmat <- dfm(corp) %>%
    dfm_weight(scheme = "prop")
dfmat
#> Document-feature matrix of: 4 documents, 4 features (43.8% sparse).
#> 4 x 4 sparse Matrix of class "dfm"
#>        features
#> docs            a         b         c    d
#>   text1 0.2000000 0.4000000 0.4000000 0   
#>   text2 0.6666667 0.3333333 0         0   
#>   text3 0         0.6666667 0.3333333 0   
#>   text4 0         0         0.7500000 0.25
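
As a quick sanity check, each document's weights now sum to 1 (rowSums() works here because a dfm is a sparse Matrix object):

# each row (document) of the proportion-weighted dfm sums to 1
rowSums(dfmat)
#> text1 text2 text3 text4 
#>     1     1     1     1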

Now we can compare textstat_frequency() with and without groups. (Neither makes too much sense.)

# sum across the corpus
textstat_frequency(dfmat, groups = NULL)
#>   feature frequency rank docfreq group
#> 1       c 1.4833333    1       3   all
#> 2       b 1.4000000    2       3   all
#> 3       a 0.8666667    3       2   all
#> 4       d 0.2500000    4       1   all

# sum across groups
textstat_frequency(dfmat, groups = "grp")
#>   feature frequency rank docfreq group
#> 1       a 0.8666667    1       2     1
#> 2       b 0.7333333    2       2     1
#> 3       c 0.4000000    3       1     1
#> 4       c 1.0833333    1       2     2
#> 5       b 0.6666667    2       1     2
#> 6       d 0.2500000    3       1     2
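
To see where the grouped numbers come from, take group 1 (text1 "a b b c c" and text2 "a a b"): each grouped "frequency" is a sum of per-document proportions, so the values for a group add up to the number of documents in the group, not to 1:

# a: 1/5 + 2/3,  b: 2/5 + 1/3,  c: 2/5 + 0
c(a = 1/5 + 2/3, b = 2/5 + 1/3, c = 2/5)
#>         a         b         c 
#> 0.8666667 0.7333333 0.4000000
sum(1/5 + 2/3, 2/5 + 1/3, 2/5)  # sums to 2, the number of documents in group 1
#> [1] 2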

If what you wanted was the relative term frequencies after grouping, then you can first group the dfm and then weight it, like this:

dfmat2 <- dfm(corp) %>%
    dfm_group(groups = "grp") %>%
    dfm_weight(scheme = "prop")

textstat_frequency(dfmat2, groups = "grp")
#>   feature frequency rank docfreq group
#> 1       a 0.3750000    1       1     1
#> 2       b 0.3750000    1       1     1
#> 3       c 0.2500000    3       1     1
#> 4       c 0.5714286    1       1     2
#> 5       b 0.2857143    2       1     2
#> 6       d 0.1428571    3       1     2

Now, the term frequencies sum to 1.0 within group, making their interpretation more natural because they were computed on grouped counts, not grouped proportions.
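
You can verify these against the pooled counts: group 2 combines text3 ("b b c") and text4 ("c c c d"), 7 tokens in all, so each weight is simply a grouped count divided by 7:

# group 2 counts: c = 4, b = 2, d = 1, out of 7 tokens total
c(c = 4/7, b = 2/7, d = 1/7)
#>         c         b         d 
#> 0.5714286 0.2857143 0.1428571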

Ken Benoit
  • Ah okay, that makes sense. So if someone asks how they should read that 6 or 0.4, it's how much that term appears in the group compared to all groups, or am I reading that wrong? – Ted Mosby Jul 02 '19 at 19:12
  • What I'm also finding is that if I weight them before grouping (the way you say is inaccurate, and I agree), I'm getting far more descriptive top terms. If I group then weight, I'm getting terms like "now", "like", "good", etc., but doing the inverse I'm getting terms that can tell a better story ("leadership", "clear_communication", etc.). I'm not sure what to make of this either. Any guidance? Thanks! – Ted Mosby Jul 03 '19 at 13:16
  • group -> weight -> frequency shows you the most frequent terms within group. weight -> group -> frequency shows you the sums of the relative weights within group, so the words with the highest scores will be the words that occur with the highest regularity of being high-proportion words in each document. The exact interpretation will depend entirely on your context and the nature (and lengths) of your documents. – Ken Benoit Jul 03 '19 at 16:48