How to use quanteda on aggregated data?

Question

Consider this example:

tibble(text = c('a grande latte with soy milk',
                'black coffee no room'),
       repetition = c(100, 2)) 
# A tibble: 2 x 2
  text                         repetition
  <chr>                             <dbl>
1 a grande latte with soy milk        100
2 black coffee no room                  2

The data means that the sentence "a grande latte with soy milk" appears 100 times in my dataset. Of course, it is a waste of memory to store that redundancy, which is why I have the repetition variable.

Still, I would like the dfm from quanteda to reflect those counts, because the sparseness of the dfm gives me some room to keep that information. That is, how can I still have 100 rows for the first text in the dfm? The following code does not take repetition into account:

library(tibble)
library(quanteda)

tibble(text = c('a grande latte with soy milk',
                'black coffee no room'),
       repetition = c(100, 2)) %>% 
  corpus() %>% 
  tokens() %>% 
  dfm()
Document-feature matrix of: 2 documents, 10 features (50.0% sparse).
2 x 10 sparse Matrix of class "dfm"
       features
docs    a grande latte with soy milk black coffee no room
  text1 1      1     1    1   1    1     0      0  0    0
  text2 0      0     0    0   0    0     1      1  1    1

score 2 · Accepted Answer · answered Feb 15 '19 at 19:12

Supposing your data.frame is called df1, you can use cbind to add the counts as an extra column of the dfm. That stores repetition as if it were a feature, though, which might not be the result you need. The other two options below are probably better.

cbind

library(tibble)
library(quanteda)

df1 <- tibble(text = c('a grande latte with soy milk',
                'black coffee no room'),
       repetition = c(100, 2))

# build the plain dfm first so it can be reused in the options below
my_dfm <- df1 %>%  
  corpus() %>% 
  tokens() %>% 
  dfm()

# add a column named repetition to the dfm
cbind(my_dfm, repetition = df1$repetition)

Document-feature matrix of: 2 documents, 11 features (45.5% sparse).
2 x 11 sparse Matrix of class "dfm"
       features
docs    a grande latte with soy milk black coffee no room repetition
  text1 1      1     1    1   1    1     0      0  0    0        100
  text2 0      0     0    0   0    0     1      1  1    1          2

docvars

You can also attach the counts with the docvars function. The data is then carried along with the dfm as a document-level variable rather than as a feature column, so it is a bit more hidden in the dfm-class slots (reachable with @).

docvars(my_dfm, "repetition") <- df1$repetition
docvars(my_dfm)

      repetition
text1        100
text2          2

multiplication

Multiply each row of the dfm by its repetition count, so that the feature counts themselves reflect the aggregation:

my_dfm * df1$repetition

Document-feature matrix of: 2 documents, 10 features (50.0% sparse).
2 x 10 sparse Matrix of class "dfm"
       features
docs      a grande latte with soy milk black coffee no room
  text1 100    100   100  100 100  100     0      0  0    0
  text2   0      0     0    0   0    0     2      2  2    2
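
The multiplication works row by row because R recycles the vector down each column. The sketch below shows that behaviour on a plain base R matrix with made-up counts (the names `counts` and `weighted` are hypothetical and not part of the answer); a dfm is a sparse matrix, so it appears to follow the same rule, as the output above suggests.

```r
# Toy 2 x 3 count matrix standing in for a dfm (hypothetical data).
counts <- matrix(
  c(1, 1, 0,
    0, 1, 1),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("text1", "text2"), c("latte", "milk", "coffee"))
)
repetition <- c(100, 2)

# R recycles the vector down each column, so element i scales row i.
# This only lines up when there is exactly one weight per row.
stopifnot(length(repetition) == nrow(counts))

weighted <- counts * repetition
weighted
```

If the vector length differs from the number of rows, the recycling no longer aligns weights with documents, which is why the length check is worth keeping.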

thank you this is very nice. Is there a way to actually repeat the first row in the `dfm` 100 times? that is, creating a `text1_1`, `text1_2`, .. `text1_100` that contains the same columns values as `text1` in the original dfm? In a sense, duplicating the rows of the dfm directly instead of adding a column or multiplying the inputs — ℕʘʘḆḽḘ, Feb 15 '19 at 19:32

score 1 · Answer 2 · answered Feb 16 '19 at 03:34

You could use indexing to get the repetition you want, while maintaining the efficiency of just having the single texts.

library("tibble")
library("quanteda", warn.conflicts = FALSE)
## Package version: 1.4.1
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.

tib <- tibble(
  text = c(
    "a grande latte with soy milk",
    "black coffee no room"
  ),
  repetition = c(100, 2)
)
dfmat <- corpus(tib) %>%
  dfm()

Define a function that turns the "repetition" counts into a row index, repeating each row position as many times as its count:

repindex <- function(x) rep(seq_along(x), times = x)

Then index the two-document dfm with that repeated index:

dfmat2 <- dfmat[repindex(tib$repetition), ]
dfmat2
## Document-feature matrix of: 102 documents, 10 features (40.4% sparse).

head(dfmat2, 2)
## Document-feature matrix of: 2 documents, 10 features (40.0% sparse).
## 2 x 10 sparse Matrix of class "dfm"
##        features
## docs    a grande latte with soy milk black coffee no room
##   text1 1      1     1    1   1    1     0      0  0    0
##   text1 1      1     1    1   1    1     0      0  0    0
tail(dfmat2, 4)
## Document-feature matrix of: 4 documents, 10 features (50.0% sparse).
## 4 x 10 sparse Matrix of class "dfm"
##        features
## docs    a grande latte with soy milk black coffee no room
##   text1 1      1     1    1   1    1     0      0  0    0
##   text1 1      1     1    1   1    1     0      0  0    0
##   text2 0      0     0    0   0    0     1      1  1    1
##   text2 0      0     0    0   0    0     1      1  1    1
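
To see what `repindex()` produces, and how the repeated rows could be given distinct names such as `text1_1`, `text1_2`, here is a minimal sketch on a plain base R matrix. The toy data and the names `counts`, `idx` and `expanded` are assumptions for illustration; the same index and name vectors could presumably be applied to a dfm, but that is not tested here.

```r
# Toy 2 x 3 count matrix standing in for a dfm (hypothetical data).
counts <- matrix(
  c(1, 1, 0,
    0, 1, 1),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("text1", "text2"), c("latte", "milk", "coffee"))
)
repetition <- c(3, 2)

# Same helper as in the answer: row position i is repeated x[i] times.
repindex <- function(x) rep(seq_along(x), times = x)
idx <- repindex(repetition)

# Repeat the rows; drop = FALSE keeps the result a matrix.
expanded <- counts[idx, , drop = FALSE]

# Give every copy a unique name: original name plus a running counter.
rownames(expanded) <- paste0(rownames(counts)[idx], "_", sequence(repetition))
expanded

# The column totals agree with the row-wise multiplication approach.
all(colSums(expanded) == colSums(counts * repetition))
```

Repeating rows and multiplying rows give the same feature totals; they differ only in whether each copy exists as its own document.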

actually just a follow up, if the `repetition` is actually stored as a docvars, I guess I can simply use `dfmat[repindex(docvars(dfmat, 'repetition')), ]` ... would the correct ordering of the variables be preserved in that case? — ℕʘʘḆḽḘ, Feb 25 '19 at 14:38
Yes, absolutely. The `corpus(tib)` automatically reads the `repetition` column of your tibble as a docvar, which is passed on to `dfmat`, so that will work as is on `dfmat`. — Ken Benoit, Feb 25 '19 at 20:27
