Consider this example:
tibble(text = c('a grande latte with soy milk',
                'black coffee no room'),
       repetition = c(100, 2))
# A tibble: 2 x 2
  text                         repetition
  <chr>                             <dbl>
1 a grande latte with soy milk        100
2 black coffee no room                  2
The data mean that the sentence 'a grande latte with soy milk' appears 100 times in my dataset. Storing every copy would waste memory, which is why I keep a repetition variable instead.
Still, I would like the dfm (document-feature matrix) from quanteda to reflect those counts, because the sparseness of the dfm gives me some room to keep that information. That is, how can I still have 100 rows for the first text in the dfm? Just using the following code does not take repetition into account:
library(dplyr)     # provides tibble() and %>%
library(quanteda)  # provides corpus(), tokens(), dfm()

tibble(text = c('a grande latte with soy milk',
                'black coffee no room'),
       repetition = c(100, 2)) %>%
  corpus() %>%
  tokens() %>%
  dfm()
Document-feature matrix of: 2 documents, 10 features (50.0% sparse).
2 x 10 sparse Matrix of class "dfm"
       features
docs    a grande latte with soy milk black coffee no room
  text1 1      1     1    1   1    1     0      0  0    0
  text2 0      0     0    0   0    0     1      1  1    1
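For illustration, here is a minimal sketch of the result I am after, assuming it is acceptable to expand the rows before building the corpus. It relies on tidyr's uncount(), which is not part of the code above and is only one possible approach; the name dfm_rep is made up for this example.

```r
library(dplyr)     # tibble(), %>%
library(tidyr)     # uncount() (assumption: tidyr is available)
library(quanteda)  # corpus(), tokens(), dfm(), ndoc()

# dfm_rep is a hypothetical name used only in this sketch
dfm_rep <- tibble(text = c('a grande latte with soy milk',
                           'black coffee no room'),
                  repetition = c(100, 2)) %>%
  # duplicate each row `repetition` times (100 + 2 rows in total);
  # .remove = FALSE keeps the repetition column in the data frame
  uncount(repetition, .remove = FALSE) %>%
  corpus() %>%
  tokens() %>%
  dfm()

ndoc(dfm_rep)  # number of documents (rows) in the expanded dfm
```

This gives one row per occurrence, but it tokenizes every duplicated text again, so the redundancy is only avoided in the stored tibble, not during processing.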