unnest_tokens and keep original columns (tidytext)

Question

The unnest_tokens function from the tidytext package is supposed to keep the other columns of the data frame (tibble) you pass to it. In the example provided by the package authors ("tidy_books", on Austen's data) it works fine, but I get unexpected behaviour with the following data.

library(dplyr)
library(tidytext)

poem1 <- "Tous les poteaux télégraphiques
Viennent là-bas le long du quai
Sur son sein notre République
A mis ce bouquet de muguet"

poem2 <- "La sottise, l'erreur, le péché, la lésine,
Occupent nos esprits et travaillent nos corps,
Et nous alimentons nos aimables remords,
Comme les mendiants nourrissent leur vermine."

poems <- tibble(n_poem = 1:2, text_poem = c(poem1, poem2))

poems <- poems %>% 
  unnest_tokens(output = lines_poem, input = text_poem, token = "lines")

poems <- poems %>% group_by(n_poem) %>% 
  mutate(n_line = row_number())

This makes me lose all columns:

poems %>% unnest_tokens(output = words_poem, input = lines_poem)

The drop option behaves weirdly and brings back the raw text:

poems %>% unnest_tokens(output = words_poem, input = lines_poem, drop = FALSE)

What do you mean by weird behaviour, and what output do you expect? You need to `ungroup()` your data after the n_line mutate; then the result keeps lines_poem. Admittedly, it feels odd that this only works after ungrouping. - deschen, Nov 22 '21 at 12:02

score 4 · Accepted Answer · answered Nov 22 '21 at 12:02

You need to ungroup your data. As the documentation for the collapse argument notes, grouping the data automatically collapses the text within each group, which is what you see when the input column is not dropped:

Grouping data specifies variables to collapse across in the same way as collapse but you cannot use both the collapse argument and grouped data. Collapsing applies mostly to token options of "ngrams", "skip_ngrams", "sentences", "lines", "paragraphs", or "regex".
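A quick way to confirm that leftover grouping is the cause is to inspect the tibble before tokenizing. The following is a minimal sketch using dplyr only; the table `lines_tbl` and its contents are a hypothetical stand-in for the line-level poems data, not the asker's exact object.

```r
library(dplyr)

# Hypothetical stand-in for the line-level poems table
lines_tbl <- tibble(
  n_poem = c(1L, 1L, 2L, 2L),
  lines_poem = c("tous les poteaux", "viennent là-bas",
                 "la sottise", "occupent nos esprits")
)

# mutate() on grouped data keeps the grouping in place
grouped <- lines_tbl %>%
  group_by(n_poem) %>%
  mutate(n_line = row_number())

is_grouped_df(grouped)  # TRUE: the tibble is still grouped
group_vars(grouped)     # "n_poem"

# ungroup() removes the grouping without changing any values
flat <- ungroup(grouped)
is_grouped_df(flat)     # FALSE
```

If `is_grouped_df()` returns TRUE, unnest_tokens will treat the grouping variables as something to collapse across, as the quoted documentation describes.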

I'm assuming this is your expected behaviour:

poems %>%
  ungroup() %>%
  unnest_tokens(output = words_poem, input = lines_poem, drop = FALSE)
#> # A tibble: 48 × 4
#>    n_poem lines_poem                      n_line words_poem    
#>     <int> <chr>                            <int> <chr>         
#>  1      1 tous les poteaux télégraphiques      1 tous          
#>  2      1 tous les poteaux télégraphiques      1 les           
#>  3      1 tous les poteaux télégraphiques      1 poteaux       
#>  4      1 tous les poteaux télégraphiques      1 télégraphiques
#>  5      1 viennent là-bas le long du quai      2 viennent      
#>  6      1 viennent là-bas le long du quai      2 là            
#>  7      1 viennent là-bas le long du quai      2 bas           
#>  8      1 viennent là-bas le long du quai      2 le            
#>  9      1 viennent là-bas le long du quai      2 long          
#> 10      1 viennent là-bas le long du quai      2 du            
#> # … with 38 more rows
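One way to avoid the problem altogether is to drop the grouping as soon as the per-poem line numbers have been computed, so that no later step sees grouped data. This is a sketch under the assumption that you want the default drop = TRUE (the line text is removed, the other columns are kept); the object names `poem_lines` and `poem_words` and the shortened poem strings are illustrative only.

```r
library(dplyr)
library(tidytext)

# Shortened, illustrative versions of the two poems
poems <- tibble(
  n_poem = 1:2,
  text_poem = c("Tous les poteaux télégraphiques\nViennent là-bas le long du quai",
                "La sottise, l'erreur, le péché\nOccupent nos esprits")
)

# Number the lines within each poem, then ungroup immediately
poem_lines <- poems %>%
  unnest_tokens(output = lines_poem, input = text_poem, token = "lines") %>%
  group_by(n_poem) %>%
  mutate(n_line = row_number()) %>%
  ungroup()

# On ungrouped data, the remaining columns are carried along to the word level
poem_words <- poem_lines %>%
  unnest_tokens(output = words_poem, input = lines_poem)

names(poem_words)
```

Ungrouping at the point where the grouping stops being needed keeps later calls from depending on hidden state in the tibble.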
