The unnest_tokens function of the tidytext package is supposed to keep the other columns of the data frame (tibble) you pass to it. In the example provided by the package authors ("tidy_books", on Jane Austen's novels) it works fine, but I get unexpected behaviour with the following data:
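library(dplyr)     # tibble(), %>%, group_by(), mutate(), row_number()
library(tidytext)  # unnest_tokens()

poem1 <- "Tous les poteaux télégraphiques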
Viennent là-bas le long du quai
Sur son sein notre République
A mis ce bouquet de muguet"
poem2 <- "La sottise, l'erreur, le péché, la lésine,
Occupent nos esprits et travaillent nos corps,
Et nous alimentons nos aimables remords,
Comme les mendiants nourrissent leur vermine."
poems <- tibble(n_poem = 1:2, text_poem = c(poem1, poem2))
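# Split each poem into one row per line
poems <- poems %>%
  unnest_tokens(output = lines_poem, input = text_poem, token = "lines")
# Number the lines within each poem (the result stays grouped by n_poem)
poems <- poems %>%
  group_by(n_poem) %>%
  mutate(n_line = row_number())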
Tokenizing the lines into words then makes me lose all the other columns:
poems %>% unnest_tokens(output = words_poem, input = lines_poem)
Setting the drop option to FALSE behaves oddly too and brings back the raw text:
poems %>% unnest_tokens(output = words_poem, input = lines_poem, drop = FALSE)
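One visible difference from the tidy_books example is that poems is still grouped by n_poem when it reaches the second unnest_tokens call, whereas that example calls ungroup() before tokenizing. The following is a minimal sketch, assuming the grouping is what interferes (this is a guess, not a confirmed cause); poems_words is a name introduced here for illustration only.

```r
library(dplyr)
library(tibble)
library(tidytext)

poem1 <- "Tous les poteaux télégraphiques
Viennent là-bas le long du quai
Sur son sein notre République
A mis ce bouquet de muguet"
poem2 <- "La sottise, l'erreur, le péché, la lésine,
Occupent nos esprits et travaillent nos corps,
Et nous alimentons nos aimables remords,
Comme les mendiants nourrissent leur vermine."

poems <- tibble(n_poem = 1:2, text_poem = c(poem1, poem2)) %>%
  unnest_tokens(output = lines_poem, input = text_poem, token = "lines") %>%
  group_by(n_poem) %>%
  mutate(n_line = row_number())

# Assumption: dropping the grouping before tokenizing keeps the other columns
poems_words <- poems %>%
  ungroup() %>%
  unnest_tokens(output = words_poem, input = lines_poem)

names(poems_words)
```

If the assumption holds, names(poems_words) should include n_poem and n_line alongside words_poem.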