step_tokenize returns a vector of type tknlist. How can I get a rectangular form of it? I mean something like unnesting the tokens and adding them as columns of the tibble.

library(textrecipes)
library(modeldata)
data(tate_text)

tate_rec <- recipe(~., data = tate_text) %>%
  step_tokenize(medium)

# show_tokens() returns the tokens rather than a recipe, so call it separately
show_tokens(tate_rec, medium)

tate_obj <- tate_rec %>%
  prep()

dd <- bake(tate_obj, new_data = NULL, medium)
Nip

1 Answer

There is not a direct way to get a rectangle from a tknlist object in {textrecipes}. The object is mainly used internally in the package.

You can use the unexported textrecipes:::get_tokens() function to turn the tknlist object into a list of character tokens. But the package doesn't have any functions that let you unnest that object.

library(textrecipes)
library(modeldata)
data(tate_text)

tate_rec <- recipe(~., data = tate_text) %>%
  step_tokenize(medium)

tate_obj <- tate_rec %>%
  prep()

dd <- bake(tate_obj, new_data = NULL, medium, everything())

dd
#> # A tibble: 4,284 × 5
#>        medium     id artist             title                               year
#>     <tknlist>  <dbl> <fct>              <fct>                              <dbl>
#>  1 [8 tokens]  21926 Absalon            Proposals for a Habitat             1990
#>  2 [3 tokens]  20472 Auerbach, Frank    Michael                             1990
#>  3 [3 tokens]  20474 Auerbach, Frank    Geoffrey                            1990
#>  4 [3 tokens]  20473 Auerbach, Frank    Jake                                1990
#>  5 [4 tokens]  20513 Auerbach, Frank    To the Studios                      1990
#>  6 [4 tokens]  21389 Ayres, OBE Gillian Phaëthon                            1990
#>  7 [4 tokens] 121187 Barlow, Phyllida   Untitled                            1990
#>  8 [3 tokens]  19455 Baselitz, Georg    Green VIII                          1990
#>  9 [6 tokens]  20938 Beattie, Basil     Present Bound                       1990
#> 10 [3 tokens] 105941 Beuys, Joseph      Joseph Beuys: A Private Collectio…  1990
#> # … with 4,274 more rows

dd %>%
  mutate(medium = textrecipes:::get_tokens(medium))
#> # A tibble: 4,284 × 5
#>    medium        id artist             title                                year
#>    <list>     <dbl> <fct>              <fct>                               <dbl>
#>  1 <chr [8]>  21926 Absalon            Proposals for a Habitat              1990
#>  2 <chr [3]>  20472 Auerbach, Frank    Michael                              1990
#>  3 <chr [3]>  20474 Auerbach, Frank    Geoffrey                             1990
#>  4 <chr [3]>  20473 Auerbach, Frank    Jake                                 1990
#>  5 <chr [4]>  20513 Auerbach, Frank    To the Studios                       1990
#>  6 <chr [4]>  21389 Ayres, OBE Gillian Phaëthon                             1990
#>  7 <chr [4]> 121187 Barlow, Phyllida   Untitled                             1990
#>  8 <chr [3]>  19455 Baselitz, Georg    Green VIII                           1990
#>  9 <chr [6]>  20938 Beattie, Basil     Present Bound                        1990
#> 10 <chr [3]> 105941 Beuys, Joseph      Joseph Beuys: A Private Collection…  1990
#> # … with 4,274 more rows
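
Once medium is an ordinary list column of character vectors, it can probably be unnested with {tidyr}, which is outside {textrecipes}. The sketch below is not part of the original answer. It uses a small made-up tibble (tokens_tbl, with invented ids and tokens) as a stand-in for the result of get_tokens(), and assumes unnest_longer() and unnest_wider() are the shapes you want.

```r
library(tibble)
library(tidyr)

# Hypothetical stand-in for the output of textrecipes:::get_tokens():
# `medium` is a list column with one character vector of tokens per row.
tokens_tbl <- tibble(
  id = c(1, 2),
  medium = list(
    c("oil", "paint", "on", "canvas"),
    c("etching", "on", "paper")
  )
)

# Long form: one row per token, with the other columns repeated.
long_tbl <- unnest_longer(tokens_tbl, medium)

# Wide form: one column per token position (medium_1, medium_2, ...),
# padded with NA where a row has fewer tokens.
wide_tbl <- unnest_wider(tokens_tbl, medium, names_sep = "_")
```

The long form is usually easier to work with afterwards; the wide form creates as many columns as the longest token vector.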
EmilHvitfeldt