I am running into some problems doing text processing using dplyr and stringr functions (specifically str_split()). I think I am misunderstanding something very fundamental about how to use dplyr correctly when dealing with elements that are vectors/lists.

Here's a tibble, df...

library(tidyverse)

df <- tribble(
  ~item, ~phrase,
  "one",   "romeo and juliet",
  "two",   "laurel and hardy",
  "three", "apples and oranges and pears and peaches"
)

Now I create a new column, splitPhrase, by doing str_split() on one of the columns using "and" as the delimiter.

df <- df %>%
      mutate(splitPhrase = str_split(phrase,"and")) 

That seems to work, sort of. In RStudio I see this:

[screenshot: df shown in the RStudio data viewer]

In the console I see that my new column, splitPhrase, is actually a list column... but it looks correct in the RStudio display, right?

df
#> # A tibble: 3 x 3
#>   item  phrase                                   splitPhrase
#>   <chr> <chr>                                    <list>     
#> 1 one   romeo and juliet                         <chr [2]>  
#> 2 two   laurel and hardy                         <chr [2]>  
#> 3 three apples and oranges and pears and peaches <chr [4]>

What I ultimately want to do is to extract the last item of each splitPhrase. In other words, I'd like to get to this...

[screenshot: the desired result, with the last piece of each phrase in its own column]

The problem is I can't see how to just grab the last element in each splitPhrase. If it were just a vector, I could do something like this...

last(c("a", "b", "c"))
#> [1] "c"

But that doesn't work within the tibble, and neither do the other things that come to mind:

df <- df %>% 
       mutate(lastThing = last(splitPhrase))
# Error in mutate_impl(.data, dots) : 
#   Column `lastThing` must be length 3 (the number of rows) or one, not 4

df <- df %>% group_by(splitPhrase) %>%
  mutate(lastThing = last(splitPhrase))
# Error in grouped_df_impl(data, unname(vars), drop) : 
#  Column `splitPhrase` can't be used as a grouping variable because it's a list

So, I think I am "not getting" how to work with vectors stored inside the cells of a tibble column. It seems to have something to do with the fact that in my example the column is actually a list of vectors.

Is there a particular function that will help me out here, or a better way of getting to this?

Created on 2018-09-27 by the reprex package (v0.2.1)

Angelo

2 Answers

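The 'splitPhrase' column is a list, so we loop over it with map_chr() and take the last element of each vector. The pattern "\\s*and\\s*" also drops the spaces around "and", which splitting on plain "and" would leave attached to each piece.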

library(tidyverse)
df %>% 
   mutate(splitPhrase = str_split(phrase,"\\s*and\\s*"),
          Last = map_chr(splitPhrase, last)) %>%
   select(item, Last)
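
To see why `last(splitPhrase)` fails inside mutate() while `map_chr(splitPhrase, last)` works, it may help to look at the list outside the tibble. This is only a minimal sketch: `pieces` is a made-up stand-in for the splitPhrase column, and the `vapply()` line is a base R alternative rather than part of the answer above.

```r
library(dplyr)  # for last()
library(purrr)  # for map_chr()

# Hypothetical stand-in for the splitPhrase list column
pieces <- list(
  c("romeo", "juliet"),
  c("laurel", "hardy"),
  c("apples", "oranges", "pears", "peaches")
)

# last() on the list itself returns its final element, a vector of length 4,
# which matches the "must be length 3 ... not 4" error from mutate()
whole_list_last <- last(pieces)

# map_chr() calls last() once per element and returns one string per row
per_row_last <- map_chr(pieces, last)

# Base R equivalent that does not need purrr
per_row_last_base <- vapply(pieces, function(x) x[length(x)], character(1))
```

Here `per_row_last` has one entry per row, which is the shape mutate() expects for a new column.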

But, it can be done in many ways. Using separate_rows, expand the column, then get last element grouped by 'item'

df %>% 
  separate_rows(phrase,sep = " and ") %>% 
  group_by(item) %>% 
  summarise(Last = last(phrase))
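
As a rough illustration of what "expand the column" means here, the sketch below keeps the intermediate long table in its own variable. The two-row `small` tibble and the names `long` and `result` are made up for the example.

```r
library(tibble)
library(dplyr)
library(tidyr)

# Small made-up input with the same shape as df
small <- tibble(
  item   = c("one", "two"),
  phrase = c("romeo and juliet", "laurel and hardy")
)

# separate_rows() produces one row per piece, repeating item for each piece
long <- small %>%
  separate_rows(phrase, sep = " and ")

# Within each item, the last row holds the last piece of the phrase
result <- long %>%
  group_by(item) %>%
  summarise(Last = last(phrase))
```

Because the pieces become ordinary rows of a character column, no list column is involved at any point in this variant.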
akrun
  • Thanks! So a "List Column" doesn't display differently than a regular column in the RStudio preview tab? – Angelo Sep 27 '18 at 19:31
  • @Angelo If you look at the console output, it shows the type as `list` – akrun Sep 27 '18 at 19:34
  • I think my fundamental misunderstanding was with the concept of "List Columns". Your answer solved my problem! Also found this to be useful for background: https://www.rstudio.com/resources/videos/how-to-work-with-list-columns/ – Angelo Sep 27 '18 at 21:04
Haven't tested for efficiency, but we can also use regex to extract the string segment after the last "and":

With sub:

library(dplyr)
df %>%
  mutate(lastThing = sub("^.*and\\s", "", phrase)) %>%
  select(-phrase)

With str_extract:

library(stringr)
df %>%
  mutate(lastThing = str_extract(phrase, "(?<=and\\s)\\w+$")) %>%
  select(-phrase)

With extract:

library(tidyr)
df %>%
  extract(phrase, "lastThing", "^.*and\\s(\\w+)")

Output:

# A tibble: 3 x 2
  item  lastThing
  <chr> <chr>    
1 one   juliet   
2 two   hardy    
3 three peaches
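
One difference between these patterns that may be worth knowing shows up on a phrase with no "and" in it. The sketch below is only illustrative: the `phrases` vector, including the extra "solo" entry, is made up for the example.

```r
library(stringr)

# Hypothetical inputs; "solo" contains no "and" at all
phrases <- c(
  "romeo and juliet",
  "apples and oranges and pears and peaches",
  "solo"
)

# The greedy ".*" runs up to the LAST "and ", so only the final piece is left.
# When nothing matches, sub() returns the input unchanged.
via_sub <- sub("^.*and\\s", "", phrases)

# The lookbehind requires "and " directly before the final word.
# When nothing matches, str_extract() returns NA.
via_extract <- str_extract(phrases, "(?<=and\\s)\\w+$")
```

So `via_sub` keeps "solo" as it is, while `via_extract` gives NA for it. Which behaviour is preferable depends on how such rows should be treated.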
acylam
  • thanks! I normally do use regexes with str_extract(), and that's my first choice! But I had to make a minimal example to show the core problem I was having with "List Columns" – Angelo Sep 27 '18 at 21:00