R: using strsplit in a tibble column

Question

I have a tibble and in one of the columns are strings. They are names of instruments that respondents to a survey said they played. I would like to capture each instrument as it's own separate string. Values in this column range from one single string like guitar, to more complex answers: Sing, Drums/Percussion, Piano/Keyboard... etc I have tried something like this:

options <- strsplit(survey$instruments_list, "\\, | \\/ | ")

unfortunately the output still has the / character between several strings.

also as a final problem, one of the respondents answered with an incredibly long winded answer separated by multiple spaces and I just want the instrument, not their life story.

any suggestions will be greatly appreciated, thank you! edit: results of dput(head(survey))

structure(list(time_submitted = c("8/27/19 20:22", "8/29/19 12:15", 
"8/28/19 19:33", "8/29/19 16:25", "8/27/19 15:40", "8/27/19 22:59"
), pseudonym_generator = c("Fake rapper name generator", "Fake band name generator", 
"Fake band name generator", "Fake band name generator", "Fake band name generator", 
"Fake band name generator"), pseudonym = c("Lord Los Angeles", 
"Heroes War", "Puppets War", "West Magic", "Eller Angel", "Trace Stripes"
), sex = c("Male", "Male", "Male", "Male", "Male", "Male"), academic_major = c("Computer Science", 
"Computer Science", "Math", "Computer Science", "Computer Science", 
"Computer Science"), academic_level = c("Senior", "Junior", "Senior", 
"Junior", "Senior", "Senior"), year_born = c(1994, 1997, 1996, 
1999, 1998, 1986), instrument_list = c("Rap", "Guitar", "Guitar", 
"Trumpet", "Piano/Keyboards, Ukulele", NA), favorite_song_artist = c("40 crew", 
"Arctic Monkeys", "Avatar", "Ben Folds", "blink-182", "brian jonestown massacre / sarabeth tucek"
), favorite_song = c("Not Enough", "Arabella", "The Eagle Has Landed", 
"Still", "She's Out Of Her Mind", "Seer"), favorite_song_link = c("https://www.youtube.com/watch?v=uITuGZKljgQ", 
"https://www.youtube.com/watch?v=Jn6-TItCazo", "https://www.youtube.com/watch?v=4p6GWewmTYQ", 
"https://www.youtube.com/watch?v=ShBzUK4rnI8", "https://www.youtube.com/watch?v=krpm0v_486k", 
"https://youtu.be/C-XT7DZsNP8")), class = c("tbl_df", "tbl", 
"data.frame"), row.names = c(NA, -6L))

I'm sorry, I'm not sure what you are asking. I am pretty new to R. — Jay Wehrman, Oct 06 '19 at 01:00
so you want to `instrument_list` into different columns or rows ? — Ronak Shah, Oct 06 '19 at 01:21
sorry I should have declared my end goal. I am going to make a new tibble, with the columns: pseudonym and then a column for each instrument listed in instruments_list. So I want all the instruments given in instruments_list to have their own column and then an observation of either plays or does not play, something like that, for each pseudonym. — Jay Wehrman, Oct 06 '19 at 01:24
Do you mean something like `strsplit(survey$instrument_list, "[,/]")`? — r2evans, Oct 06 '19 at 01:30

r2evans · Accepted Answer · 2019-10-11T17:31:24.857

How about this:

library(dplyr)
library(tidyr)
survey %>%
  transmute(pseudonym, inst = strsplit(instrument_list, "[,/]")) %>%
  filter(!is.na(inst)) %>%
  unnest() %>%
  mutate(inst = trimws(inst), plays = TRUE) %>%
  spread(inst, plays) %>%
  mutate_at(vars(-pseudonym), Negate(is.na))
# # A tibble: 5 x 7
#   pseudonym        Guitar Keyboards Piano Rap   Trumpet Ukulele
#   <chr>            <lgl>  <lgl>     <lgl> <lgl> <lgl>   <lgl>  
# 1 Eller Angel      FALSE  TRUE      TRUE  FALSE FALSE   TRUE   
# 2 Heroes War       TRUE   FALSE     FALSE FALSE FALSE   FALSE  
# 3 Lord Los Angeles FALSE  FALSE     FALSE TRUE  FALSE   FALSE  
# 4 Puppets War      TRUE   FALSE     FALSE FALSE FALSE   FALSE  
# 5 West Magic       FALSE  FALSE     FALSE FALSE TRUE    FALSE

It's not hard to combine multiple instruments into single categories. I'll adapt one of your instruments to demonstrate.

One method is with case_when, perhaps the more direct/literal of the two methods:

survey %>%
  mutate(instrument_list = if_else(grepl("Lord", pseudonym), "Electric Guitar", instrument_list)) %>%
  transmute(pseudonym, inst = strsplit(instrument_list, "[,/]")) %>%
  filter(!is.na(inst)) %>%
  unnest() %>%
  mutate(inst = trimws(inst), plays = TRUE) %>%
  spread(inst, plays) %>%
  mutate_at(vars(-pseudonym), Negate(is.na))
# # A tibble: 5 x 7
#   pseudonym        `Electric Guitar` Guitar Keyboards Piano Trumpet Ukulele
#   <chr>            <lgl>             <lgl>  <lgl>     <lgl> <lgl>   <lgl>  
# 1 Eller Angel      FALSE             FALSE  TRUE      TRUE  FALSE   TRUE   
# 2 Heroes War       FALSE             TRUE   FALSE     FALSE FALSE   FALSE  
# 3 Lord Los Angeles TRUE              FALSE  FALSE     FALSE FALSE   FALSE  
# 4 Puppets War      FALSE             TRUE   FALSE     FALSE FALSE   FALSE  
# 5 West Magic       FALSE             FALSE  FALSE     FALSE TRUE    FALSE  

survey %>%
  mutate(instrument_list = if_else(grepl("Lord", pseudonym), "Electric Guitar", instrument_list)) %>%
  transmute(pseudonym, inst = strsplit(instrument_list, "[,/]")) %>%
  filter(!is.na(inst)) %>%
  unnest() %>%
  mutate(
    inst = case_when(
      grepl("\\bPiano\\b", inst, ignore.case = TRUE) ~ "Piano",
      grepl("\\bUkelete\\b", inst, ignore.case = TRUE) ~ "Ukelele",
      grepl("\\bGuitar\\b", inst, ignore.case = TRUE) ~ "Guitar",
      TRUE ~ trimws(inst)),
    plays = TRUE,
  ) %>%
  spread(inst, plays) %>%
  mutate_at(vars(-pseudonym), Negate(is.na))
# # A tibble: 5 x 6
#   pseudonym        Guitar Keyboards Piano Trumpet Ukulele
#   <chr>            <lgl>  <lgl>     <lgl> <lgl>   <lgl>  
# 1 Eller Angel      FALSE  TRUE      TRUE  FALSE   TRUE   
# 2 Heroes War       TRUE   FALSE     FALSE FALSE   FALSE  
# 3 Lord Los Angeles TRUE   FALSE     FALSE FALSE   FALSE  
# 4 Puppets War      TRUE   FALSE     FALSE FALSE   FALSE  
# 5 West Magic       FALSE  FALSE     FALSE TRUE    FALSE

Another method (if you have many more) is to merge/join in a frame. One advantage to this is that it can be very specific and include very-different instruments (where regex might be more than you want to handle). One disadvantage is that it can be too specific ... it does not join in mis-spellings or case differences, for instance.

gen_inst <- tibble::tribble(
  ~inst, ~newinst
 ,"Electric Guitar", "Guitar"
 ,"Electric Bass"  , "Guitar"
 ,"Electric Piano" , "Piano"
 ,"Pipe Organ"     , "Piano"
)

survey %>%
  mutate(instrument_list = if_else(grepl("Lord", pseudonym), "Electric Guitar", instrument_list)) %>%
  transmute(pseudonym, inst = strsplit(instrument_list, "[,/]")) %>%
  filter(!is.na(inst)) %>%
  unnest() %>%
  left_join(gen_inst, by = "inst") %>%
  mutate(
    inst = if_else(is.na(newinst), trimws(inst), newinst),
    plays = TRUE
  ) %>%
  select(-newinst) %>%
  spread(inst, plays) %>%
  mutate_at(vars(-pseudonym), Negate(is.na))
# # A tibble: 5 x 6
#   pseudonym        Guitar Keyboards Piano Trumpet Ukulele
#   <chr>            <lgl>  <lgl>     <lgl> <lgl>   <lgl>  
# 1 Eller Angel      FALSE  TRUE      TRUE  FALSE   TRUE   
# 2 Heroes War       TRUE   FALSE     FALSE FALSE   FALSE  
# 3 Lord Los Angeles TRUE   FALSE     FALSE FALSE   FALSE  
# 4 Puppets War      TRUE   FALSE     FALSE FALSE   FALSE  
# 5 West Magic       FALSE  FALSE     FALSE TRUE    FALSE

thank you! I really liked this one as it included some functions I had seen before in a previous statistics class. — Jay Wehrman, Oct 06 '19 at 05:32
if I could ask on more question, I have two columns for ukulele (Ukulele and Elementary Ukulele) and piano (piano and a very long answer involving piano) I would like to merge those two columns for Ukulele into one Ukulele column, same regarding piano. I have tried unite and unite_ but neither compiled. thank you — Jay Wehrman, Oct 09 '19 at 04:07
In the `mutate` after `unnest`, try something like `inst = if_else(grepl("Piano", inst), "Piano", inst)` and another for ukelele. — r2evans, Oct 09 '19 at 13:23

Ronak Shah · Answer 2 · 2019-10-06T01:50:04.017

We can also use cSplit_e from splitstackshape

output <- splitstackshape::cSplit_e(survey, "instrument_list", type = "character", 
                 fill = 0, sep=",|/", fixed = FALSE)

output[12:17]
#  instrument_list_Guitar instrument_list_Keyboards instrument_list_Piano
#1                      0                         0                     0
#2                      1                         0                     0
#3                      1                         0                     0
#4                      0                         0                     0
#5                      0                         1                     1
#6                      0                         0                     0

#  instrument_list_Rap instrument_list_Trumpet instrument_list_Ukulele
#1                   1                       0                       0
#2                   0                       0                       0
#3                   0                       0                       0
#4                   0                       1                       0
#5                   0                       0                       1
#6                   0                       0                       0

Here 1 in the column represents the instrument is played and 0 means it is not played.

these are both great techniques!! thank you so much!! – Jay Wehrman Oct 06 '19 at 02:15 — Jay Wehrman, Oct 06 '19 at 02:15

score 0 · Answer 3 · answered Oct 06 '19 at 15:59

Not wildly different from the other answers, but here's one using some tidyr conveniences. separate_rows splits strings and unnests them in one call; if you include an optional \\s in the regex, spaces in ", " will be included in the separator so you can skip trimming whitespace. Adding a dummy variable gives a value to populate the instrument columns, and the NAs get filled in with 0.

library(dplyr)
library(tidyr)

survey_wide <- survey %>%
  select(pseudonym, instrument_list) %>%
  separate_rows(instrument_list, sep = "(\\,|\\/)\\s?") %>%
  filter(!is.na(instrument_list)) %>%
  mutate(dummy = 1) %>%
  spread(key = instrument_list, value = dummy, fill = 0) 

survey_wide
#> # A tibble: 5 x 7
#>   pseudonym        Guitar Keyboards Piano   Rap Trumpet Ukulele
#>   <chr>             <dbl>     <dbl> <dbl> <dbl>   <dbl>   <dbl>
#> 1 Eller Angel           0         1     1     0       0       1
#> 2 Heroes War            1         0     0     0       0       0
#> 3 Lord Los Angeles      0         0     0     1       0       0
#> 4 Puppets War           1         0     0     0       0       0
#> 5 West Magic            0         0     0     0       1       0

An additional step if you need booleans instead of numbers:

survey_wide %>%
  mutate_at(vars(-pseudonym), as.logical)
#> # A tibble: 5 x 7
#>   pseudonym        Guitar Keyboards Piano Rap   Trumpet Ukulele
#>   <chr>            <lgl>  <lgl>     <lgl> <lgl> <lgl>   <lgl>  
#> 1 Eller Angel      FALSE  TRUE      TRUE  FALSE FALSE   TRUE   
#> 2 Heroes War       TRUE   FALSE     FALSE FALSE FALSE   FALSE  
#> 3 Lord Los Angeles FALSE  FALSE     FALSE TRUE  FALSE   FALSE  
#> 4 Puppets War      TRUE   FALSE     FALSE FALSE FALSE   FALSE  
#> 5 West Magic       FALSE  FALSE     FALSE FALSE TRUE    FALSE

R: using strsplit in a tibble column

3 Answers3