String splitting a dataframe with a vector as the pattern in R

Question

I have a dataframe that consists of multiple rows, and I would like to split every row into two components based off of elements of a vector (essentially run strsplit with a vector as the 'pattern') in R.

The dataframe (only one column) looks something like this:

     [,1]                
[1,] "apple please fuji" 
[2,] "pear help name"    
[3,] "banana me mango"

Whereas my pattern vector could look like this: v <- c("please", "help", "me").

If possible, I would like my end output to be:

  df$name             df$part1  df$split  df$part2   
 "apple please fuji" "apple"    "please"  "fuji" 
 "pear help name"    "pear"     "help"    "name" 
 "banana me mango"   "banana"   "me"      "mango"

I would appreciate any help with the in-between step of isolating the components based on a vector, but if there is an even easier way to put the result into a dataframe, that would be great! Thank you so much!

If first row was `"red apple please fuji" `, would result be `c("red apple", "please", "fuji")` ? — zx8754, Oct 05 '17 at 10:32
@zx8754, yup, that's what I'm hoping for! I would just like to isolate three categories (regardless of number of words): before the dividing string, the dividing string, and after the dividing string. Thanks! — maria, Oct 06 '17 at 03:27

lmo · Accepted Answer · 2017-10-05T13:10:46.040

Here are two methods in base R.

Start with a character vector:

text <- c("apple please fuji", "pear help name", "banana me mango")

Also define the desired variable names, for convenience:

varNames <- c("name", "part1", "split", "part2")

using regexec and regmatches

You can use regular expressions with the regmatches / regexec combination to construct this dataset.

First, build a regular expression from v with paste.

myRegex <- paste0("^(.*) +(", paste(v, collapse="|"), ") +(.*)$")
myRegex
[1] "^(.*) +(please|help|me) +(.*)$"

setNames(do.call(rbind.data.frame, regmatches(text, regexec(myRegex, text))), varNames)

This returns

               name  part1  split part2
1 apple please fuji  apple please  fuji
2    pear help name   pear   help  name
3   banana me mango banana     me mango
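To check the multi-word case raised in the comments under the question ("red apple please fuji"), here is a minimal sketch of the same regex approach. The first input row is changed for illustration, and I bind the matches with rbind into a character matrix before converting, which is a small variation on the rbind.data.frame call above rather than the answer's exact code.

```r
# Pattern vector from the question; first row changed to a multi-word example
v <- c("please", "help", "me")
text <- c("red apple please fuji", "pear help name", "banana me mango")

# Same construction as above: three capture groups separated by spaces
myRegex <- paste0("^(.*) +(", paste(v, collapse = "|"), ") +(.*)$")

# Each list element holds the full match followed by the three groups
m <- regmatches(text, regexec(myRegex, text))

# Bind into a character matrix, then convert and name the columns
res <- as.data.frame(do.call(rbind, m), stringsAsFactors = FALSE)
names(res) <- c("name", "part1", "split", "part2")
res
```

This sketch assumes every row contains one of the words in v surrounded by spaces. A row without a match yields an empty element in m, so the rows would no longer line up with text.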

using strsplit and do.call

First, split each element by v

tmp <- do.call(strsplit, list(text, split=v))
tmp
[[1]]
[1] "apple " " fuji" 

[[2]]
[1] "pear " " name"

[[3]]
[1] "banana " " mango"

Now, rbind.data.frame these, which returns a data.frame, cbind the name and split variables, reorder the columns, and then add names with setNames.

setNames(cbind(text, do.call(rbind.data.frame, tmp), v)[c(1, 2, 4, 3)], varNames)

This returns

               name   part1  split  part2
1 apple please fuji  apple  please   fuji
2    pear help name   pear    help   name
3   banana me mango banana      me  mango
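Note that the pieces returned by strsplit keep the spaces around the separator ("apple " and " fuji"). If that matters, a minimal sketch along the following lines trims them with trimws. It assumes, as above, that v has one entry per element of text and that each separator occurs exactly once in its row; fixed = TRUE is my addition, so the entries of v are treated as literal strings.

```r
v <- c("please", "help", "me")
text <- c("apple please fuji", "pear help name", "banana me mango")

# Split element i of text on element i of v (split is recycled along x)
tmp <- strsplit(text, split = v, fixed = TRUE)

# Bind the pieces into a two-column character matrix
parts <- do.call(rbind, tmp)

# Assemble the result, trimming the whitespace left around the separator
res <- data.frame(name  = text,
                  part1 = trimws(parts[, 1]),
                  split = v,
                  part2 = trimws(parts[, 2]),
                  stringsAsFactors = FALSE)
res
```

If a separator appears more than once in a row, strsplit returns more than two pieces for that row and the rbind step no longer gives a clean two-column matrix.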

markdly · Answer 2 · 2017-10-05T12:35:50.753

This solution assumes the number of elements in v is equal to the number of rows in the dataframe. You can use separate from the tidyr package to create part1 and part2.

library(tidyverse)
df <- tibble(name = c("apple please fuji", "pear help name", "banana me mango"))
v <- c("please", "help", "me")

df %>% 
  separate(name, c("part1", "part2"), v, remove = FALSE) %>%
  add_column(split = v, .before = "part2")
#> # A tibble: 3 x 4
#>                name   part1  split  part2
#>               <chr>   <chr>  <chr>  <chr>
#> 1 apple please fuji  apple  please   fuji
#> 2    pear help name   pear    help   name
#> 3   banana me mango banana      me  mango

If you want to try and split each row using any element in v then you could try pasting v into a single pattern first before separating. I think something like this should work.

library(tidyverse)
library(stringr)
p <- paste0("\\b(?:", paste(v, collapse = "|"), ")\\b")
df %>% 
  separate(name, c("part1", "part2"), p, remove = FALSE) %>%
  mutate(split = str_extract(name, p)) %>%
  select(name, part1, split, part2)
#> # A tibble: 3 x 4
#>                name   part1  split  part2
#>               <chr>   <chr>  <chr>  <chr>
#> 1 apple please fuji  apple  please   fuji
#> 2    pear help name   pear    help   name
#> 3   banana me mango banana      me  mango
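For comparison, the same "match any element of v" idea can be sketched in base R with regexpr and substr. This is only an illustration under the assumption that the first match in each row is the one you want; the variable names (pos, len, found, hit) are mine, not part of the answer above.

```r
v <- c("please", "help", "me")
name <- c("apple please fuji", "pear help name", "banana me mango")

# One alternation pattern with word boundaries, built as in the answer above
p <- paste0("\\b(?:", paste(v, collapse = "|"), ")\\b")

# Start position and length of the first match in each string (-1 if none)
m <- regexpr(p, name, perl = TRUE)
pos <- as.integer(m)              # as.integer drops the match attributes
len <- attr(m, "match.length")
found <- pos > 0

# Cut out the separator and the text before and after it; NA where no match
hit   <- ifelse(found, substr(name, pos, pos + len - 1), NA_character_)
part1 <- ifelse(found, trimws(substr(name, 1, pos - 1)), NA_character_)
part2 <- ifelse(found, trimws(substr(name, pos + len, nchar(name))), NA_character_)

res <- data.frame(name, part1, split = hit, part2, stringsAsFactors = FALSE)
res
```

The word boundaries matter here: without them, "me" would also match inside "name".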

ngamita · Answer 3 · 2017-10-05T11:49:06.690

# Create the df
name <- c("apple please fuji","pear help name","banana me mango")

# as.data.frame
df <- as.data.frame(name, stringsAsFactors = FALSE)
# Split every row on spaces once, outside the loop.
words <- strsplit(df$name, " ")
# Initialize a data frame of missing values to fill (columns V1, V2, V3).
df_n <- as.data.frame(matrix(NA_character_, nrow = nrow(df), ncol = 3))
# Loop through the rows of the df and the three words in each row.
for(i in seq_len(nrow(df))){
  for(j in 1:3){
    df_n[i, j] <- words[[i]][j]
  }
}
# Assign the columns of the new df (df_n) to the original df.
df$part1 <- df_n$V1
df$part2 <- df_n$V2
df$part3 <- df_n$V3

print(df)

edited Oct 05 '17 at 11:49

answered Oct 05 '17 at 11:35

Thanks for this! For loops are quite slow, and my dataframe size is quite large, so unfortunately these won't work for me. Do you have any ideas on how to vectorize this, perhaps using apply? Thank you! – maria Oct 06 '17 at 01:21
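On the vectorization question: strsplit already works on the whole column at once, so the loop can be dropped. A minimal sketch, assuming (like the loop version) that every row consists of exactly three space-separated words:

```r
name <- c("apple please fuji", "pear help name", "banana me mango")
df <- data.frame(name, stringsAsFactors = FALSE)

# strsplit is vectorized over its input, so no explicit loop is needed
pieces <- strsplit(df$name, " ", fixed = TRUE)

# Bind the per-row pieces into a character matrix (one row per input row)
parts <- do.call(rbind, pieces)

# Assign the three columns back to the original df
df$part1 <- parts[, 1]
df$part2 <- parts[, 2]
df$part3 <- parts[, 3]
print(df)
```

For rows where the parts before or after the separator contain several words, splitting on spaces is not enough, and a pattern built from v is the safer route.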
