Split string according to ambiguous delimiter in R

Question

I have pairs of strings stored in a data frame:

df <- data.frame(str = c("L_V1_ROI-L_MST_ROI",
                         "L_V6_ROI-L_V2_ROI",
                         "L_V3_ROI-L_V4_ROI",
                         "L_V8_ROI-L_4_ROI",
                         "L_p9-46v_ROI-L_a9-46v_ROI"))

Each pair is separated by a - symbol, except the last pair, which contains three - symbols and should be split into the substrings L_p9-46v_ROI and L_a9-46v_ROI.

The task is to split these pairs into substrings at the separator. To do this I simply use:

library(tidyr)
## df is supplied through the pipe, so it is not passed again as data =
df %>% separate(col = str, into = c("str1", "str2"), sep = "-")

which gives the following result:

      str1      str2
1 L_V1_ROI L_MST_ROI
2 L_V6_ROI  L_V2_ROI
3 L_V3_ROI  L_V4_ROI
4 L_V8_ROI   L_4_ROI
5     L_p9   46v_ROI
Warning message:
Too many values at 1 locations: 5

As expected, the problem lies in the 5th pair which has more than one - symbol.

Question: what is the regex to match the proper separator?

My partial solution is pasted below, but I hope there is a more elegant one.

library(stringr)

my_split <- function(string, pattern) {
  ## Match start and end position of the "_ROI-"
  position <- str_locate(string = string, pattern = pattern)
  start <- position[1]
  end <- position[2]
  ## Extract substrings (start + 3 keeps the "_ROI" suffix of the first part)
  substring1 <- substr(string, 1, start + 3)
  substring2 <- substr(string, end + 1, nchar(string))
  return(list(substring1, substring2))
}

## Toy example
my_str <- "L_p9-46v_ROI-L_a9-46v_ROI"
my_split(string = my_str, pattern = "_ROI-")
[[1]]
[1] "L_p9-46v_ROI"

[[2]]
[1] "L_a9-46v_ROI"
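
The same idea can be applied to the whole column at once. The following is only a sketch in base R, assuming every string contains "_ROI-" exactly where the pair should be split; the function name split_on_roi is made up for illustration.

```r
## Hypothetical helper: split each string at the first "_ROI-".
## Assumes every element of x contains "_ROI-".
split_on_roi <- function(x) {
  ## Start position of the literal "_ROI-" in each string
  pos <- as.integer(regexpr("_ROI-", x, fixed = TRUE))
  data.frame(str1 = substr(x, 1, pos + 3),        # up to and including "_ROI"
             str2 = substr(x, pos + 5, nchar(x)), # everything after the "-"
             stringsAsFactors = FALSE)
}

pairs <- c("L_V8_ROI-L_4_ROI", "L_p9-46v_ROI-L_a9-46v_ROI")
res <- split_on_roi(pairs)
print(res)
```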

`strsplit(as.character(df$str), "(?<=ROI)-", perl=TRUE)` should do (using lookbehind: splitting on `-` only if preceded by `ROI`) — Cath, Aug 25 '17 at 08:28
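
To illustrate the comment above, here is a minimal base R sketch that applies the lookbehind split to the example data and binds the pieces into two columns. The column names str1 and str2 are taken from the question; the assumption is that every string splits into exactly two parts.

```r
df <- data.frame(str = c("L_V1_ROI-L_MST_ROI",
                         "L_V6_ROI-L_V2_ROI",
                         "L_V3_ROI-L_V4_ROI",
                         "L_V8_ROI-L_4_ROI",
                         "L_p9-46v_ROI-L_a9-46v_ROI"),
                 stringsAsFactors = FALSE)

## Split on "-" only when it is directly preceded by "ROI"
parts <- strsplit(df$str, "(?<=ROI)-", perl = TRUE)

## Assumes each element of parts has length 2
result <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
names(result) <- c("str1", "str2")
print(result)
```

The hyphens inside p9-46v and a9-46v are left alone because they are not preceded by "ROI".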
