I have pairs of strings stored in a data frame:
df <- data.frame(str = c("L_V1_ROI-L_MST_ROI",
"L_V6_ROI-L_V2_ROI",
"L_V3_ROI-L_V4_ROI",
"L_V8_ROI-L_4_ROI",
"L_p9-46v_ROI-L_a9-46v_ROI"))
Each pair is separated by a single - symbol. The last pair is the exception: it contains three - symbols, yet it should still be split into just two substrings, L_p9-46v_ROI and L_a9-46v_ROI.
The task is to split each pair into its two substrings at the separator. To do this, I simply use:
library(tidyr)
df %>% separate(col = str, into = c("str1", "str2"), sep = "-")
which gives the following result:
      str1      str2
1 L_V1_ROI L_MST_ROI
2 L_V6_ROI  L_V2_ROI
3 L_V3_ROI  L_V4_ROI
4 L_V8_ROI   L_4_ROI
5     L_p9   46v_ROI
Warning message:
Too many values at 1 locations: 5
As expected, the problem lies in the 5th pair, which contains more than one - symbol.
Question: what is the regex to match the proper separator?
My partial solution is pasted below, but I hope there is a more elegant one.
library(stringr)  # provides str_locate()

my_split <- function(string, pattern) {
  ## Match the start and end position of the first occurrence of "_ROI-"
  position <- str_locate(string = string, pattern = pattern)
  start <- position[1]
  end <- position[2]
  ## Extract substrings; start + 3 keeps the "_ROI" suffix of the first part
  substring1 <- substr(string, 1, start + 3)
  substring2 <- substr(string, end + 1, nchar(string))
  return(list(substring1, substring2))
}
## Toy example
my_str <- "L_p9-46v_ROI-L_a9-46v_ROI"
my_split(string = my_str, pattern = "_ROI-")
[[1]]
[1] "L_p9-46v_ROI"
[[2]]
[1] "L_a9-46v_ROI"
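
For comparison, here is a minimal sketch of the kind of regex I am hoping for. It assumes that the real separator is always the - that directly follows "_ROI", which holds for the five strings above but may not hold in general. The lookbehind (?<=_ROI) checks for that suffix without consuming it, so nothing has to be added back afterwards. The names x, parts and res are just placeholders for this illustration, and it uses base R strsplit() rather than tidyr.

```r
# Same five strings as in the data frame above
x <- c("L_V1_ROI-L_MST_ROI",
       "L_V6_ROI-L_V2_ROI",
       "L_V3_ROI-L_V4_ROI",
       "L_V8_ROI-L_4_ROI",
       "L_p9-46v_ROI-L_a9-46v_ROI")

# Assumption: the separator is the only "-" directly preceded by "_ROI".
# perl = TRUE is required for the lookbehind syntax in base R.
parts <- strsplit(x, "(?<=_ROI)-", perl = TRUE)

# Bind the pieces row-wise into a two-column data frame
res <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
names(res) <- c("str1", "str2")
res
```

Since separate() interprets a character sep as a regular expression, the same pattern can presumably be passed as sep = "(?<=_ROI)-", although I have not confirmed that every tidyr version accepts lookarounds there.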