R: strsplit based on two conditions, keeping the delimiter

Question

I am trying to split strings based on two different criteria: some should be split before "traction" and some before "ramasse". I looked up the regex syntax used by grepl but didn't really understand it.

A data frame called export has a column ref, which holds string values ending in either "traction" or "ramasse".

>export$ref
                        ref
[1] "62133130_074_traction"
[2]  "62156438_074_ramasse"
[3]  "62153874_070_ramasse"
[4] "62138861_074_traction"

I want to split the string values in the ref column into two columns:

                ref        R&T
[1] "62133130_074_" "traction"
[2] "62156438_074_"  "ramasse"
[3] "62153874_070_"  "ramasse"
[4] "62138861_074_" "traction"

What I tried (none of these worked):

strsplit(export$ref, c("traction", "ramasse"))
strsplit(export$ref, "\\_(?<=\\btraction)|\\_(?<=\\bramasse)", perl = TRUE)
strsplit(export$ref, "(?=['traction''ramasse'])", perl = TRUE)

Any help would be appreciated!

score 2 · Answer 1 · answered Jun 15 '18 at 08:59

Here's a different approach:

x <- export$ref  # assumption: x is the character column from the question
strsplit(x, "_(?=[^_]+$)", perl = TRUE)

[[1]]
[1] "62133130_074" "traction"    

[[2]]
[1] "62156438_074" "ramasse"     

[[3]]
[1] "62153874_070" "ramasse"     

[[4]]
[1] "62138861_074" "traction"

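This splits each string at an underscore ("_") that is followed only by non-underscore characters up to the end of the string, i.e. at the last underscore. The lookahead (?=...) is zero-width, so only the underscore itself is consumed and the trailing word is kept intact.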
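
Note that the split above drops the underscore, while the desired output in the question keeps it at the end of the first piece. If that matters, one possible way is sub() with two capture groups instead of strsplit(). This is only a minimal sketch: it assumes every value ends in "traction" or "ramasse", it redefines x with the values from the question so it runs on its own, and the column name RT is a stand-in I chose because "R&T" is not a syntactic R name.

```r
# Values copied from the question so the sketch is self-contained
x <- c("62133130_074_traction", "62156438_074_ramasse",
       "62153874_070_ramasse", "62138861_074_traction")

# Group 1: everything up to and including the last underscore
# Group 2: the trailing keyword
pattern <- "^(.*_)(traction|ramasse)$"

result <- data.frame(
  ref = sub(pattern, "\\1", x),   # keeps the trailing "_"
  RT  = sub(pattern, "\\2", x),   # "RT" is a hypothetical name standing in for "R&T"
  stringsAsFactors = FALSE
)
print(result)
```

A value that does not match the pattern is returned unchanged by sub() in both columns, so it may be worth checking the input with grepl(pattern, x) first.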

score 0 · Accepted Answer · answered Jun 15 '18 at 09:06

Here is another option using stringr::str_split:

library(stringr)
str_split(ref, pattern = "_(?=[A-Za-z]+)", simplify = TRUE)
#    [,1]           [,2]
#[1,] "62133130_074" "traction"
#[2,] "62156438_074" "ramasse"
#[3,] "62153874_070" "ramasse"
#[4,] "62138861_074" "traction"
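
If stringr is not available, roughly the same two-column result can be built in base R. The following is a sketch under a few assumptions: the "$" anchor in the pattern is my addition (it restricts the split to the underscore before the final run of letters), and the column names id and type are hypothetical names picked for illustration. The sample vector is repeated inside the block so it runs on its own.

```r
# Same sample values as in this answer
ref <- c(
    "62133130_074_traction",
    "62156438_074_ramasse",
    "62153874_070_ramasse",
    "62138861_074_traction")

# Split at the underscore that precedes the final run of letters,
# then stack the list of length-2 vectors into a 4 x 2 character matrix
parts <- do.call(rbind, strsplit(ref, "_(?=[A-Za-z]+$)", perl = TRUE))

# "id" and "type" are hypothetical column names chosen for this sketch
export_split <- data.frame(id = parts[, 1], type = parts[, 2],
                           stringsAsFactors = FALSE)
print(export_split)
```

do.call(rbind, ...) only gives a clean matrix when every string splits into the same number of pieces, which holds here because each value contains exactly one matching underscore.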

Sample data

ref <- c(
    "62133130_074_traction",
    "62156438_074_ramasse",
    "62153874_070_ramasse",
    "62138861_074_traction")
