R strsplit before ( and after ) keeping both delimiters

Question

I have a string that looks like the following:

x <- "01(01)121210(01)0001"

I want to split this into a vector so that I get the following:

[1] "0" "1" "(01)" "1" "2" "1" "2" "1" "0" "(01)" "0" "0" "0" "1"

The brackets ( ) could instead be [ ] or { }, and there can be two or more digits between them.

I've been trying to do this by separating on the brackets first:

unlist(strsplit(x, "(?<=[\\]\\)\\}])", perl = TRUE))
[1] "01(01)" "121210(01)" "0001"

or

unlist(strsplit(x, "(?<=[\\[\\(\\{])", perl = TRUE))
[1] "01(" "01)121210(" "01)0001"

but I can't find a way to combine the two together. Then, I was hoping to split the elements not containing the brackets.

I'd be really grateful if someone could help me out with this or knows of a more elegant way to do it.

Many thanks!

Avinash Raj · Answer 1 · 2014-08-06T12:32:06.943

4

Set perl = TRUE and split the input string on the pattern below.

(?<!\(|^)(?!\)|\d\)|$)

Written as an R string (with the backslashes doubled), the regex would be:

"(?<!\\(|^)(?!\\)|\\d\\)|$)"
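
As a minimal sketch, the pattern can be passed straight to strsplit using the x from the question. The names `pattern` and `parts` are illustrative only. Note that the `\\d\\)` lookahead guards just one digit before the closing bracket, so as far as I can tell this sketch assumes at most two digits between the parentheses.

```r
x <- "01(01)121210(01)0001"

# Split at every position that is NOT right after "(" or the start,
# and NOT right before ")", a digit followed by ")", or the end.
pattern <- "(?<!\\(|^)(?!\\)|\\d\\)|$)"

parts <- unlist(strsplit(x, pattern, perl = TRUE))
print(parts)
```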

edited Aug 06 '14 at 12:32

answered Aug 06 '14 at 12:25


Matthew Plourde · Accepted Answer · 2014-08-06T13:30:16.447

3

This is another way:

unlist(strsplit(x, '\\([^)]*\\)(*SKIP)(*F)|(?=)', perl = TRUE))
# [1] "0"    "1"    "(01)" "1"    "2"    "1"    "2"    "1"    "0"    "(01)" "0"    "0"    "0"    "1"

\\([^)]*\\) matches anything in parentheses. (*SKIP)(*F) then tells the regular expression engine to fail on that match and, having found it, not to re-test that part of the string against the alternative on the other side of the |. That alternative is (?=), an empty lookahead that matches the zero-width position between characters.
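
To see what the (*SKIP)(*F) branch contributes, here is a small sketch comparing the split with and without it. The variable names `without_skip` and `with_skip` are illustrative. With only (?=), every character should become its own element, brackets included; adding the skip branch keeps each parenthesised group in one piece.

```r
x <- "01(01)121210(01)0001"

# Only the empty lookahead: splits between all characters,
# so "(" , "0", "1", ")" end up as separate elements.
without_skip <- strsplit(x, "(?=)", perl = TRUE)[[1]]

# With the skip branch: parenthesised groups are consumed and skipped,
# so no split position is ever tried inside them.
with_skip <- strsplit(x, "\\([^)]*\\)(*SKIP)(*F)|(?=)", perl = TRUE)[[1]]

print(without_skip)
print(with_skip)
```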

edited Aug 06 '14 at 13:30

answered Aug 06 '14 at 12:44


I really liked this answer too, but it seems I can only select one. Really sorry about this. BTW, how would I modify this to use { } or [ ] ? – mamboSC4649 Aug 07 '14 at 08:09
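
One possible way to extend the pattern to [ ] and { }, offered as a sketch rather than as the answerer's own solution: put one alternative per bracket type in front of (*SKIP)(*F). The test string `x2` and the names `pattern` and `parts` are made up for illustration, and the sketch assumes brackets are never nested or mixed.

```r
# Hypothetical input mixing all three bracket types
x2 <- "01[01]12{345}6(78)9"

# One alternative per bracket pair; any of them is matched, then skipped.
pattern <- "(?:\\([^)]*\\)|\\[[^\\]]*\\]|\\{[^}]*\\})(*SKIP)(*F)|(?=)"

parts <- unlist(strsplit(x2, pattern, perl = TRUE))
print(parts)
```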

G. Grothendieck · Answer 3 · 2014-08-06T16:47:43.560

1

This can be done without zero-width lookahead/lookbehind expressions by using strapply from the gsubfn package. The regular expression matches either a single digit or everything from a ( up to the next ).

library(gsubfn)

strapply(x, "\\d|\\(.*?\\)", c, perl = TRUE)[[1]]

giving:

 [1] "0"    "1"    "(01)" "1"    "2"    "1"    "2"    "1"    "0"    "(01)"
[11] "0"    "0"    "0"    "1"

Note: In the example shown in the question the part inside (...) is always two digits. If that is always the case it can be simplified further to:

strapplyc(x, "\\d|\\(...")[[1]]

UPDATE Added note.

edited Aug 06 '14 at 16:47

answered Aug 06 '14 at 12:38


In this case why not use `gregexpr`? – Casimir et Hippolyte Aug 06 '14 at 12:41
Because of simplicity. – G. Grothendieck Aug 06 '14 at 12:42
`gregexpr` alone is not a solution. Obviously the solution shown here is simpler. – G. Grothendieck Aug 06 '14 at 12:45
I was asking about using the same pattern (obviously). In other words, something like this: `m <- gregexpr("\\d|\\(.*?\\)", x); regmatches(x, m)` – Casimir et Hippolyte Aug 06 '14 at 12:47
That involves extracting indexes and then applying a second operation on it - not exactly simple when it can be done in a single command. Also if you want to do something with each output component `c` can be replaced with an arbitrary function without much additional complexity. – G. Grothendieck Aug 06 '14 at 16:07
OK, in other words it lets you map over the result array in one shot. – Casimir et Hippolyte Aug 06 '14 at 16:45

Casimir et Hippolyte · Answer 4 · 2014-08-06T13:22:07.227

1

Another possible way:

unlist(strsplit(x, '(?!\\(?\\d*\\))', perl=T))

Shorter, but less efficient than Matthew Plourde's approach.

Or, along the lines of what G. Grothendieck wrote:

m <- gregexpr("\\d|\\([^)]*\\)", x)
regmatches(x, m)
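
A small runnable sketch of this base R approach, using the x from the question plus a made-up string `y` with a longer bracketed group. Since regmatches returns a list with one element per input string, `[[1]]` pulls out the character vector. The names `tokens` and `tokens_y` are illustrative.

```r
x <- "01(01)121210(01)0001"

# Extract matches instead of splitting: a single digit,
# or a "(" followed by anything up to the next ")".
m <- gregexpr("\\d|\\([^)]*\\)", x)
tokens <- regmatches(x, m)[[1]]
print(tokens)

# Hypothetical input with more than two digits inside the parentheses
y <- "0(0123)4"
tokens_y <- regmatches(y, gregexpr("\\d|\\([^)]*\\)", y))[[1]]
print(tokens_y)
```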

edited Aug 06 '14 at 13:22

answered Aug 06 '14 at 13:00

