R: regular expression lookaround(s) to grab what's between two patterns

Question

I have a vector with strings like:

x <- c('kjsdf_class-X1(z)20_sample-318TT1X.3', 'kjjwer_class-Z3(z)29_sample-318TT2X.4')

I want to use regular expressions to extract what lies between the substrings 'class-' and '_sample' (here 'X1(z)20' and 'Z3(z)29'), and thought lookaround constructs such as (?=...) and (?!...) would do it, but I cannot get them to work.

Sorry if this is similar to other SO questions (e.g. here or here).

dimitris_ps · Accepted Answer · 2015-08-06T09:22:24.460


This is a bit different than what you had in mind, but it will do the job.

gsub("(.*class-)|(.)|(_sample.*)", "\\2", x)

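The logic is the following. The pattern has three alternatives, each in its own group:

1) (.*class-) matches any characters up to and including class-

2) (.) matches any single character

3) (_sample.*) matches _sample and everything after it

The replacement \\2 keeps only what the second group matched, so the text matched by groups 1 and 3 is dropped.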

Or another version that may be easier to understand:

gsub("(.*class-)|(_sample.*)", "", x)

This matches any run of characters ending in class-, or the string _sample followed by any number of characters, and replaces each match with the empty string "".
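The same idea can also be split into two separate steps, which may make the logic easier to follow. This is a minimal sketch, assuming the x defined in the question; the names no_prefix and result are only illustrative.

```r
x <- c('kjsdf_class-X1(z)20_sample-318TT1X.3',
       'kjjwer_class-Z3(z)29_sample-318TT2X.4')

# Step 1: drop everything up to and including "class-"
no_prefix <- sub(".*class-", "", x)

# Step 2: drop "_sample" and everything after it
result <- sub("_sample.*", "", no_prefix)

result
#[1] "X1(z)20" "Z3(z)29"
```

Because each delimiter occurs only once per string here, sub is enough for each step; gsub would give the same result.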

edited Aug 06 '15 at 09:22

answered Aug 06 '15 at 09:13


Sure it does (I will have to understand lookaround another day, then). Could you briefly explain the regex pattern? – user3375672 Aug 06 '15 at 09:16

akrun · Answer 2 · 2015-08-06T09:48:42.907

We could use str_extract_all from the stringr package:

 library(stringr)
 unlist(str_extract_all(x, '(?<=class-)[^_]+(?=_sample)'))
 #[1] "X1(z)20" "Z3(z)29"

This should also work if there are multiple instances of the pattern within a string:

 x1 <- paste(x, x)
 str_extract_all(x1, '(?<=class-)[^_]+(?=_sample)')
 #[[1]]
 #[1] "X1(z)20" "X1(z)20"

 #[[2]]
 #[1] "Z3(z)29" "Z3(z)29"

Basically, we match the characters between the two lookarounds, (?<=class-) and (?=_sample). We extract one or more characters that are not _ (based on the example), preceded by class- and followed by _sample.
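If installing stringr is not an option, the same lookaround pattern should also work in base R, provided perl = TRUE is set, since the default regex engine does not support lookarounds. This is a minimal sketch, assuming the x from the question; pattern, first_match and all_matches are illustrative names.

```r
x <- c('kjsdf_class-X1(z)20_sample-318TT1X.3',
       'kjjwer_class-Z3(z)29_sample-318TT2X.4')

pattern <- "(?<=class-)[^_]+(?=_sample)"

# First match per string: regexpr() locates it, regmatches() extracts it
first_match <- regmatches(x, regexpr(pattern, x, perl = TRUE))
first_match
#[1] "X1(z)20" "Z3(z)29"

# All matches per string: gregexpr() returns one set of positions per element
x1 <- paste(x, x)
all_matches <- regmatches(x1, gregexpr(pattern, x1, perl = TRUE))
all_matches[[1]]
#[1] "X1(z)20" "X1(z)20"
```

Note that regmatches with regexpr silently drops strings that have no match, so the result can be shorter than the input.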

score 0 · Answer 3 · answered Aug 06 '15 at 11:10


gsub('.*-([^-]+)_.*','\\1',x)
[1] "X1(z)20" "Z3(z)29"
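Read piece by piece, the pattern above seems to work as follows: .*- greedily consumes everything up to a hyphen, ([^-]+) captures a run of non-hyphen characters, and _.* consumes an underscore and the rest of the string; \\1 then keeps only the captured group. It relies on backtracking to settle on the right hyphen and underscore, so a variant that names both delimiters explicitly may be easier to read. The sketch below assumes the x from the question; this variant and the name between are not from the original answer.

```r
x <- c('kjsdf_class-X1(z)20_sample-318TT1X.3',
       'kjjwer_class-Z3(z)29_sample-318TT2X.4')

# .*class-   : everything up to and including "class-"
# (.*?)      : capture as few characters as possible ...
# _sample.*  : ... up to "_sample", then the rest of the string
between <- sub(".*class-(.*?)_sample.*", "\\1", x, perl = TRUE)

between
#[1] "X1(z)20" "Z3(z)29"
```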

Shenglin Chen


This answer could really do with a more detailed explanation to go with it. – Sobrique Aug 06 '15 at 13:38
