I have a vector with strings like:

x <- c('kjsdf_class-X1(z)20_sample-318TT1X.3', 'kjjwer_class-Z3(z)29_sample-318TT2X.4')

I want to use regular expressions to extract what is between the substrings 'class-' and '_sample' (such as 'X1(z)20' and 'Z3(z)29' in x), and thought that lookaround regexes ((?=...), (?!...), and so on) would do it. I cannot get it to work, though!

Sorry if this is similar to other SO questions (e.g. here or here).

user3375672

3 Answers

This is a bit different than what you had in mind, but it will do the job.

gsub("(.*class-)|(.)|(_sample.*)", "\\2", x)

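The logic is the following. The pattern has 3 alternative "sets" (capture groups):

1) any characters .* ending in class-

2) any single character .

3) the string _sample followed by any characters .*

Every match is replaced with the contents of the second "set", \\2. A match of the first or third set leaves \\2 empty, so that text is deleted; a character matched by the second set is replaced with itself, so it is kept.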

Or another version that may be easier to understand:

gsub("(.*class-)|(_sample.*)", "", x)

Match any number of characters ending in class-, or the string _sample followed by any number of characters, and replace them with the empty string "".
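The same deletion can be sketched as two separate steps, which may make it easier to see what each half of the pattern removes. This is only an illustration using base R's sub(); the intermediate names no_prefix and result are made up for the example.

```r
x <- c('kjsdf_class-X1(z)20_sample-318TT1X.3',
       'kjjwer_class-Z3(z)29_sample-318TT2X.4')

# Step 1: drop everything up to and including "class-"
no_prefix <- sub(".*class-", "", x)

# Step 2: drop "_sample" and everything after it
result <- sub("_sample.*", "", no_prefix)

result
#[1] "X1(z)20" "Z3(z)29"
```

After step 1 the strings still carry the _sample... tail; step 2 removes it.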

dimitris_ps
  • Sure it does (I will have to understand lookaround another day then). Could you briefly explain the regex pattern? – user3375672 Aug 06 '15 at 09:16

We could use str_extract_all from the stringr package:

 library(stringr)
 unlist(str_extract_all(x, '(?<=class-)[^_]+(?=_sample)'))
 #[1] "X1(z)20" "Z3(z)29"

This should also work if there are multiple instances of the pattern within a string:

 x1 <- paste(x, x)
 str_extract_all(x1, '(?<=class-)[^_]+(?=_sample)')
 #[[1]]
 #[1] "X1(z)20" "X1(z)20"

 #[[2]]
 #[1] "Z3(z)29" "Z3(z)29"

Basically, we match the characters that sit between the two lookarounds, (?<=class-) and (?=_sample). We extract one or more characters that are not _ (based on the example), preceded by class- and followed by _sample.
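If adding a package is not an option, the same lookaround pattern can be sketched in base R with regmatches(). This assumes perl = TRUE, since lookarounds need the PCRE engine rather than R's default one; the variable names first_match and all_matches are only for illustration.

```r
x <- c('kjsdf_class-X1(z)20_sample-318TT1X.3',
       'kjjwer_class-Z3(z)29_sample-318TT2X.4')
pattern <- '(?<=class-)[^_]+(?=_sample)'

# First match per string: regexpr() locates it, regmatches() extracts it
first_match <- regmatches(x, regexpr(pattern, x, perl = TRUE))
first_match
#[1] "X1(z)20" "Z3(z)29"

# All matches per string: gregexpr() gives a list with one element per input
x1 <- paste(x, x)
all_matches <- regmatches(x1, gregexpr(pattern, x1, perl = TRUE))
```

Note that regmatches() with regexpr() silently drops strings that have no match, so the result can be shorter than the input.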

akrun
gsub('.*-([^-]+)_.*', '\\1', x)
#[1] "X1(z)20" "Z3(z)29"
Shenglin Chen