I have the following code:

input <- "1-FA-1-I2-1-I2-1-I2-1-EX-1-I2-1-I3-1-FA-1-I1-1-I2-1-TR-1-I1-1-I2-1-FA-1-I3-1-I1-1-FA-1-FA-1-NR-1-I3-1-I2-1-TR-1-I1-1-I2-1-I1-1-I2-1-FA-1-I2-1-I1-1-I3-1-FA-1-QU-1-I1-1-I2-1-I2-1-I2-1-NR-1-I2-1-I2-1-NR-1-I1-1-I2-1-I1-1-NR-1-I3-1-QU-1-I2-1-I3-1-QU-1-NR-1-I2-1-I1-1-NR-1-QU-1-QU-1-I2-1-I1-1-EX"

library(stringr)  # provides str_extract_all

innovation_patterns <- gsub(input, pattern = "-1-", replacement = "-")
innovation_patterns <- lapply(innovation_patterns, str_extract_all, '(?:I\\d-?)*I3(?:-?I\\d)*')

This outputs:

"I2-I3"    "I3-I1"    "I3-I2"    "I2-I1-I3" "I3"       "I2-I3" 

However, I only want to extract matches of the regex that immediately follow a specific string, e.g.:

only match the regex when it's preceded by the literal string FA-I2-I2-I2-EX.

For example, this holds for the first match of the regex, whereas the second match is preceded by FA-I1-I2-TR-I1-I2-FA.

The expected output has the same form as the output above, but contains only one of the 6 matches, because the match must be preceded by the specific literal string.

How can I modify this regex to achieve this purpose? I assume it needs to use a positive lookbehind to first identify the literal string, then execute the regex.

histelheim

5 Answers

I'm not sure I fully understand what you mean, but it seems you could use a positive lookbehind.

For instance:

(?<=a)b (positive lookbehind) matches the b (and only the b) in cab, but does not match bed or debt
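
As a minimal sketch of the lookbehind behaviour described above, the example can be checked in base R with perl = TRUE, which switches to the PCRE engine. The variable names (words, has_match, matches) are illustrative only and do not come from the question.

```r
# Illustrative test strings taken from the example above
words <- c("cab", "bed", "debt")

# TRUE only where a "b" is directly preceded by an "a"
has_match <- grepl("(?<=a)b", words, perl = TRUE)

# The lookbehind checks for the "a" without consuming it,
# so the extracted text is the "b" alone
matches <- regmatches(words, gregexpr("(?<=a)b", words, perl = TRUE))

print(has_match)
print(matches[[1]])
```

The same idea carries over to the question: the literal string would go inside the lookbehind and the existing pattern would follow it.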

imbalind
  • 1,182
  • 6
  • 13
2

Use (*SKIP)(*F)

innovation_patterns <- gsub(input, pattern = "-1-", replacement = "-")
innovation_patterns <- lapply(innovation_patterns, str_extract_all, perl('FA-I1-I2-TR-I1-I2-FA.*(*SKIP)(*F)|(?:I\\d-?)*I3(?:-?I\\d)*'))

The general syntax is:

 partIDontWant.*(*SKIP)(*F)|choose from the string which exists before partIDontWant
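
The (*SKIP)(*F) verbs are a PCRE feature, and stringr hands its patterns to the ICU engine, which does not provide them. A possible workaround, sketched here under that assumption, is to run the same pattern through base R with perl = TRUE. The string s is a shortened, already-collapsed sample of the question's input, and the names s, pattern, and kept are illustrative.

```r
# Shortened sample of the input after "-1-" has been replaced by "-" (assumption)
s <- "1-FA-I2-I2-I2-EX-I2-I3-FA-I1-I2-TR-I1-I2-FA-I3-I1-FA-FA-NR-I3-I2"

# Left alternative: consume the unwanted literal and everything after it, then discard it.
# Right alternative: the original pattern, which can now only match before that literal.
pattern <- "FA-I1-I2-TR-I1-I2-FA.*(*SKIP)(*F)|(?:I\\d-?)*I3(?:-?I\\d)*"

# perl = TRUE selects PCRE, which understands the backtracking verbs
kept <- regmatches(s, gregexpr(pattern, s, perl = TRUE))[[1]]
print(kept)
```

Note that this keeps every match that occurs before the unwanted literal, which is not quite the same as requiring a specific literal directly in front of the match.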


Avinash Raj
  • I can't get this to work in R: `Error in stri_extract_all_regex(string, pattern, simplify = simplify, : Syntax error in regexp pattern. (U_REGEX_RULE_SYNTAX)` – histelheim Aug 03 '15 at 13:43
  • I think it would be `perl('FA-I1-I2-TR-I1-I2-FA.*(*SKIP)(*F)|(?:I\\d-?)*I3(?:-?I\\d)*')` – Avinash Raj Aug 03 '15 at 13:45
  • I think `(*SKIP)(*F)` has some problems with `stri_extract_all` (`str_extract` uses the `stringi` library). The `perl(...` version still gives me an error. – akrun Aug 03 '15 at 14:05

There may be a more intuitive way, but I think this will do the job:

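library(stringr)

literal <- "FA-I2-I2-I2-EX"
innovation_patterns <- gsub(input, pattern = "-1-", replacement = "-")
# Split on the literal, then search each piece for the pattern
a <- lapply(strsplit(innovation_patterns, literal, fixed = TRUE)[[1]], str_extract_all, '(?:I\\d-?)*I3(?:-?I\\d)*')
# Pieces 2..n each follow an occurrence of the literal; keep the first match in each
# (the first match in a piece is not guaranteed to sit directly after the literal)
b <- lapply(seq_along(a)[-1], function(x) {
  a[[x]][[1]][1]
})

print(b)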
dimitris_ps

Here is another way you could go about this.

x <- "1-FA-1-I2-1-I2-1-I2-1-EX-1-I2-1-I3-1-FA-1-I1-1-I2-1-TR-1-I1-1-I2-1-FA-1-I3-1-I1-1-FA-1-FA-1-NR-1-I3-1-I2-1-TR-1-I1-1-I2-1-I1-1-I2-1-FA-1-I2-1-I1-1-I3-1-FA-1-QU-1-I1-1-I2-1-I2-1-I2-1-NR-1-I2-1-I2-1-NR-1-I1-1-I2-1-I1-1-NR-1-I3-1-QU-1-I2-1-I3-1-QU-1-NR-1-I2-1-I1-1-NR-1-QU-1-QU-1-I2-1-I1-1-EX"

CODE

substr <- 'FA-I2-I2-I2-EX'
regex <- paste0(substr, '-?((?:I\\d-?)*I3(?:-?I\\d)*)')
gsubfn::strapply(gsub('-1-', '-', x), regex, simplify = c)
## [1] "I2-I3"
hwnd

Here's how to implement it:

lapply(innovation_patterns, str_extract_all, '(?<=FA-I2-I2-I2-EX-?)(?:I\\d-?)*I3(?:-?I\\d)*');
## [[1]]
## [[1]][[1]]
## [1] "I2-I3"
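
The optional -? inside the lookbehind makes it variable-length, which stringr's ICU engine accepts but which some PCRE builds may reject. If base R is preferred, one possible alternative, sketched here rather than taken from the answer, is \K, which drops everything matched so far from the reported match. The string s is a shortened, already-collapsed sample of the question's input, and the names s, pattern, and after_literal are illustrative.

```r
# Shortened sample of the input after "-1-" has been replaced by "-" (assumption)
s <- "1-FA-I2-I2-I2-EX-I2-I3-FA-I1-I2-TR-I1-I2-FA-I3-I1-FA-FA-NR-I3-I2"

# Match the literal and an optional hyphen, then \K resets the match start
# so only the pattern after the literal is returned
pattern <- "FA-I2-I2-I2-EX-?\\K(?:I\\d-?)*I3(?:-?I\\d)*"

# perl = TRUE selects PCRE, which supports \K
after_literal <- regmatches(s, gregexpr(pattern, s, perl = TRUE))[[1]]
print(after_literal)
```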
bgoldst