I have a vector of strings that look like this:

G30(H).G3(M).G0(L).Replicate(1)

Iterating over c("H", "M", "L"), I would like to extract G30 (for "H"), G3 (for "M") and G0 (for "L").

My various attempts have left me confused: the regex101.com debugger, for example, indicates that (\w*)\(M\) works just fine, but transferring that pattern to R fails ...

balin
  • See the regex101.com attempt [here](https://regex101.com/r/cZ0sD2/66). – balin Aug 15 '17 at 11:51

5 Answers

Using the stringi package and the outer() function:

library(stringi)

strings <- c(
  "G30(H).G3(M).G0(L).Replicate(1)",
  "G5(M).G11(L).G6(H).Replicate(9)",
  "G10(M).G6(H).G8(M).Replicate(200)"  # No "L", repeated "M"
)
targets  <- c("H", "M", "L")
patterns <- paste0("\\w+(?=\\(", targets, "\\))")
matches  <- outer(strings, patterns, FUN = stri_extract_first_regex)
colnames(matches) <- targets
matches
#      H     M     L    
# [1,] "G30" "G3"  "G0" 
# [2,] "G6"  "G5"  "G11"
# [3,] "G6"  "G10" NA   

This ignores any instances of a target letter past the first, gives you an NA when the target is not found, and returns everything in a simple matrix. The regular expressions stored in patterns match a substring XX that is immediately followed by (Y), where Y is the target letter and XX is one or more word characters. Because (Y) sits inside a lookahead, it is not part of the extracted match.
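For readers without stringi, the same lookahead idea can be sketched in base R. This is only an illustration under assumptions: extract_code is a hypothetical helper name, and the pattern is the one built in the answer above. Lookaheads are only available in base R with perl = TRUE.

```r
# Sketch: extract the code in front of one target tag, base R only.
# `extract_code` is a hypothetical helper, not part of the answer above.
extract_code <- function(x, target) {
  # one or more word characters, followed by "(target)";
  # the lookahead keeps the tag itself out of the match
  pattern <- paste0("\\w+(?=\\(", target, "\\))")
  m <- regexpr(pattern, x, perl = TRUE)    # first match per string
  out <- rep(NA_character_, length(x))     # NA where the tag is absent
  out[m != -1] <- regmatches(x, m)         # regmatches() drops non-matches
  out
}

strings <- c(
  "G30(H).G3(M).G0(L).Replicate(1)",
  "G10(M).G6(H).G8(M).Replicate(200)"      # no "L", repeated "M"
)
# one column per target, one row per string
res <- sapply(c("H", "M", "L"), extract_code, x = strings)
```

As with the stringi version, only the first occurrence of a tag is used and a missing tag yields NA.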

Nathan Werth

I am pretty sure there are better solutions, but this works...

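jnk <- 'G30(H).G3(M).G0(L).Replicate(1)'
# One capture group per tag; assumes the tags always appear in the order H, M, L
pattern <- '([^\\(]+)\\(H\\)\\.([^\\(]+)\\(M\\)\\.([^\\(]+)\\(L\\)\\.Replicate\\(\\d+\\)'
# Note: sub() returns the input unchanged if the pattern does not match
H <- sub(pattern, '\\1', jnk)
M <- sub(pattern, '\\2', jnk)
L <- sub(pattern, '\\3', jnk)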
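One thing to watch with the sub() approach is that a non-matching string comes back unchanged rather than as NA. A possible alternative, sketched here under the same fixed-order assumption, is to pull all three groups in one call with regexec() and regmatches(). The names fixed_pattern and extract_fixed_order are made up for this illustration.

```r
# Sketch: all capture groups at once, NA when the string does not fit.
# `fixed_pattern` and `extract_fixed_order` are hypothetical names.
fixed_pattern <- "([^(]+)\\(H\\)\\.([^(]+)\\(M\\)\\.([^(]+)\\(L\\)\\.Replicate\\(\\d+\\)"

extract_fixed_order <- function(x) {
  # element 1 is the full match, elements 2..4 are the capture groups
  m <- regmatches(x, regexec(fixed_pattern, x))[[1]]
  if (length(m) == 0) {
    # no match: return NAs instead of the unchanged input
    return(c(H = NA_character_, M = NA_character_, L = NA_character_))
  }
  setNames(m[-1], c("H", "M", "L"))
}

codes   <- extract_fixed_order("G30(H).G3(M).G0(L).Replicate(1)")
nomatch <- extract_fixed_order("G5(M).G11(L).G6(H).Replicate(9)")
```

Here codes is a named character vector, and nomatch is all NA because its tags are not in the order H, M, L.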

EDIT:

Actually, I once came across a very nice function, parse.one, which lets you work with named capture groups, much as you would with regular expressions in Python...

Have a look at this:

parse.one <- function(res, result) {
  m <- do.call(rbind, lapply(seq_along(res), function(i) {
    if(result[i] == -1) return("")
    st <- attr(result, "capture.start")[i, ]
    substring(res[i], st, st + attr(result, "capture.length")[i, ] - 1)
  }))
  colnames(m) <- attr(result, "capture.names")
  m
}
jnk <- 'G30(H).G3(M).G0(L).Replicate(1)'
pattern <- '(?<H>[^\\(]+)\\(H\\)\\.(?<M>[^\\(]+)\\(M\\)\\.(?<L>[^\\(]+)\\(L\\)\\.Replicate\\(\\d+\\)'
parse.one(jnk, regexpr(pattern, jnk, perl=TRUE))

Result looks like this:

> parse.one(jnk, regexpr(pattern, jnk, perl=TRUE))
     H     M    L   
[1,] "G30" "G3" "G0"
drmariod

If the order is always the same, an alternative might be to split the strings. For instance:

library(stringr)  # provides str_split()

string <- "G30(H).G3(M).G0(L).Replicate(1)"
tmp <- str_split(string, "\\.")[[1]]  # split on literal dots
lapply(tmp[1:3], function(x) str_split(x, "\\(")[[1]][1])  # keep the text before "("
[[1]]
[1] "G30"

[[2]]
[1] "G3"

[[3]]
[1] "G0"
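The same splitting idea can be sketched without stringr and for several strings at once. This is an illustration only: it assumes, as above, that the first three dot-separated pieces are the ones of interest, and it returns codes by position, not by tag.

```r
# Sketch: positional extraction with base R strsplit().
strings <- c(
  "G30(H).G3(M).G0(L).Replicate(1)",
  "G5(M).G11(L).G6(H).Replicate(9)"
)
# fixed = TRUE treats "." as a literal dot, so no escaping is needed
parts <- strsplit(strings, ".", fixed = TRUE)
# for the first three pieces, drop everything from "(" onwards
codes <- t(sapply(parts, function(p) sub("\\(.*$", "", p[1:3])))
```

Each row of codes holds the three codes of one input string in their original order, so the second row reflects the order M, L, H of that string.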
coffeinjunky

If the codes (e.g., 'G30') preceding the tags (e.g., '(H).') can vary in letters or length, or the order of the tags in the string can change, you may want to try a more flexible solution based on regexpr().

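aa <- paste0("G30(H).G3(M).G0(L).Replicate(", 1:10, ")")
my.tags <- c("H", "M", "L")

extr.data <- lapply(my.tags, function(tag) {
  # the tag in parentheses, followed by a literal dot
  pat <- paste0("\\(", tag, "\\)\\.")
  pos <- regexpr(paste0("(^|\\.)([[:alnum:]])*", pat), aa)
  # cut the trailing "(", tag, ")" and "." off the end of the match;
  # nchar(tag) rather than length(tag), so longer tags also work
  out <- substr(aa, pos, pos + attr(pos, "match.length") - 4 - nchar(tag))
  gsub("(^\\.)", "", out)  # drop the leading dot, if any
})
names(extr.data) <- my.tags
extr.data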
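The offset arithmetic in an approach like this is easy to get wrong, so it may help to trace it for a single string and a single tag. The variable names below are chosen for this illustration only.

```r
# Sketch: trace the start/end arithmetic for one tag in one string.
x   <- "G30(H).G3(M).G0(L).Replicate(1)"
tag <- "M"

pos   <- regexpr(paste0("(^|\\.)[[:alnum:]]*\\(", tag, "\\)\\."), x)
start <- as.integer(pos)            # position of the "." before "G3"
len   <- attr(pos, "match.length")  # length of the match ".G3(M)."

# the match ends at start + len - 1; drop "(", tag, ")" and "." from it,
# which is 3 + nchar(tag) characters
end  <- start + len - 1 - (3 + nchar(tag))
code <- sub("^\\.", "", substr(x, start, end))  # strip the leading dot
```

The match starts at the dot in position 7 and is 7 characters long, so end is 9 and the extracted code is "G3".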
Damiano Fantini

I'm going to assume that both the functions (G...) and their inputs can vary. This does assume that your functions start with a G and that your input is always a single uppercase letter.

parse = function(arb){
  tmp = stringi::stri_extract_all_regex(arb,"G.*?\\([A-Z]\\)")[[1]]
  unlist(lapply(lapply(tmp,strsplit,"\\)|\\("),function(x){
    output = x[[1]][1]
    names(output) = x[[1]][2]
    return(output)
  }))
}

This first parses out all the G functions with their inputs. Then, each of those is split into its function part and its input part. This is then put into a character vector of functions named for their input.
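The two steps described above (find every function-with-input piece, then split each piece into code and tag) can also be sketched in base R. This is a loose illustration, not the function above: extract_tagged is a hypothetical name, and unlike the answer's pattern it accepts any word characters before the parenthesis instead of requiring a leading G.

```r
# Sketch: named vector of codes, base R only.
# `extract_tagged` is a hypothetical helper for a single input string.
extract_tagged <- function(x) {
  # step 1: every "code(TAG)" piece, TAG being one uppercase letter
  pieces <- regmatches(x, gregexpr("\\w+\\([A-Z]\\)", x))[[1]]
  # step 2: split each piece into the code and the tag
  codes <- sub("\\(.*$", "", pieces)               # text before "("
  tags  <- sub("^.*\\((.*)\\)$", "\\1", pieces)    # text inside "( )"
  setNames(codes, tags)
}

hml <- extract_tagged("G30(H).G3(M).G0(L).Replicate(1)")
lpk <- extract_tagged("G35(L).G31(P).G02(K).Replicate(1)")
```

The Replicate(1) piece is skipped in both cases because its input is a digit, not an uppercase letter.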

> parse("G30(H).G3(M).G0(L).Replicate(1)")
    H     M     L 
"G30"  "G3"  "G0" 

Or

> parse("G35(L).G31(P).G02(K).Replicate(1)")
    L     P     K 
"G35" "G31" "G02" 
Mark