Extract text from inner-most nested parentheses of string

Question

From the text string below, I am trying to extract a specific string subset.

string <- c("(Intercept)", "scale(AspectCos_30)", "scale(CanCov_500)", 
            "scale(DST50_30)", "scale(Ele_30)", "scale(NDVI_Tin_250)", "scale(Slope_500)", 
            "I(scale(Slope_500)^2)", "scale(SlopeVar_30)", "scale(CanCov_1000)", 
            "scale(NDVI_Tin_1000)", "scale(Slope_1000)", "I(scale(Slope_1000)^2)", 
            "scale(log(SlopeVar_30 + 0.001))", "scale(CanCov_30)", "scale(Slope_30)", 
            "I(scale(Slope_30)^2)")

A good result would return the central text without any special characters, as shown below.

Good <- c("Intercept", "AspectCos", "CanCov", "DST50", "Ele", "NDVI", "Slope", "Slope",
          "SlopeVar", "CanCov", "NDVI", "Slope", "Slope", "SlopeVar", "CanCov", "Slope", "Slope")

Preferably, however, the result would account for the ^2 and log associated with 'Slope' and 'SlopeVar', respectively. Specifically, every string containing ^2 would become 'SlopeSq' and every string containing log would become 'SlopeVarPs', as shown below.

Best <- c("Intercept", "AspectCos", "CanCov", "DST50", "Ele", "NDVI", "Slope", "SlopeSq",
          "SlopeVar", "CanCov", "NDVI", "Slope", "SlopeSq", "SlopeVarPs", "CanCov", "Slope", "SlopeSq")

I have a long, ugly, and inefficient code sequence that gets me nearly halfway to the Good result and would appreciate any suggestions.

amatsuo_net · Accepted Answer · 2017-06-16T16:34:09.397

3

As a not-so-efficient coder, I like to have a chain of multiple regex to achieve the outcome (what each line of regex does is commented in each line):

library(stringr)
library(dplyr)   # provides the %>% pipe
outcome <- string %>%
  str_replace_all(".*log\\((.*?)(_.+?)?\\).*", "\\1Ps") %>% # "log" entries: keep the name, append "Ps"
  str_replace_all(".*\\((.*?\\))", "\\1") %>%               # drop everything up to and including the last "("
  str_replace_all("(_\\d+)?\\)\\^2", "Sq") %>%              # turn "_<digits>)^2" into "Sq"
  str_replace_all("(_.+)?\\)?", "")                         # strip trailing leftovers (e.g. "_30" and ")")


Best <- c("Intercept", "AspectCos", "CanCov", "DST50", "Ele", "NDVI", "Slope", "SlopeSq",
          "SlopeVar", "CanCov", "NDVI", "Slope", "SlopeSq", "SlopeVarPs", "CanCov","Slope", "SlopeSq")
all(outcome == Best)
## TRUE
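
For readers without stringr or dplyr, the same four-step chain can be sketched in base R with gsub(). This is only a sketch: the function name clean_terms is hypothetical, and perl = TRUE is an assumption made so that the lazy quantifiers behave as they do in stringr.

```r
# Hypothetical helper (not part of the original answer): the same four
# replacements as above, written with base R only.
clean_terms <- function(x) {
  x <- gsub(".*log\\((.*?)(_.+?)?\\).*", "\\1Ps", x, perl = TRUE) # "log" entries
  x <- gsub(".*\\((.*?\\))", "\\1", x, perl = TRUE)               # drop text up to the last "("
  x <- gsub("(_\\d+)?\\)\\^2", "Sq", x, perl = TRUE)              # "_<digits>)^2" becomes "Sq"
  gsub("(_.+)?\\)?", "", x, perl = TRUE)                          # strip "_..." and ")"
}

examples <- c("(Intercept)", "scale(NDVI_Tin_250)", "I(scale(Slope_500)^2)",
              "scale(log(SlopeVar_30 + 0.001))")
result <- clean_terms(examples)
print(result)
```

The order of the steps matters: the log entry has to be handled first, because the later steps would otherwise strip the information needed to add the "Ps" suffix.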

edited Jun 16 '17 at 16:34

answered Jun 16 '17 at 16:23


Very much appreciated. Clear and informative! I also did not know you could use the piping operator with stringr. Cool. – B. Davis Jun 16 '17 at 16:31
Piping is actually coming from `dplyr`. I edited my answer. – amatsuo_net Jun 16 '17 at 16:33

Xu Shaoyang · Answer 2 · 2019-04-25T15:53:05.647

I think this can be achieved with the package stringr.

First, you want the "central text" within the innermost parentheses, so the regex below rules out any parenthesised text that itself contains parentheses. I keep "log" and "^2" in the match for later use.

library(stringr)

# Innermost "(...)" only, with an optional leading "log" and trailing "^2"
string_step <- str_extract(string,
                           "(log|)\\([^()]+\\)(\\^2|)")

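To see what this pattern keeps, the extraction can be reproduced on a few of the inputs. The sketch below uses base R's regexpr() and regmatches() in place of str_extract(), an assumption made only so the example runs without extra packages; the variable names are illustrative.

```r
# Same pattern as above; [^()]+ forbids any parenthesis inside the match,
# so only an innermost pair can satisfy it.
pattern <- "(log|)\\([^()]+\\)(\\^2|)"

examples <- c("(Intercept)", "I(scale(Slope_500)^2)",
              "scale(log(SlopeVar_30 + 0.001))")

# regexpr() locates the first match in each element; regmatches() extracts it.
innermost <- regmatches(examples, regexpr(pattern, examples, perl = TRUE))
print(innermost)
# "(Intercept)" "(Slope_500)^2" "log(SlopeVar_30 + 0.001)"
```
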
Next, I noticed that everything after an underscore is dropped and only the leading run of letters and digits is kept. Unlike \w (written \\w in an R string), which also matches the underscore, "[:alnum:]+" matches only letters and digits (equivalent to "[A-Za-z0-9]+" for ASCII input), so it is used here.

GoodMy <-
  str_extract(str_replace_all(string_step, "log|\\(|\\)|\\^2", ""),
              "[:alnum:]+")

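The difference between \w and the alnum class described above can be checked directly. This is a base R sketch (base R spells the class [[:alnum:]]), and the input is a single term chosen for illustration.

```r
term <- "AspectCos_30"

# \w treats the underscore as a word character, so the suffix survives.
with_w <- regmatches(term, regexpr("\\w+", term))

# [[:alnum:]] excludes the underscore, so the match stops before "_30".
with_alnum <- regmatches(term, regexpr("[[:alnum:]]+", term))

print(c(with_w, with_alnum))
# "AspectCos_30" "AspectCos"
```
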
# Append "Ps" for log terms and "Sq" for squared terms to the names found above
BestMy <-
  paste0(GoodMy, as.character(sapply(string_step, function(x) {
    if (str_detect(x, "log")) {
      "Ps"
    } else if (str_detect(x, "\\^2")) {
      "Sq"
    } else {
      ""
    }
  })))

all(Good == GoodMy, Best == BestMy) # yields TRUE
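
The suffix choice can also be written without sapply(). A minimal sketch, assuming the same priority as above (log is checked before ^2), using nested ifelse() with grepl(); the vector names here are illustrative and not from the answer.

```r
# Illustrative inputs: the extracted pieces and their cleaned base names.
step <- c("(Slope_500)", "(Slope_500)^2", "log(SlopeVar_30 + 0.001)")
base_name <- c("Slope", "Slope", "SlopeVar")

# fixed = TRUE matches the literal text, so "^" needs no escaping.
suffix <- ifelse(grepl("log", step, fixed = TRUE), "Ps",
                 ifelse(grepl("^2", step, fixed = TRUE), "Sq", ""))

labels <- paste0(base_name, suffix)
print(labels)
# "Slope" "SlopeSq" "SlopeVarPs"
```

Because grepl() and ifelse() are vectorised, the whole vector is processed in one pass without an explicit per-element function.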
