From the text string below, I am trying to extract a specific string subset.

string <- c("(Intercept)", "scale(AspectCos_30)", "scale(CanCov_500)", 
            "scale(DST50_30)", "scale(Ele_30)", "scale(NDVI_Tin_250)", "scale(Slope_500)", 
            "I(scale(Slope_500)^2)", "scale(SlopeVar_30)", "scale(CanCov_1000)", 
            "scale(NDVI_Tin_1000)", "scale(Slope_1000)", "I(scale(Slope_1000)^2)", 
            "scale(log(SlopeVar_30 + 0.001))", "scale(CanCov_30)", "scale(Slope_30)", 
            "I(scale(Slope_30)^2)")

A good result would return the central text without any special characters, as shown below.

Good <- c("Intercept", "AspectCos", "CanCov", "DST50", "Ele", "NDVI", "Slope", "Slope",
            "SlopeVar", "CanCov", "NDVI", "Slope", "Slope", "SlopeVar", "CanCov", "Slope", "Slope")

Preferably, however, the result would also account for the ^2 and log associated with 'Slope' and 'SlopeVar', respectively. Specifically, every string containing ^2 would become 'SlopeSq' and every string containing log would become 'SlopeVarPs', as shown below.

Best <- c("Intercept", "AspectCos", "CanCov", "DST50", "Ele", "NDVI", "Slope", "SlopeSq",
          "SlopeVar", "CanCov", "NDVI", "Slope", "SlopeSq", "SlopeVarPs", "CanCov", "Slope", "SlopeSq")

I have a long, ugly, and inefficient code sequence that gets me nearly halfway to the Good result and would appreciate any suggestions.

Sam Firke
B. Davis

2 Answers

I am not the most efficient coder, so I like to chain several regex replacements, each handling one case (the comment on each line says what that step does):

library(stringr)
library(dplyr)  # provides the %>% pipe

outcome <- string %>% 
  str_replace_all(".*log\\((.*?)(_.+?)?\\).*", "\\1Ps") %>% # handle the "log" entry: keep the name, append "Ps"
  str_replace_all(".*\\((.*?\\))", "\\1") %>% # drop everything up to and including the innermost "("
  str_replace_all("(_\\d+)?\\)\\^2", "Sq") %>%  # turn a trailing ")^2" (and any "_<digits>") into "Sq"
  str_replace_all("(_.+)?\\)?", "") # remove trailing leftovers (e.g. "_30" and ")")


Best <- c("Intercept", "AspectCos", "CanCov", "DST50", "Ele", "NDVI", "Slope", "SlopeSq",
          "SlopeVar", "CanCov", "NDVI", "Slope", "SlopeSq", "SlopeVarPs", "CanCov","Slope", "SlopeSq")
all(outcome == Best)
## TRUE
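To make the chain above easier to follow, here is a small trace of the same four replacements applied one at a time to three inputs from the question. The names `demo` and `step1` to `step4` are only illustrative; the patterns are the ones used in the answer.

```r
library(stringr)

# Three representative inputs taken from the question
demo <- c("scale(NDVI_Tin_250)",
          "I(scale(Slope_500)^2)",
          "scale(log(SlopeVar_30 + 0.001))")

# Step 1: only the "log" entry changes
step1 <- str_replace_all(demo, ".*log\\((.*?)(_.+?)?\\).*", "\\1Ps")
# "scale(NDVI_Tin_250)" "I(scale(Slope_500)^2)" "SlopeVarPs"

# Step 2: drop everything up to and including the innermost "("
step2 <- str_replace_all(step1, ".*\\((.*?\\))", "\\1")
# "NDVI_Tin_250)" "Slope_500)^2)" "SlopeVarPs"

# Step 3: turn "_<digits>)^2" into "Sq"
step3 <- str_replace_all(step2, "(_\\d+)?\\)\\^2", "Sq")
# "NDVI_Tin_250)" "SlopeSq)" "SlopeVarPs"

# Step 4: strip whatever follows the first underscore, plus any ")"
step4 <- str_replace_all(step3, "(_.+)?\\)?", "")
# "NDVI" "SlopeSq" "SlopeVarPs"

print(rbind(step1, step2, step3, step4))
```

Because each step only touches the strings it is meant for, the order matters: the "log" entry has to be handled first, before the generic parenthesis stripping would discard the "log" marker.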
amatsuo_net

I think this can be achieved with the package stringr.

First, you want the "central text" inside the innermost parentheses, so the regex below rules out any parenthesized text that itself contains parentheses. I keep "log" and "^2" in the match for later use.

library(stringr)

# Innermost parenthesized group, with an optional leading "log" and trailing "^2"
string_step <- str_extract(string,
                           "(log|)\\([^()]+\\)(\\^2|)")

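As a quick check of what this extraction keeps, the sketch below applies the same pattern to three of the question's inputs. The names `demo` and `inner` are illustrative only.

```r
library(stringr)

demo <- c("(Intercept)",
          "I(scale(Slope_500)^2)",
          "scale(log(SlopeVar_30 + 0.001))")

# [^()]+ cannot cross a parenthesis, so only the innermost group can match;
# the optional "log" prefix and "^2" suffix are kept as markers.
inner <- str_extract(demo, "(log|)\\([^()]+\\)(\\^2|)")
print(inner)
# "(Intercept)" "(Slope_500)^2" "log(SlopeVar_30 + 0.001)"
```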
Next, note that everything after the first underscore is dropped and only a run of letters (and digits) is kept. Unlike \w (written "\\w" in an R string), which also matches the underscore, "[:alnum:]+" behaves like "[A-Za-z0-9]+" for this ASCII input, so I use it here.
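The difference between the character classes can be seen on a single term. This is a minimal illustration; the variable names are made up for the example.

```r
library(stringr)

x <- "Slope_500"

with_w     <- str_extract(x, "\\w+")          # \w includes "_", so the whole term matches
with_alnum <- str_extract(x, "[:alnum:]+")    # stops at the underscore
with_range <- str_extract(x, "[A-Za-z0-9]+")  # explicit ASCII range, same result here

print(c(with_w, with_alnum, with_range))
# "Slope_500" "Slope" "Slope"
```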

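# Strip the markers and parentheses, then keep the leading run of letters/digits
GoodMy <-
  str_extract(str_replace_all(string_step, "log|\\(|\\)|\\^2", ""),
              "[:alnum:]+")

# Append "Ps" for log terms and "Sq" for squared terms to the cleaned names
BestMy <-
  paste0(GoodMy, as.character(sapply(string_step, function(x) {
    if (str_detect(x, "log")) {
      "Ps"
    } else if (str_detect(x, "\\^2")) {
      "Sq"
    } else {
      ""
    }
  })))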

all(Good == GoodMy, Best == BestMy) # yields TRUE
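If this needs to be reused, the same steps could be wrapped in one vectorised function. The sketch below is one possible arrangement, not part of the original answer: `clean_term` is a hypothetical helper name, and it swaps the `sapply()`/`if` block for nested `ifelse()` calls. It assumes, as in the question's data, that no variable name itself contains "log".

```r
library(stringr)

# Hypothetical helper: returns the cleaned name with an optional "Ps"/"Sq" suffix
clean_term <- function(term) {
  # Innermost parenthesized group, keeping the "log" and "^2" markers
  inner <- str_extract(term, "(log|)\\([^()]+\\)(\\^2|)")
  # Remove markers and parentheses, then keep the leading letters/digits
  base <- str_extract(str_replace_all(inner, "log|\\(|\\)|\\^2", ""),
                      "[:alnum:]+")
  # Vectorised suffix choice: "Ps" for log terms, "Sq" for squared terms
  suffix <- ifelse(str_detect(inner, "log"), "Ps",
                   ifelse(str_detect(inner, fixed("^2")), "Sq", ""))
  paste0(base, suffix)
}

result <- clean_term(c("(Intercept)", "scale(CanCov_500)",
                       "I(scale(Slope_30)^2)",
                       "scale(log(SlopeVar_30 + 0.001))"))
print(result)
# "Intercept" "CanCov" "SlopeSq" "SlopeVarPs"
```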