I'm not sure I worded my question all that well, but this is essentially what I am trying to do.

Data example:

Data <- c("NELIG_Q1_1_C1_A", "NELIG_N1_1_EG1_B", "NELIG_V2_1_NTH_C", "NELIG_Q2_1_C5_Q",
"NELIG_N1_1_C1_RA", "NELIG_Q1_1_EG1_QR", "NELIG_V2_1_NTH_PQ", "NELIG_N2_1_C5_PRQ")

I want to filter using str_detect on the last set of letter combinations. There will always be four "_" before the string/pattern I am looking for, but after the fourth "_" there could be many different letter combinations. In the above example, I am trying to detect only the letter "Q".

If I do a simple Data2 <- Data %>% filter(str_detect(column, "Q")) I would get all rows that have Q anywhere in the string. How can I tell it to focus on the last section only?

Checht

3 Answers

1

If I understand your question correctly, then you can do something like this:

library(stringr)
str_detect(Data, ".*_.*_.*_.*_.*Q.*$")
#R> [1] FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE

This will detect if there is any "Q" after the fourth "_".
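
For what it's worth, the same check can be spelled a bit more compactly. This is only a sketch of an alternative, not the pattern used in this answer: it assumes stringr's default (ICU) regex engine, where `(?:...)` is a non-capturing group and `[^_]*` matches any run of characters other than "_".

```r
library(stringr)

Data <- c("NELIG_Q1_1_C1_A", "NELIG_N1_1_EG1_B", "NELIG_V2_1_NTH_C", "NELIG_Q2_1_C5_Q",
          "NELIG_N1_1_C1_RA", "NELIG_Q1_1_EG1_QR", "NELIG_V2_1_NTH_PQ", "NELIG_N2_1_C5_PRQ")

# ^(?:[^_]*_){4}  consumes the first four "_"-terminated sections
# .*Q             then requires a "Q" somewhere in what is left
compact <- str_detect(Data, "^(?:[^_]*_){4}.*Q")
compact
```

On the example data this gives the same TRUE/FALSE vector as the longer pattern; the repeat count `{4}` just makes the "four underscores" requirement explicit.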

Looking at the title:

detecting string after 4 constant characters

then you can make a general function that does this like this:

# Returns TRUE if `what` occurs after `after` has occurred at least
# four times.
#
# Args:
#   x      character vector to check.
#   what   pattern to look for after the fourth occurrence of `after`.
#   after  pattern that must occur four times first.
detect_after_four_times <- function(x, what, after){
  reg <- sprintf(".*%s.*%s.*%s.*%s.*%s.*$", after, after, after, after, 
                 what)
  str_detect(x, reg)
}

detect_after_four_times(Data, "Q", "_")
#R> [1] FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE
detect_after_four_times(Data, "R", "_") # look for R instead
#R> [1] FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE  TRUE

# also returns FALSE, as expected, if "after" occurs only three times
detect_after_four_times("only_three_dashes_Q", "Q", "_")
#R> [1] FALSE
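
One caveat with the function above: `what` and `after` are pasted straight into a regular expression, so a separator such as "." would be treated as "any character". A possible variant that matches both arguments literally and takes the number of occurrences as a parameter is sketched below. The name `detect_after_n` and the argument `n` are made up for illustration; the sketch relies only on `str_locate_all()`, `str_sub()`, `str_detect()` and `fixed()` from stringr.

```r
library(stringr)

# Hypothetical helper: TRUE if the literal string `what` occurs after the
# n-th literal occurrence of `after` in each element of x.
detect_after_n <- function(x, what, after, n = 4) {
  # one matrix per element of x, with a start/end row for every occurrence of `after`
  locs <- str_locate_all(x, fixed(after))
  vapply(seq_along(x), function(i) {
    # fewer than n occurrences: nothing can come "after the n-th"
    if (nrow(locs[[i]]) < n) return(FALSE)
    # keep only the text that follows the n-th occurrence
    rest <- str_sub(x[i], locs[[i]][n, "end"] + 1)
    str_detect(rest, fixed(what))
  }, logical(1))
}

Data <- c("NELIG_Q1_1_C1_A", "NELIG_N1_1_EG1_B", "NELIG_V2_1_NTH_C", "NELIG_Q2_1_C5_Q",
          "NELIG_N1_1_C1_RA", "NELIG_Q1_1_EG1_QR", "NELIG_V2_1_NTH_PQ", "NELIG_N2_1_C5_PRQ")

res_q   <- detect_after_n(Data, "Q", "_")
res_dot <- detect_after_n(c("a.b.c.d.Q", "abcdQ"), "Q", ".")
```

Here `res_q` should match the result of `detect_after_four_times(Data, "Q", "_")`, while `res_dot` is TRUE only for the string that really contains four dots.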
1

If you want to use the tidyverse:

library(magrittr)

data <- tibble::tibble(Col =  c("NELIG_Q1_1_C1_A", "NELIG_N1_1_EG1_B", 
                                "NELIG_V2_1_NTH_C", "NELIG_Q2_1_C5_Q",
                                "NELIG_N1_1_C1_RA", "NELIG_Q1_1_EG1_QR", 
                                "NELIG_V2_1_NTH_PQ", "NELIG_N2_1_C5_PRQ"))

data %>% 
  dplyr::mutate(Col = purrr::map_lgl(Col,
                                     ~ stringr::str_detect(
                                       unlist(
                                         stringr::str_split(.x, 
                                                            "_"))[5], 
                                       "Q")))
#> # A tibble: 8 x 1
#>   Col  
#>   <lgl>
#> 1 FALSE
#> 2 FALSE
#> 3 FALSE
#> 4 TRUE 
#> 5 FALSE
#> 6 TRUE 
#> 7 TRUE 
#> 8 TRUE

Created on 2020-11-05 by the reprex package (v0.3.0)
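
Since the question asks about filtering rows, here is a rough sketch of how the same idea might look with `filter()` and without the per-row `map_lgl()`. It assumes the fifth "_"-separated section is the one of interest and uses `stringr::str_split_fixed()`, which returns a character matrix with one column per piece; the column name `last_section` and the object name `filtered` are only illustrative.

```r
library(dplyr)
library(stringr)

data <- tibble(Col = c("NELIG_Q1_1_C1_A", "NELIG_N1_1_EG1_B",
                       "NELIG_V2_1_NTH_C", "NELIG_Q2_1_C5_Q",
                       "NELIG_N1_1_C1_RA", "NELIG_Q1_1_EG1_QR",
                       "NELIG_V2_1_NTH_PQ", "NELIG_N2_1_C5_PRQ"))

filtered <- data %>%
  # split into at most 5 pieces; column 5 holds everything after the fourth "_"
  mutate(last_section = str_split_fixed(Col, "_", 5)[, 5]) %>%
  # keep only rows whose last section contains a "Q"
  filter(str_detect(last_section, "Q"))

filtered
```

Keeping `last_section` as its own column leaves the original `Col` values intact, which may be handy if the rows are needed afterwards.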

Florian
1

If the aim is to detect/match those strings that contain Q in the 'section' after the last _, then this works:

grep("_[A-Z]*Q[A-Z]*$", Data, value = TRUE, perl = TRUE)
[1] "NELIG_Q2_1_C5_Q"   "NELIG_Q1_1_EG1_QR" "NELIG_V2_1_NTH_PQ" "NELIG_N2_1_C5_PRQ"

or, with str_detect:

library(stringr)
str_detect(Data, "_[A-Z]*Q[A-Z]*$")
[1] FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE

Data:

Data <- c("NELIG_Q1_1_C1_A", "NELIG_N1_1_EG1_B", "NELIG_V2_1_NTH_C", "NELIG_Q2_1_C5_Q",
          "NELIG_N1_1_C1_RA", "NELIG_Q1_1_EG1_QR", "NELIG_V2_1_NTH_PQ", "NELIG_N2_1_C5_PRQ")
Chris Ruehlemann
  • This is probably my favorite answer, as it is the most 'strict' in using tidyverse-type solutions. I just want to understand how/where in "_[A-Z]*Q[A-Z]*$" it tells it to only search the last section. I understand we are telling it to detect a pattern in Data that starts with _ and is followed by characters, but I am unsure what the *Q does and what the *$ does. – Checht Nov 10 '20 at 22:42
  • 1
    `_[A-Z]*Q[A-Z]*$` catches the last 'section', as you say, because of the anchor `$`. That's a zero-width metacharacter anchoring a pattern to a **position** in the string, namely the very end of it! (The opposite anchor is `^`, which ties a pattern to the very start of the string.) – Chris Ruehlemann Nov 11 '20 at 06:59
  • Thank you! That makes quite a bit of sense. – Checht Nov 12 '20 at 02:53
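
To make the role of each piece of the pattern a little more concrete, here is a small sketch comparing the pattern with and without the `$` anchor on the example data. As I read it, `[A-Z]*` means "zero or more uppercase letters", so `_[A-Z]*Q[A-Z]*` is an underscore, optional letters, a Q, and optional letters again; the variable names are only illustrative.

```r
library(stringr)

Data <- c("NELIG_Q1_1_C1_A", "NELIG_N1_1_EG1_B", "NELIG_V2_1_NTH_C", "NELIG_Q2_1_C5_Q",
          "NELIG_N1_1_C1_RA", "NELIG_Q1_1_EG1_QR", "NELIG_V2_1_NTH_PQ", "NELIG_N2_1_C5_PRQ")

# Without "$" the match may sit anywhere, e.g. the "_Q" in "NELIG_Q1_..."
unanchored <- str_detect(Data, "_[A-Z]*Q[A-Z]*")

# With "$" the match has to run up to the very end of the string, and since
# [A-Z] matches neither "_" nor digits, it can only start at the last "_"
anchored <- str_detect(Data, "_[A-Z]*Q[A-Z]*$")

# The text that the anchored pattern actually matched (NA where there is no match)
matched <- str_extract(Data, "_[A-Z]*Q[A-Z]*$")
```

The first element is the telling one: it is TRUE in `unanchored` but FALSE in `anchored`, because its only Q is in the second section rather than the last.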