I have a data frame from which I want to select the important columns and then keep only the rows whose values have a specific ending.

Regular expressions make it simple to define a single ending with the xx$ pattern. But how do I match any of several possible endings (xx$, yy$)?

Dummy example:

library(dplyr)

x <- c("aa", "aa", "aa", "bb", "cc", "cc", "cc")
y <- c(101, 102, 113, 201, 202, 344, 407)
type <- rep("zz", 7)
df <- data.frame(x, y, type)

# Select all rows where y ends with "7"
df %>%
  select(x, y) %>%
  filter(grepl("7$", y))

# This works when I hard-code the endings (note that "|" is literal inside [],
# so "[27]$" would be enough), but I need to pass them as a vector instead
df %>%
  select(x, y) %>%
  filter(grepl("[2|7]$", y))  # need to modify this to use multiple endings


# How can I modify this expression to use a vector of endings (ids) instead?
ids <- c(7, 2)     # vector of the endings I want to match

df %>%
  select(x, y) %>%
  filter(grepl("ids$", y))  # does not work: how do I pass ids to grepl()?

Expected output:

   x   y
1 aa 102
2 cc 202
3 cc 407

Example based on this question: Regular expressions (RegEx) and dplyr::filter()

maycca
  • Thank you, this works well if I specify `grepl("[2|7]$", y)`. But as this is only a dummy example, I need to rewrite it to use a vector of values instead, i.e. `ids = c(2,7)`. How do I put this into the `grepl` statement? `grepl("ids$", y)` obviously does not work... – maycca Jun 18 '19 at 12:51
  • Ok, this seems to work: `df %>% select(x, y) %>% filter(grepl(paste(ids, collapse="|"), y))`. But I don't understand why I did not have to specify the `$` at the end of the regex this time. Can you please post your comment as an answer? I understand that there are many examples, but I could not work out how to put them together... Thank you again for your help! :) – maycca Jun 18 '19 at 14:25
  • Use `paste0` to add what remains, `df %>% select(x, y) %>% filter(grepl(paste0("(?:", paste(ids, collapse="|"), ")$"), y))` – Wiktor Stribiżew Jun 18 '19 at 14:25
  • thank you. Please, can you place your comment as an answer, that I can accept it? – maycca Jun 19 '19 at 15:56
  • Oh yes, sorry, I did not notice this answer before! Thank you for sharing this. I would still be happy to have an answer posted to my question, as maybe some other dummies will ask the same question in different words and might find this one first. Moreover, my question combines selecting columns with `select` and then rows with `filter`, which is missing from the suggested answer. – maycca Jun 20 '19 at 08:28
  • Ok, I see the question has raised some interest, I posted an answer. Please clean up the comments. – Wiktor Stribiżew Jun 20 '19 at 10:16

1 Answer

You may use

df %>%
  select(x, y) %>%
  filter(grepl(paste0("(?:", paste(ids, collapse="|"), ")$"), y))

The paste0("(?:", paste(ids, collapse="|"), ")$") part builds an alternation pattern that matches only at the end of the string because of the $ anchor at the end.
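
To see what filter() actually receives, the pattern can be built and tested on its own. This is a minimal base R sketch using the y and ids values from the question; the names pattern and keep are introduced here only for illustration.

```r
# Data from the question
y   <- c(101, 102, 113, 201, 202, 344, 407)
ids <- c(7, 2)

# Join the endings with "|" and wrap them in a group anchored by "$"
pattern <- paste0("(?:", paste(ids, collapse = "|"), ")$")
pattern   # "(?:7|2)$"

# grepl() coerces the numeric y to character before matching
keep <- grepl(pattern, y)
y[keep]   # 102 202 407
```

The logical vector keep is what filter() uses to decide which rows to retain, so the dplyr pipeline should return the rows with y equal to 102, 202 and 407.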

NOTE: If the values can contain special regex metacharacters, you need to escape the values in the character vector first:

regex.escape <- function(string) {
  gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
}
df %>%
  select(x, y) %>%
  filter(grepl(paste0("(?:", paste(regex.escape(ids), collapse="|"), ")$"), y))  # note regex.escape(ids)
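
The effect of the escaping step can be checked in isolation. The sketch below reuses the regex.escape helper from above; the values and endings vectors are hypothetical and chosen only because the ending "1.5" contains a metacharacter.

```r
# Same helper as above: prefix regex metacharacters with a backslash
regex.escape <- function(string) {
  gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
}

# Hypothetical data: the ending "1.5" contains the metacharacter "."
values  <- c("v1.5", "v105", "v2.5")
endings <- c("1.5")

raw_pattern     <- paste0("(?:", paste(endings, collapse = "|"), ")$")
escaped_pattern <- paste0("(?:", paste(regex.escape(endings), collapse = "|"), ")$")

grepl(raw_pattern, values)      # TRUE TRUE FALSE: "." matches any character, so "v105" matches
grepl(escaped_pattern, values)  # TRUE FALSE FALSE: only a literal "1.5" ending matches
```

Without escaping, "v105" is a false positive; for purely numeric endings such as 7 and 2 the escaping step changes nothing.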

For example, paste0("(?:", paste(c("7", "8", "ids"), collapse="|"), ")$") will output (?:7|8|ids)$:

  • (?: - start of a non-capturing group that acts as a container for the alternatives, so that the $ anchor is applied to all of them and not just to the last one, matching any of
    • 7 - a 7 char
    • | - or
    • 8 - an 8 char
    • | - or
    • ids - an ids substring
  • ) - end of the group
  • $ - end of the string.
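
The role of the group is easiest to see by comparing the pattern with and without it. This is a small sketch in base R; the test strings in s are hypothetical.

```r
s <- c("71", "17", "12", "21")   # hypothetical test strings

no_anchor  <- grepl("7|2", s)        # no anchor: a 7 or a 2 anywhere matches
last_only  <- grepl("7|2$", s)       # "$" binds only to "2": a 7 anywhere still matches
whole_group <- grepl("(?:7|2)$", s)  # "$" applies to both alternatives

no_anchor    # TRUE TRUE TRUE TRUE
last_only    # TRUE TRUE TRUE FALSE
whole_group  # FALSE TRUE TRUE FALSE
```

This also explains why paste(ids, collapse="|") alone appeared to work in the comments: without the $ it matches the digits anywhere in the value, not only at the end.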
Wiktor Stribiżew