Use perl=TRUE regex in dplyr select

Question

How can I select columns in dplyr using a Perl-style regex, i.e. the kind that grep accepts with perl = TRUE?

data.frame(baa=0,boo=0,boa=0,lol=0,bAa=0) %>% dplyr::select(matches("(?i)b(?!a)"))

Error in grep(needle, haystack, ...) : invalid regular expression '(?i)b(?!a)', reason 'Invalid regexp'

The regex is indeed valid when the Perl engine is used:

grep("(?i)b(?!a)",c("baa","boo","boa","lol","bAa"),perl=TRUE)

> [1] 2 3
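The difference between the two calls above comes down to the regex engine: without perl = TRUE, grep uses R's default (TRE) engine, which has no lookahead support, while perl = TRUE switches to PCRE. A minimal base-R sketch of that contrast, with variable names chosen here purely for illustration:

```r
x <- c("baa", "boo", "boa", "lol", "bAa")
pattern <- "(?i)b(?!a)"

# Default engine: the lookahead (?!a) is rejected, so capture the error
# message instead of letting it stop the script.
default_result <- tryCatch(
  suppressWarnings(grep(pattern, x)),
  error = function(e) conditionMessage(e)
)

# PCRE engine: the same pattern compiles and matches.
perl_result <- grep(pattern, x, perl = TRUE)

print(default_result)  # an error message (character), not indices
print(perl_result)     # integer indices of the matching elements
```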

Is there a shortcut function or another way to do this?

I don't see how you are calling it a valid regex. In `(?i)`, `?` is an optional quantifier applied to nothing; it should be preceded by something. – Rahul Dec 19 '17 at 14:56
@Rahul you can learn all about regex at https://regex101.com/. It's a very cool site. – Andre Elrico Dec 19 '17 at 15:23

LyzandeR · Accepted Answer · 2017-12-19T18:16:05.483

8

matches() in dplyr does not support perl = TRUE. However, you can write your own helper function. After a bit of digging in the source code, this works:

The fast way:

library(dplyr)

#notice the 3 colons because grep_vars is not exported from dplyr
matches2 <- function (match, ignore.case = TRUE, vars = current_vars()) 
{
  dplyr:::grep_vars(match, vars, ignore.case = ignore.case, perl = TRUE)
}

data.frame(baa=0,boo=0,boa=0,lol=0,bAa=0) %>% select(matches2("(?i)b(?!a)"))
#boo boa
#1   0   0

Or a more explanatory solution:

matches2 <- function (match, ignore.case = TRUE, vars = current_vars()) 
{
  grep_vars2(match, vars, ignore.case = ignore.case)
}

#this is pretty much my only change in the original dplyr:::grep_vars
#to make it accept perl.
grep_vars2 <- function (needle, haystack, ...) 
{
  grep(needle, haystack, perl = TRUE, ...)
}

data.frame(baa=0,boo=0,boa=0,lol=0,bAa=0) %>% 
  select(matches2("(?i)b(?!a)"))
#boo boa
#1   0   0
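For readers on a newer tidyverse, the wrapper may not be needed at all. The sketch below assumes an installed dplyr whose matches() helper (from tidyselect) accepts a perl argument and which provides all_of(); on older versions the first call would fail with an unused-argument error. The second variant avoids matches() entirely by computing the column names with base grep() first.

```r
library(dplyr)

df <- data.frame(baa = 0, boo = 0, boa = 0, lol = 0, bAa = 0)

# Assumption: this matches() has a `perl` argument (newer tidyselect only).
res_matches <- df %>% select(matches("(?i)b(?!a)", perl = TRUE))

# Fallback that does not rely on matches(): find the names with base
# grep(), then hand them to select() via all_of().
keep <- grep("(?i)b(?!a)", names(df), perl = TRUE, value = TRUE)
res_all_of <- df %>% select(all_of(keep))

names(res_matches)
names(res_all_of)
```

Neither variant touches unexported functions through `:::`, so both should be less fragile across package updates than the wrappers above.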

edited Dec 19 '17 at 18:16

answered Dec 19 '17 at 15:00

That's a valid workaround!! Is there really no function that can interpret Perl-like regex in dplyr select? I think that's a bug in dplyr that should be reported. – Andre Elrico Dec 19 '17 at 15:03
I would say an enhancement rather than a bug. But it can be implemented very easily, as you can see. You can always add it as an issue on GitHub or even submit a pull request, since it is quite easy to do. – LyzandeR Dec 19 '17 at 15:04
As a side note, you might also directly skip dplyr vocabulary in the pipe to maintain all functionality of any function you use, e.g. `grep` or `stringi::stri_detect`, etc. as follows (note the dot): `df %>% .[,grep("(?i)b(?!a)", colnames(.), perl = T)]` – Manuel Bickel Dec 19 '17 at 15:06
@ManuelBickel Maybe worth posting as an alternative answer? – LyzandeR Dec 19 '17 at 15:07
All right, will do so, just thought your answer is already good enough. – Manuel Bickel Dec 19 '17 at 15:08
In the current suite of tidyverse functions, it looks like `dplyr:::grep_vars` needs to be changed to `tidyselect:::grep_vars`, as the selection helper functions now reside in the `tidyselect` package. – eipi10 Nov 21 '19 at 18:00

score 1 · Answer 2 · answered Dec 19 '17 at 15:21

Another approach, along the same lines as LyzandeR's suggestion but probably more dangerous:

body(matches)[[grep("grep_vars", body(matches))]] <- substitute(grep_vars(match, vars, ignore.case = ignore.case, perl = TRUE))

data.frame(baa=0,boo=0,boa=0,lol=0,bAa=0) %>% dplyr::select(matches("(?i)b(?!a)"))
  boo boa
1   0   0

I would not hard-code the position as body(matches)[[3]], because any update that changes the function body would cause this little patch to create problems.
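To see the mechanics of this kind of patch without touching dplyr, here is a small base-R sketch on a toy function. The function shout is hypothetical and exists only for illustration; the point is that the statement is located by its text with grep() rather than by a fixed index.

```r
# Hypothetical toy function, used only to illustrate the technique.
shout <- function(x) {
  y <- toupper(x)
  paste0(y, "!")
}

# Locate the statement by its text instead of a hard-coded position.
idx <- grep("toupper", body(shout))
stopifnot(length(idx) == 1)

# Swap in a replacement statement.
body(shout)[[idx]] <- quote(y <- tolower(x))

shout("Abc")
```

Note that assigning to body(matches) creates a modified copy named matches in the current workspace, which then masks the package version; rm(matches) removes the patched copy again.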

Manuel Bickel · Answer 3 · 2017-12-20T09:29:25.397

As an amendment/side note to LyzandeR's answer, here is a version that does not use dplyr vocabulary, only the magrittr pipe. Hence, writing wrapper functions, specifying arguments, etc. can be skipped.

This is a bit more verbose than dplyr, but it is less verbose than base R and lets you use the full flexibility of any function, such as grep or stringi::stri_detect.

It is also significantly faster in this example; see the benchmarks below. Of course, speed would have to be checked on larger examples: the overhead of dplyr is quite large for this small example, so a fair speed comparison depends on the use case.

df <- data.frame(baa=0,boo=0,boa=0,lol=0,bAa=0)

library(magrittr)
df %>% 
.[,grep("(?i)b(?!a)", names(.), perl = T)]
#    boo boa
# 1   0   0

# the following is a copy of LyzandeR's approaches
library(dplyr)
matches2 <- function (match, ignore.case = TRUE, vars = current_vars()) {
  dplyr:::grep_vars(match, vars, ignore.case = ignore.case, perl = TRUE)
}

grep_vars2 <- function (needle, haystack, ...) {
  grep(needle, haystack, perl = TRUE, ...)
}

matches3 <- function (match, ignore.case = TRUE, vars = current_vars()) {
  grep_vars2(match, vars, ignore.case = ignore.case)
}

library(microbenchmark)
microbenchmark(
  df %>% select(matches2("(?i)b(?!a)")),
  df %>% select(matches3("(?i)b(?!a)")),
  df %>% .[,grep("(?i)b(?!a)", names(.), perl = T)]
)

# Unit: microseconds
# expr                                                    min       lq      mean    median        uq       max neval
# df %>% select(matches2("(?i)b(?!a)"))              3994.867 4309.877 4570.6414 4555.8065 4726.9310  6618.769   100
# df %>% select(matches3("(?i)b(?!a)"))              3981.841 4177.834 4792.2025 4396.3275 4655.6780 31812.876   100
# df %>% .[, grep("(?i)b(?!a)", names(.), perl = T)]  183.164  210.797  242.1678  237.2455  263.6935   554.624   100
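One caveat with the bracket approach: when the pattern matches exactly one column, `.[, idx]` on a data.frame drops to a plain vector instead of returning a one-column data frame. A minimal base-R sketch of a guard against that, where select_perl is a hypothetical helper name and not part of any package:

```r
df <- data.frame(baa = 0, boo = 0, boa = 0, lol = 0, bAa = 0)

# Hypothetical helper: select columns by a Perl-style regex and always
# return a data frame, even for a single match (drop = FALSE).
select_perl <- function(data, pattern, ...) {
  idx <- grep(pattern, names(data), perl = TRUE, ...)
  data[, idx, drop = FALSE]
}

two_cols <- select_perl(df, "(?i)b(?!a)")
one_col  <- select_perl(df, "^l")

names(two_cols)
class(one_col)

# For comparison: without drop = FALSE a single match yields a vector.
class(df[, grep("^l", names(df), perl = TRUE)])
```

Because the data frame is the first argument, the helper can also be used in a pipe, e.g. `df %>% select_perl("(?i)b(?!a)")`.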
