Regular expression matching inside dplyr

Question

When answering this question, I wrote the following code:

df <- data.frame(Call_Num = c("HV5822.H4 C47 Circulating Collection, 3rd Floor", "QE511.4 .G53 1982 Circulating Collection, 3rd Floor", "TL515 .M63 Circulating Collection, 3rd Floor", "D753 .F4 Circulating Collection, 3rd Floor", "DB89.F7 D4 Circulating Collection, 3rd Floor"))

require(stringr)

matches = str_match(df$Call_Num, "([A-Z]+)(\\d+)\\s*\\.")
df2 <- data.frame(df, letter=matches[,2], number=matches[,3])

Now my question is: Is there a simple way to combine the last two lines into one dplyr call, presumably using mutate()? Alternatively, I'd be interested in a solution with do() as well. For the mutate() approach, since we're extracting two groups, I'll accept a solution that calls str_match() twice with different regular expressions, one for each desired group.

Edit: To clarify, the main challenge I see here is that str_match returns a matrix, and I'm wondering how to handle that in mutate() or do(). I'm not interested in solutions to the original problem using other methods of extracting the information. There are plenty of such solutions given already here.
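To make the matrix issue concrete, here is a minimal sketch on two of the call numbers from the data above (the names `calls` and `m` are introduced only for this illustration). It shows that str_match() returns a character matrix with the full match in the first column and one column per capture group, so a single column has to be selected before the result can become an ordinary data frame column.

```r
library(stringr)

# Two of the call numbers from the example data
calls <- c("HV5822.H4 C47 Circulating Collection, 3rd Floor",
           "D753 .F4 Circulating Collection, 3rd Floor")

m <- str_match(calls, "([A-Z]+)(\\d+)\\s*\\.")

is.matrix(m)  # str_match() returns a character matrix, not a vector
dim(m)        # one row per input string; columns: full match, group 1, group 2
m[, 2]        # first capture group (the leading letters)
m[, 3]        # second capture group (the digits)
```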

score 7 · Answer 1 · answered Jul 07 '15 at 13:57

You could do this with extract() from the tidyr package:

library(tidyr)

# remove = FALSE keeps the original Call_Num column next to the new ones
extract(df, Call_Num, into = c("letter", "number"), regex = "([A-Z]+)(\\d+)\\s*\\.", remove = FALSE)

                                             Call_Num letter number
1     HV5822.H4 C47 Circulating Collection, 3rd Floor     HV   5822
2 QE511.4 .G53 1982 Circulating Collection, 3rd Floor     QE    511
3        TL515 .M63 Circulating Collection, 3rd Floor     TL    515
4          D753 .F4 Circulating Collection, 3rd Floor      D    753
5        DB89.F7 D4 Circulating Collection, 3rd Floor     DB     89

It's not dplyr, but as stated on the CRAN page linked above, tidyr "is designed specifically for data tidying (not general reshaping or aggregating) and works well with dplyr data pipelines."

Thanks. This doesn't address the main issue of the question, though. My question is not about solving the problem of extracting the information, it's about how to handle a function that returns a matrix when using `mutate` (or `do`). — Claus Wilke, Jul 07 '15 at 14:04

akrun · Accepted Answer · 2015-07-07T15:10:58.623

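You can try with do():

library(dplyr)

# Drop the full-match column ([,-1]) and bind the two capture groups to df
df %>% 
  do(data.frame(., str_match(.$Call_Num,  "([A-Z]+)(\\d+)\\s*\\.")[,-1],
                              stringsAsFactors=FALSE)) %>%
  # Note: rename_() is deprecated in more recent dplyr releases
  rename_(.dots=setNames(names(.)[-1],c('letter', 'number')))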
#                                             Call_Num letter number
#1     HV5822.H4 C47 Circulating Collection, 3rd Floor     HV   5822
#2 QE511.4 .G53 1982 Circulating Collection, 3rd Floor     QE    511
#3        TL515 .M63 Circulating Collection, 3rd Floor     TL    515
#4          D753 .F4 Circulating Collection, 3rd Floor      D    753
#5        DB89.F7 D4 Circulating Collection, 3rd Floor     DB     89

Or, as @SamFirke commented, the columns can also be renamed with

  ---                                    %>%
 setNames(., c(names(.)[1], "letter", "number"))
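For the mutate() route that the question allows (calling str_match() once per desired group), a minimal sketch might look like the following. It is not part of the original answer: the `pattern` variable and the three-row data frame are assumptions made to keep the example self-contained, and each call simply indexes one column of the returned matrix so that mutate() receives a plain character vector.

```r
library(dplyr)
library(stringr)

# Small stand-in for the data frame from the question
df <- data.frame(
  Call_Num = c("HV5822.H4 C47 Circulating Collection, 3rd Floor",
               "QE511.4 .G53 1982 Circulating Collection, 3rd Floor",
               "DB89.F7 D4 Circulating Collection, 3rd Floor"),
  stringsAsFactors = FALSE
)

pattern <- "([A-Z]+)(\\d+)\\s*\\."

# Column 1 of the matrix is the full match; columns 2 and 3 are the groups
df2 <- df %>%
  mutate(letter = str_match(Call_Num, pattern)[, 2],
         number = str_match(Call_Num, pattern)[, 3])

df2
```

The regular expression is evaluated twice here, which is wasteful on large inputs; the do() version above runs it once at the cost of a separate renaming step.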

The last line could also be simply `setNames(., c(names(.)[1], "letter", "number"))` — Sam Firke, Jul 07 '15 at 15:06
