5

I have a problem where I'm trying to extract numbers from a string containing text and numbers and then create two new columns showing the Min and Max of the numbers.

For example, I have one column and a string of data like this:

Text
Section 12345.01 to section 12345.02

And I want to create two new columns from the data in the Text column, like this:

Min        Max   
12345.01   12345.02

I'm using dplyr and stringr with regex, but the regex only extracts the first occurence of the pattern (first number).

df%>%dplyr::mutate(SectionNum = stringr::str_extract(Text, "\\d+.\\d+"))

If I try to use the stringr::str_extract_all function. It seems to extract both occurence of the pattern, but it creates a list in the tibble, which I find is a real hassle. So I'm stuck on the first step, just trying to get the numbers out into their own columns.

Can anyone recommend the most efficient way to do this? Ideally I'd like to extract the numbers from the string, convert them to numbers as.numeric and then run min() and max() functions.

acylam
  • 18,231
  • 5
  • 36
  • 45
Seth Brundle
  • 160
  • 7

3 Answers3

6

With extract from tidyr. extract turns each regex capture group into its own column. convert = TRUE is convenient in that it coerces the resulting columns to the best format. remove = FALSE can be used if we want to keep the original column. The last mutate is optional to make sure that the first number extracted is really the minimum:

library(tidyr)
library(purrr)

df %>%
  extract(Text, c("Min", "Max"), "([\\d.]+)[^\\d.]+([\\d.]+)", convert = TRUE) %>%
  mutate(Min = pmap_dbl(., min),
         Max = pmap_dbl(., max))

Output:

       Min      Max
1 12345.02 12345.03

Data:

df <- structure(list(Text = structure(1L, .Label = "Section 12345.03 to section 12345.02", class = "factor")), class = "data.frame", row.names = c(NA, 
-1L), .Names = "Text")
acylam
  • 18,231
  • 5
  • 36
  • 45
  • 3
    Great answer! (+1). First I've seen the `tidyr::extract()` function, and it's awesome. Although, I will say that this could run into problems for OP if the min and max are not predictably the first and second matches. – duckmayr Sep 24 '18 at 19:51
  • I'll note that this approach is concise but works only when there are just the two numbers in the `Text` column, and they are ordered with the first being smaller. Unclear from OP whether this is the case – Calum You Sep 24 '18 at 19:56
  • Thanks for this solution. The `tidyr::extract()` function is really cool. It's also the first I've seen it. Typically the section numbers will be in the correct order, but it's possible there could be a situation where they aren't. – Seth Brundle Sep 24 '18 at 20:15
  • @RyanMinaker All we needed to do is add the last `mutate` step to make sure that the order does not mess up the columns. See my update. – acylam Sep 24 '18 at 20:16
5

Using some other tidyverse tools, you can either approach this by unnesting the list-column and using group_by and summarise semantics (the more dplyr way), or you can just deal with the list-col as-is and use map_dbl to extract the max and min from each row (a more purrr way). My benchmarks have map_dbl about 7 times faster than unnest and dplyr, and about 15% faster than extract, though this is only on the one row.

library(tidyverse)
df <- tibble(
  Text = c("Section 12345.01 to section 12345.02")
)

df %>%
  mutate(SectionNum = str_extract_all(Text, "\\d+\\.\\d+")) %>%
  unnest %>%
  group_by(Text) %>%
  summarise(min = min(as.numeric(SectionNum)), max = max(as.numeric(SectionNum)))
#> # A tibble: 1 x 3
#>   Text                                    min    max
#>   <chr>                                 <dbl>  <dbl>
#> 1 Section 12345.01 to section 12345.02 12345. 12345.

df %>%
  mutate(
    SectionNum = str_extract_all(Text, "\\d+\\.\\d+"),
    min = map_dbl(SectionNum, ~ min(as.numeric(.x))),
    max = map_dbl(SectionNum, ~ max(as.numeric(.x)))
  )
#> # A tibble: 1 x 4
#>   Text                                 SectionNum    min    max
#>   <chr>                                <list>      <dbl>  <dbl>
#> 1 Section 12345.01 to section 12345.02 <chr [2]>  12345. 12345.

Created on 2018-09-24 by the reprex package (v0.2.0).

Calum You
  • 14,687
  • 4
  • 23
  • 42
2

There have already been answers which say how to accomplish your final goal as asked in the question, but just to address the question of how you can find the first or second match using the stringr package, you can use the str_match function, and specify the specific match you are interested in by referring to the column of str_match.

library(stringr)

Text <- "Section 12345.01 to section 12345.02"

str_match(Text, "^[^0-9.]*([0-9.]*)[^0-9.]*([0-9.]*)[^0-9.]*$")[2]
#> [1] "12345.01"
str_match(Text, "^[^0-9.]*([0-9.]*)[^0-9.]*([0-9.]*)[^0-9.]*$")[3]
#> [1] "12345.02"

Created on 2018-09-24 by the reprex package (v0.2.0).

Kerry Jackson
  • 1,821
  • 12
  • 20