I'm trying to extract an "ATC code" from a text string in R using the str_extract function from stringr.

Within the string, the code is always preceded by "ATC: "; the code itself is a combination of letters and numbers strung together.

The current output is partially working, but I'm struggling to match "A07AX": if I make the digit optional, the pattern matches even less than required.

Original dataframe:

library(dplyr)

data01 <-
  rbind(data.frame(text = "abc (ATC: A07BA51) fdfv"),
        data.frame(text = "abc (ATC: A07AX) dsaf"),
        data.frame(text = "abc (ATC: M01AE01) dff"))

                     text
1 abc (ATC: A07BA51) fdfv
2   abc (ATC: A07AX) dsaf
3  abc (ATC: M01AE01) dff

Code to extract the ATC group:

library(stringr)

data02 <-
  data01 %>%
  mutate(atc_group = gsub("ATC:|\\s", "", str_extract(text, "ATC:\\s([A-Z]+\\d+)+"))) 

Current output:

                     text atc_group
1 abc (ATC: A07BA51) fdfv   A07BA51
2   abc (ATC: A07AX) dsaf       A07
3  abc (ATC: M01AE01) dff   M01AE01
Sam Gilbert

1 Answer

With dplyr and stringr, we can extract the characters that are not ")" and that follow "ATC: " by using a regex lookbehind, (?<=ATC:\\s).

library(dplyr)
library(stringr)
data01 %>% 
      mutate(atc_group=str_extract(text, '(?<=ATC:\\s)[^)]+'))
#                    text atc_group
#1 abc (ATC: A07BA51) fdfv   A07BA51
#2   abc (ATC: A07AX) dsaf     A07AX
#3  abc (ATC: M01AE01) dff   M01AE01
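
If the code is not always followed by a closing parenthesis, a stricter variant may be safer. The following is a minimal sketch, assuming the codes contain only uppercase letters and digits (as in the three sample strings); the character class [A-Z0-9] is my assumption, not something stated in the question.

```r
library(stringr)

# Sample strings copied from the question
text <- c("abc (ATC: A07BA51) fdfv",
          "abc (ATC: A07AX) dsaf",
          "abc (ATC: M01AE01) dff")

# The lookbehind anchors the match right after "ATC: "; the character class
# then consumes letters and digits in any order, so a trailing "AX" is kept
atc_group <- str_extract(text, "(?<=ATC:\\s)[A-Z0-9]+")
print(atc_group)
#[1] "A07BA51" "A07AX"   "M01AE01"
```

Because letters and digits share one character class, the pattern does not require each run of letters to be followed by digits, which is what stopped the original attempt at "A07".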

Or we can use extract from library(tidyr). We capture, with a group (...), the alphanumeric characters that follow ATC: and one or more whitespace characters (\\s+).

library(tidyr)
extract(data01, text, into='atc_group', 
         '.*\\(ATC:\\s+([[:alnum:]]+)\\).*', remove=FALSE)
#                     text atc_group
#1 abc (ATC: A07BA51) fdfv   A07BA51
#2   abc (ATC: A07AX) dsaf     A07AX
#3  abc (ATC: M01AE01) dff   M01AE01

We can also use gsub to strip everything except the code:

gsub('.*ATC:\\s+|\\).*', '', data01$text)
#[1] "A07BA51" "A07AX"   "M01AE01"
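
One difference worth keeping in mind is how the approaches behave when a string has no "ATC: " marker at all. The sketch below uses a hypothetical second input ("no code here") that is not part of the question's data, purely to illustrate the point.

```r
library(stringr)

# Hypothetical input: the second string has no "ATC: " marker
x <- c("abc (ATC: A07AX) dsaf", "no code here")

# gsub finds nothing to delete in the second string and returns it unchanged
via_gsub <- gsub('.*ATC:\\s+|\\).*', '', x)
print(via_gsub)
#[1] "A07AX"        "no code here"

# str_extract reports a non-match as NA instead
via_extract <- str_extract(x, '(?<=ATC:\\s)[^)]+')
print(via_extract)
#[1] "A07AX" NA
```

If some rows may lack a code, the NA from str_extract is probably easier to detect downstream than an unchanged string.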
akrun