I'm trying to extract an "ATC code" from a text string in R using the str_extract function from stringr.
The code always appears in the string after "ATC: ", and the code itself is a combination of uppercase letters and digits strung together.
My current pattern partially works, but it fails to match "A07AX" in full: the code ends in letters, and my pattern expects every run of letters to be followed by at least one digit. If I make the digit optional, it matches even less than required.
Original dataframe:
library(dplyr)
data01 <-
  rbind(data.frame(text = "abc (ATC: A07BA51) fdfv"),
        data.frame(text = "abc (ATC: A07AX) dsaf"),
        data.frame(text = "abc (ATC: M01AE01) dff"))
                     text
1 abc (ATC: A07BA51) fdfv
2   abc (ATC: A07AX) dsaf
3  abc (ATC: M01AE01) dff
Code to extract the ATC group:
library(stringr)
data02 <-
  data01 %>%
  mutate(atc_group = gsub("ATC:|\\s", "", str_extract(text, "ATC:\\s([A-Z]+\\d+)+")))
Current output:
                     text atc_group
1 abc (ATC: A07BA51) fdfv   A07BA51
2   abc (ATC: A07AX) dsaf       A07
3  abc (ATC: M01AE01) dff   M01AE01
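
For reference, here is a minimal sketch of one direction that might work. It is not a confirmed solution. It assumes the code only ever contains uppercase letters and digits, and that exactly one whitespace character follows "ATC:". Instead of requiring alternating letter and digit runs, it uses a single character class with a lookbehind, so the gsub clean-up step is no longer needed:

```r
library(dplyr)
library(stringr)

# Same example data as above, built in one call
data01 <- data.frame(
  text = c("abc (ATC: A07BA51) fdfv",
           "abc (ATC: A07AX) dsaf",
           "abc (ATC: M01AE01) dff"),
  stringsAsFactors = FALSE
)

# (?<=ATC:\\s) is a lookbehind: the match must be preceded by "ATC:" plus
# one whitespace character, but that prefix is not part of the result.
# [A-Z0-9]+ then takes any run of uppercase letters and digits, so a code
# ending in letters (A07AX) is captured in full.
# Assumption: codes contain only A-Z and 0-9.
data02 <-
  data01 %>%
  mutate(atc_group = str_extract(text, "(?<=ATC:\\s)[A-Z0-9]+"))

print(data02)
```

With the three example rows this should give A07BA51, A07AX and M01AE01. If the amount of whitespace after "ATC:" can vary, the lookbehind would need adjusting (or a capture group with str_match could be used instead).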