How do I split strings into a number and the remaining string using stringr in R?

Question

I would like to split strings in my dataframe using stringr.

The following is my dataframe:

df<-data.frame(ID = 1:26, 
           DRUG_STRENGTH = c("50 MG", "1250 MG", "20 MG", "200 MG", "2MG", "60MG", NA, "300IU", 
                             NA, "600 MG", "500MG", "625MG", NA, NA, "50MG/ML", "40MG", "200MG", 
                             "200MG", "200MG", "5 MG", "5 MG", "200MG", "300IU/3ML", "0.05%", 
                             "112.5 BILLION", "10.8MG"))

My desired dataframe is:

# > df
#   ID DRUG_STRENGTH DRUG_STRENGTH_NO DRUG_STRENGTH_UNIT
# 1   1         50 MG               50                 MG
# 2   2       1250 MG             1250                 MG
# 3   3         20 MG               20                 MG
# 4   4        200 MG              200                 MG
# 5   5           2MG                2                 MG
# 6   6          60MG               60                 MG
# 7   7          <NA>             <NA>               <NA>
# 8   8         300IU              300                 IU
# 9   9          <NA>             <NA>               <NA>
# 10 10        600 MG              600                 MG
# 11 11         500MG              500                 MG
# 12 12         625MG              625                 MG
# 13 13          <NA>             <NA>               <NA>
# 14 14          <NA>             <NA>               <NA>
# 15 15       50MG/ML               50              MG/ML
# 16 16          40MG               40                 MG
# 17 17         200MG              200                 MG
# 18 18         200MG              200                 MG
# 19 19         200MG              200                 MG
# 20 20          5 MG                5                 MG
# 21 21          5 MG                5                 MG
# 22 22         200MG              200                 MG
# 23 23     300IU/3ML              300             IU/3ML
# 24 24         0.05%             0.05                  %
# 25 25 112.5 BILLION            112.5            BILLION
# 26 26        10.8MG             10.8                 MG

My code gives me the desired df, but is there a nicer way to write the regular expressions?

library(dplyr)    # mutate() and %>%
library(stringr)  # str_extract(), str_replace(), str_trim()

df <- df %>%
  mutate(DRUG_STRENGTH_NO = str_extract(DRUG_STRENGTH, pattern = "^\\d\\.?\\d?\\.?\\d?\\.?\\d*"),
         DRUG_STRENGTH_UNIT = str_trim(str_replace(DRUG_STRENGTH, pattern = "^\\d\\.?\\d?\\.?\\d?\\.?\\d*", replacement = "")))

A5C1D2H2I1M1N2O1R2T1 · Accepted Answer · 2018-02-02T09:33:04.123

I'd use tidyr's extract() for this:

library(tidyverse)
df %>% 
  extract(DRUG_STRENGTH, into = c("No", "Unit"), "([0-9.]+)(.*)", remove = FALSE)
##    ID DRUG_STRENGTH    No     Unit
## 1   1         50 MG    50       MG
## 2   2       1250 MG  1250       MG
## 3   3         20 MG    20       MG
## 4   4        200 MG   200       MG
## 5   5           2MG     2       MG
## 6   6          60MG    60       MG
## 7   7          <NA>  <NA>     <NA>
## 8   8         300IU   300       IU
## 9   9          <NA>  <NA>     <NA>
## 10 10        600 MG   600       MG
## 11 11         500MG   500       MG
## 12 12         625MG   625       MG
## 13 13          <NA>  <NA>     <NA>
## 14 14          <NA>  <NA>     <NA>
## 15 15       50MG/ML    50    MG/ML
## 16 16          40MG    40       MG
## 17 17         200MG   200       MG
## 18 18         200MG   200       MG
## 19 19         200MG   200       MG
## 20 20          5 MG     5       MG
## 21 21          5 MG     5       MG
## 22 22         200MG   200       MG
## 23 23     300IU/3ML   300   IU/3ML
## 24 24         0.05%  0.05        %
## 25 25 112.5 BILLION 112.5  BILLION
## 26 26        10.8MG  10.8       MG

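Note that the second group keeps any space that followed the number (for example, "50 MG" gives " MG"), so you may need to trim whitespace from Unit afterwards, e.g. with str_trim().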
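
One way to avoid the leftover space, rather than cleaning it up afterwards, is to let the pattern consume it. The following is only a sketch on a four-row sample of the question's data (the sample and the name `out` are made up for illustration); it assumes tidyr is installed and adds `\\s*` between the two capture groups:

```r
library(tidyr)

# Small sample taken from the question's data (illustrative subset)
df <- data.frame(ID = 1:4,
                 DRUG_STRENGTH = c("50 MG", "2MG", NA, "112.5 BILLION"))

# \\s* sits between the two groups and swallows any spaces after the number,
# so the second group starts at the first non-space character
out <- extract(df, DRUG_STRENGTH, into = c("No", "Unit"),
               regex = "([0-9.]+)\\s*(.*)", remove = FALSE)
out
```

Both new columns come back as character; passing convert = TRUE to extract() should turn No into a numeric column if that is what you need.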

Nice strategy +1 ... this was easier than I thought it would be. — Tim Biegeleisen, Feb 02 '18 at 09:32

score 0 · Answer 2 · answered Feb 02 '18 at 10:45

Alternatively, if you first make sure the number and the remainder are separated by a delimiter such as a space, you could use strsplit() or str_split() (with or without simplify = TRUE). Regular expressions are more flexible, but they can get messy in more complicated situations.
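
A rough sketch of that idea, assuming stringr is installed: first normalise each value so that exactly one space follows the leading number, then split on the first space only. The vector `x` and the names `spaced` and `parts` are made up for illustration, and missing values are left out of the sample.

```r
library(stringr)

# A few values from the question's data (illustrative subset, no NAs)
x <- c("50 MG", "2MG", "300IU/3ML", "112.5 BILLION", "0.05%")

# Step 1: put exactly one space after the leading number,
# whether the original had zero, one, or several
spaced <- str_replace(x, "^([0-9.]+)\\s*", "\\1 ")

# Step 2: split into at most two pieces at the first space,
# so any later spaces stay inside the unit
parts <- str_split_fixed(spaced, " ", n = 2)
parts
```

The first column of parts holds the numbers and the second the units, both as character. Note that the normalising step still relies on a regular expression, so this mainly pays off when the data already has a consistent delimiter.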
