Extract numbers from Chemical Formula (missing the number of 1) in R

Question

I tried the answer from @Onyambu at "Extract numbers from Chemical Formula in R", but ran into a new problem. The reference code is as follows:

library(tidyverse)
library(stringr)  



dat%>%mutate(Composition=gsub("\\b([A-Za-z]+)\\b","\\11",Composition),
          name=str_extract_all(Composition,"[A-Za-z]+"),
          value=str_extract_all(Composition,"\\d+"))%>%
unnest()%>%spread(name,value,fill=0)
   m.z Intensity Relative Delta..ppm. RDB.equiv.    Composition  C  H Na O
1 149.0233   4083459    23.60       -0.08        6.5       C8 H5 O3  8  5  0 3
2 279.1591        NA    18.64       -0.03        5.5     C16 H23 O4 16 23  0 4
3 301.1409        NA   100.00       -0.34        5.5 C16 H22 O4 Na1 16 22  1 4

My question is how to process a formula like "C7H5NO4". I only get ("C", "H", "NO") and ("7", "5", "4"); the correct result is ("C", "H", "N", "O") and ("7", "5", "1", "4").

If we could insert a 1 between the "N" and the "O", the problem might be solved, but I do not know how to do that.

Thanks

Hees

Because you seem happy to use non-`base` packages, you may try `CHNOSZ::makeup`, as described e.g. [here](https://stackoverflow.com/a/42677087/1851712) (`L <- Map(makeup, strings)`; "L is a list of the fully parsed formulas") — Henrik, Feb 15 '21 at 14:03
Not sure if I get the question but if you want to match all letters/numbers individually, just get rid of the `+` in both regex — JBGruber, Feb 15 '21 at 14:04
@JBGruber, yes, I tried deleting the `+`, and it works for this case. But if there is `Na` in the formula, `Na` will be split into `N` and `a`, so it is complicated for me. Thanks — hees, Feb 15 '21 at 14:15

score 0 · Answer 1 · answered Feb 15 '21 at 15:35

Based on your other question, how about:

library(tidyverse)
library(tidytext)
df %>% 
  unnest_tokens(part, Composition, to_lower = FALSE) %>%
  mutate(col = str_extract(part, "[A-Za-z]+"),
         val = as.integer(str_extract(part, "\\d+"))) %>% 
  select(-part) %>% 
  mutate(val = ifelse(is.na(val), 1, val)) %>% 
  pivot_wider(names_from = col, 
              values_from  = val, 
              values_fill = 0)
#> # A tibble: 3 x 9
#>     m.z Intensity Relative Delta..ppm. RDB.equiv.     C     H     O    Na
#>   <dbl>     <dbl>    <dbl>       <dbl>      <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  301.       NA     100         -0.34        5.5    16    22     4     1
#> 2  149.  4083458.     23.6       -0.08        6.5     8     5     3     0
#> 3  279.       NA      18.6       -0.03        5.5    16    23     4     0

This relies on the spaces between items in `Composition`: `unnest_tokens()` from tidytext splits them into separate rows, and `pivot_wider()` then spreads them into separate columns. Items without a number get a count of 1.
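
For a formula written without spaces, such as "C7H5NO4" from the question, each whole formula is a single token, so the approach above would not split it. A minimal sketch of one alternative, assuming every element symbol is one uppercase letter optionally followed by lowercase letters (so `Na` stays together while `NO` splits into `N` and `O`), is shown here. The helper name `parse_formula` is hypothetical and not part of any package.

```r
library(stringr)

# Hypothetical helper (an illustration, not from the original answer):
# parse one formula string into a named integer vector of element counts.
parse_formula <- function(formula) {
  # Each match is an element symbol (uppercase letter plus optional
  # lowercase letters) followed by an optional run of digits.
  m <- str_match_all(formula, "([A-Z][a-z]*)(\\d*)")[[1]]
  counts <- as.integer(m[, 3])   # an empty digit string becomes NA
  counts[is.na(counts)] <- 1L    # a missing count means one atom
  setNames(counts, m[, 2])
}

parse_formula("C7H5NO4")
parse_formula("C16 H22 O4 Na")
```

Spaces are simply skipped by the pattern, so the same helper works for both spellings. It does not merge repeated symbols (for example the two `C` in "CH3COOH") and does not handle parentheses or charges; those would need extra steps.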
