How to parse chemical formula to get atom component?

Question

My data set is stored in a single column table named "Formula" which looks like this:

row.identity..main.ID.
C5H6O2N3
C10H12N
C5H6O2N3S

I want to extend the current table, where in each column the letters are written and in the line below the coresponding number is shown. Basically I want to have something like this:

row.identity..main.ID.   C  H  O  N  S  X
C5H6O2N3                 5  6  2  3  0  0   
C10H12N                 10 12  0  1  0  0
C5H6O2N3S                5  6  2  3  1  0

It would be great when the code is flexible for even longer data set with variating letters. So far, I tried to implement the solution from Onyambu.

library(tidyverse)
library(stringr)    
Formula%>%mutate(row.identity..main.ID.=gsub("\\b([A-Za-z]+)\\b","\\30",row.identity..main.ID.),
               elements=str_extract_all(row.identity..main.ID.,"[A-Za-z]+"),
               value=str_extract_all(row.identity..main.ID.,"\\d+"))%>%
          unnest()%>%pivot_wider(elements,value,fill=0)

However this is resulting in several errors like "Incompatible lengths: 4, 3." and/or cols is now required when using unnest().

`OrgMassSpecR::ListFormula` and then some massage to use it on a data.frame. — Jan Stanstrup, Sep 19 '22 at 07:40

Onyambu · Answer 1 · 2021-04-15T17:38:42.893

You could also do:

a<- sub("([A-Z]$)","\\1:1", gsub("(\\D+)(\\d+)", "\\1:\\2\n",df[,1]))
e <- sapply(a, function(x)data.frame(read.dcf(textConnection(x))))
f <- cbind(df, plyr::rbind.fill(e))
f[is.na(f)] <- 0
f

  row.identity..main.ID.  C  H O N S
1               C5H6O2N3  5  6 2 3 0
2                C10H12N 10 12 0 1 0
3              C5H6O2N3S  5  6 2 3 1

Another option is to convert the text to Json then read it into R:

a <- gsub("(\\D)(\\d+)", '"\\1":\\2,', df[,1])
b <- gsub("([A-Z])$", '"\\1":1', trimws(a, whitespace = ","))

cbind(df, jsonlite::fromJSON(sprintf("[{%s}]",paste(b, collapse = "}, {"))))
replace(f, is.na(f), 0)

  row.identity..main.ID.  C  H O N S
1               C5H6O2N3  5  6 2 3 0
2                C10H12N 10 12 0 1 0
3              C5H6O2N3S  5  6 2 3 1

Brilliant answer with `read.dcf` + `textConnection`, upvoted! — ThomasIsCoding, Apr 15 '21 at 16:23
Definitely an esoteric base R function, but remarkably compact. And it works. — Ian Campbell, Apr 15 '21 at 16:49

ThomasIsCoding · Accepted Answer · 2021-04-15T15:22:06.273

You can try the code below

df <- cbind(
  df,
  do.call(
    rbind,
    Map(function(x) {
      x <- gsub("(?<=[A-z])(?![0-9])","1",x,perl = TRUE)
      table(
        factor(rep(
          gsub("\\d+", "", x),
          as.numeric(gsub("\\D+", "", x))
        ), levels = c("C", "H", "O", "N", "S", "X"))
      )
    }, regmatches(df$ID, gregexpr("[A-z]+(\\d+)?", df$ID)))
  )
)

which gives

> df
         ID  C  H O N S X
1  C5H6O2N3  5  6 2 3 0 0
2   C10H12N 10 12 0 1 0 0
3 C5H6O2N3S  5  6 2 3 1 0

Data

> dput(df)
structure(list(ID = c("C5H6O2N3", "C10H12N", "C5H6O2N3S"), C = c(5L, 
10L, 5L), H = c(6L, 12L, 6L), O = c(2L, 0L, 2L), N = c(3L, 1L,
3L), S = c(0L, 0L, 1L), X = c(0L, 0L, 0L)), class = "data.frame", row.names = c(NA,
-3L))

Ian Campbell · Answer 3 · 2021-04-15T16:42:43.997

Here's an approach with purrr:

library(purrr); library(dplyr); library(stringr)
Formula %>% 
  pmap_dfr(~map2_dbl(c(...),LETTERS,
                     ~max((str_extract(.x,paste0("(?<=",.y,")[0-9]+"))%>% as.integer),
                          (str_extract(.x,.y) == .y),
                          na.rm = TRUE)) %>%
             replace(is.infinite(.),0) %>%
             set_names(LETTERS)) %>%
    select_if(~sum(.) > 0)
# A tibble: 3 x 5
      C     H     N     O     S
  <dbl> <dbl> <dbl> <dbl> <dbl>
1     5     6     3     2     0
2    10    12     1     0     0
3     5     6     3     2     1

Data

Formula <- structure(list(row.identity..main.ID. = c("C5H6O2N3", "C10H12N", 
"C5H6O2N3S")), class = "data.frame", row.names = c(NA, -3L))

How to parse chemical formula to get atom component?

3 Answers3