0

I am trying to extract the number of carbons, hydrogens, and oxygens from a chemical formula. I previously found some code that I have been trying to use. The problem is that the code only works when the chemical formula has more than 1 of the element.

V <- DATA # example: CH4O, H2O, C10H18O2
# V is a data.frame

C1 <- as.integer(sub("(?i).*?C:?\\s*(\\d+).*", "\\1", V))
# NA NA 10
H1 <- as.integer(sub("(?i).*?H:?\\s*(\\d+).*", "\\1", V))
# 4 2 18
O1 <- as.integer(sub("(?i).*?O:?\\s*(\\d+).*", "\\1", V))
# NA NA 2

I am currently using

is.na(C1) <- 1

to get the NA's changed to 1's and then manually changing the 0 values. Is there a more efficient code I can use to get the proper counts of the elements in the chemical formulas (specificially in the cases that the value is 0 or 1 and causing NA results). Let me know if you need more information or if I should change some of the format.

EDIT: The desired values would be to get all the correct counts without the NA and manually changing values to 0 if possible.

C1
# 1 0 10
H1
# 4 2 18
O1
# 1 1 2

EDIT2: Here is an example of the actual data I am importing

Meas. m/z   #   Ion Formula Score   m/z err [mDa]   err [ppm]   mSigma  rdb e¯ Conf Adduct  
84.080700   1   C5H10N  n.a.    84.080776   0.1 0.9 n.a.    2.0 even        
89.060100   1   C4H9O2  n.a.    89.059706   -0.4    -4.4    n.a.    1.0 even        
131.987800  1   C2H4N3P2    n.a.    131.987498  -0.3    -2.3    n.a.    6.0 even        
135.081100  1   C9H11O  n.a.    135.080441  -0.7    -4.9    n.a.    5.0 even        
135.117500  1   C10H15  n.a.    135.116827  -0.7    -5.0    n.a.    4.0 even        
136.061700  1   C5H6N5  n.a.    136.061772  0.1 0.5 n.a.    6.0 even        

In the initial question i just listed V as coming from a vector of forlumas, but what I actually have is a data.frame with other information and I use V[,3] when performing the calculations to get the column of interest.

akrun
  • 874,273
  • 37
  • 540
  • 662
TCB at UGA
  • 109
  • 6
  • Is it a single string. Please show the expected output – akrun Mar 17 '20 at 18:10
  • No, I am pulling the formulas from a data.frame containing other information as well. So it is a single column that I am interested in. – TCB at UGA Mar 17 '20 at 18:13
  • Can you update with the expected output. THe values you are getting seems correct to me – akrun Mar 17 '20 at 18:14
  • I updated the post with the desired outcome. Currently the outcome is correct for the code I am using, was just hoping to improve it to take out the manual checks for 0 values. – TCB at UGA Mar 17 '20 at 18:18
  • Could you please provide sample input in valid R syntax? Something like `V = data.frame(val = c("CH4O", "H2O", "C10H18O2"))`, if that is indeed what `V` looks like? I'm surprised, because you say `V` is a data frame, but you use `V` in the code as if it's a vector. In addition to clarifying your data structure, it would also give everyone a standard set of inputs to test on, which tends to drastically speed up how quickly people will answer and make sure the results are correct. – Gregor Thomas Mar 17 '20 at 18:41
  • I added what a few rows of the actual data I am importing looks like. Sorry for the confusion and let me know if I need to add/change anything else. – TCB at UGA Mar 17 '20 at 18:52

3 Answers3

2

Here's an alternative:

vec <- c("CH4O", "H2O", "C10H18O2", "C2H4N3P2")

molecules <- regmatches(vec, gregexpr("\\b[A-Z][a-z]*\\d*", vec))
molecules <- lapply(molecules, function(a) paste0(a, ifelse(grepl("[^0-9]$", a), "1", "")))

atomcounts <- lapply(molecules, function(mol) setNames(as.integer(gsub("\\D", "", mol)), gsub("\\d", "", mol)))

atoms <- unique(unlist(sapply(atomcounts, names)))
atoms <- sapply(atoms, function(atom) sapply(atomcounts, function(a) if (atom %in% names(a)) a[atom] else 0))
rownames(atoms) <- vec
atoms
#           C  H O N P
# CH4O      1  4 1 0 0
# H2O       0  2 1 0 0
# C10H18O2 10 18 2 0 0
# C2H4N3P2  2  4 0 3 2
r2evans
  • 141,215
  • 6
  • 77
  • 149
  • 1
    Thank you for the reply. This code will actually work great for what I am trying to do, since it requires the data to be put into a matrix afterwards. Thanks for the help/effort! – TCB at UGA Mar 17 '20 at 19:13
1

Perhaps not the most elegant code I've ever written, but for a given chemical formula, this will return the counts of each element in a labeled vector. The intermediate step, counts.per.equation returns the count of each element in each equation.

library(stringr)

extract <- str_extract_all(c('CH4O', 'H2O', 'C10H18O2'), '\\D\\d*')

counts.per.equation <- lapply(extract, function(x) {
  elements <- str_extract_all(x, '\\D+')
  counts <- str_extract_all(x, '\\d+', simplify = T)
  counts[counts == ''] <- 1
  counts <- as.numeric(counts)
  names(counts) <- elements
  return(counts)
})

total.counts <- Reduce(function(x, y) {
  names <- union(names(x), names(y))
  counts <- sapply(names, function(z) sum(x[z], y[z], na.rm = T))
  names(counts) <- names
  return(counts)
} , counts.per.equation)

 C  H  O 
11 24  4 
jdobres
  • 11,339
  • 1
  • 17
  • 37
  • Thank you for the reply. This code works to solve for the data as well. The data is more difficult to work with afterwards since it isn't tabulated like the other reply. But it does answer what was being asked! – TCB at UGA Mar 17 '20 at 19:12
1

We can use base R methods with strsplit and xtabs

out <- do.call(rbind, Map(cbind,
   lapply(strsplit(gsub("(?<=[A-Z])(?![0-9])", "1", vec,
   perl = TRUE), "(?<=[A-Z])(?=[0-9])|(?<=[0-9])(?=[A-Z])", 
    perl = TRUE), function(x) data.frame(key = x[c(TRUE, FALSE)], 
       value = as.numeric(x[c(FALSE, TRUE)]))), grp = vec))

xtabs(value ~ grp + key, out)
#         key
#grp         C  H  O  N  P
#  CH4O      1  4  1  0  0
#  H2O       0  2  1  0  0
#  C10H18O2 10 18  2  0  0
#  C2H4N3P2  2  4  0  3  2

Or with tidyverse

library(stringr)
library(dplyr)
library(tidyr)
tibble(col1 = vec,
       col2 = str_replace_all(col1, "(?<=[A-Z])(?![0-9])", "1")) %>%
       separate_rows(col2, sep= "(?<=[A-Z])(?=[0-9])|(?<=[0-9])(?=[A-Z])") %>%
       group_by(col1) %>% 
       summarise(key = list(col2[c(TRUE, FALSE)]),
            val = list(col2[c(FALSE, TRUE)])) %>%
       unnest(c(key, val)) %>% 
   pivot_wider(names_from = key, values_from = val, values_fill = list(val = 0))
# A tibble: 4 x 6
#  col1     C     H     O     N     P    
#  <chr>    <chr> <chr> <chr> <chr> <chr>
#1 C10H18O2 10    18    2     0     0    
#2 C2H4N3P2 2     4     0     3     2    
#3 CH4O     1     4     1     0     0    
#4 H2O      0     2     1     0     0    

data

vec <- c("CH4O", "H2O", "C10H18O2", "C2H4N3P2")
akrun
  • 874,273
  • 37
  • 540
  • 662
  • 1
    Thank you for the reply and the effort. The code works well for my data and extracts the number of elements as requested. – TCB at UGA Mar 17 '20 at 19:43