How do you find similarities and changes within strings in two data frames?

Question

Hi, I have two large datasets and I'm trying to identify changes in molecular composition. I have two data frames: one from a sample before exposure to a reagent (NaNO3) and one from the sample after exposure.

The formulas have the base structure CxHyOzNd (x, y, z, and d being variables). Going from the exposed sample back to the pre-exposed sample, we should see a loss of N1O2 and a gain of H1 (the C count should be the same within matching formulas). The code should be able to find this change either through the elemental columns (C, H, O, N, S, P) or through the strings themselves (e.g. C13H26O2N).
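For the string route, one possible starting point is to turn each formula string into element counts, which reduces the string problem to the column problem. The following is only a minimal sketch in base R: `parse_formula` is a hypothetical helper name (not from any package), and it assumes flat formulas with no brackets or charges, where a symbol without a number means a count of 1.

```r
# Hypothetical helper: count the atoms of each element in a flat formula
# string such as "C11H12O4N". Assumes no brackets, charges, or isotopes.
parse_formula <- function(formula, elements = c("C", "H", "O", "N", "S", "P")) {
  # Each token is an element symbol followed by an optional count, e.g. "C11", "N".
  tokens  <- regmatches(formula, gregexpr("[A-Z][a-z]?[0-9]*", formula))[[1]]
  symbols <- sub("[0-9]*$", "", tokens)      # strip the trailing digits
  counts  <- sub("^[A-Za-z]+", "", tokens)   # strip the leading letters
  counts[counts == ""] <- "1"                # a missing count means one atom
  counts  <- as.integer(counts)

  # Start every requested element at zero, then add up what was found.
  out <- setNames(integer(length(elements)), elements)
  for (i in seq_along(tokens)) {
    if (symbols[i] %in% elements) {
      out[symbols[i]] <- out[symbols[i]] + counts[i]
    }
  }
  out
}

# One row of element counts per formula.
counts <- t(sapply(c("C11H13O2", "C11H12O4N", "C7H7O5N2"), parse_formula))
print(counts)
```

Elements outside the `elements` vector are silently ignored in this sketch, so it would need extending if other atoms can appear in the formulas.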

Ex. Pre-Exposed Sample

Composition  C  H  O  N  S  P
C11H13O2     11 13 2  0  0  0
C7H9O        7  9  1  0  0  0
C4H8         4  8  0  0  0  0
.....

Ex. Exposed Sample

Composition  C  H  O  N  S  P
C11H12O4N    11 12 4  1  0  0
C7H7O5N2     7  7  5  2  0  0
C3H6O        3  6  1  0  0  0
.....

As seen in the data frames, the molecular compositions changed by the addition of NO2 and the loss of one hydrogen (-H). I want to match a pre-exposed formula to an exposed formula whenever the two differ by this change (+NO2 and -H), and build a new data frame that links the two formulas. An example is shown below. The NO2 column shows the number of NO2 groups added and the H column shows the number of H lost. (It's possible that more than one NO2 group could be added.)

Ex. New dataframe showing the transformations

Pre-Comp    Exposed-Comp  NO2  H
C11H13O2    C11H12O4N     1    1
C7H9O       C7H7O5N2      2    2
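The matching rule behind this table could be sketched in base R roughly as follows. This is an illustration under stated assumptions, not a tested solution: it assumes each added NO2 group replaces exactly one H (so the NO2 and H counts are always equal), that C, S, and P are unchanged, and that the element columns are named as in the examples above. The names `pre`, `exposed`, `pairs`, and `transformations` are made up for the sketch.

```r
# Example data copied from the tables above.
pre <- data.frame(
  Composition = c("C11H13O2", "C7H9O", "C4H8"),
  C = c(11, 7, 4), H = c(13, 9, 8), O = c(2, 1, 0),
  N = c(0, 0, 0), S = c(0, 0, 0), P = c(0, 0, 0),
  stringsAsFactors = FALSE
)
exposed <- data.frame(
  Composition = c("C11H12O4N", "C7H7O5N2", "C3H6O"),
  C = c(11, 7, 3), H = c(12, 7, 6), O = c(4, 5, 1),
  N = c(1, 2, 0), S = c(0, 0, 0), P = c(0, 0, 0),
  stringsAsFactors = FALSE
)

# Pair every pre-exposed row with every exposed row that has the same C, S, P.
# Columns that differ get the suffixes ".pre" and ".exp".
pairs <- merge(pre, exposed, by = c("C", "S", "P"), suffixes = c(".pre", ".exp"))

# Differences between the exposed and pre-exposed formula in each pair.
n_no2  <- pairs$N.exp - pairs$N.pre   # N gained = candidate number of NO2 groups
h_lost <- pairs$H.pre - pairs$H.exp   # H lost
o_gain <- pairs$O.exp - pairs$O.pre   # O gained

# Assumption: k added NO2 groups means +k N, +2k O and -k H, with k >= 1.
keep <- n_no2 >= 1 & h_lost == n_no2 & o_gain == 2 * n_no2

transformations <- data.frame(
  Pre.Comp     = pairs$Composition.pre[keep],
  Exposed.Comp = pairs$Composition.exp[keep],
  NO2          = n_no2[keep],
  H            = h_lost[keep]
)
print(transformations)
```

Because `merge` builds every pair that shares the same C, S, and P before filtering, the intermediate `pairs` table can get large when many formulas share a carbon count. Note also that `merge` sorts by the key columns by default, so the row order may differ from the input.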

If this type of analysis can be done in R, please let me know. I'm guessing that the code would be some sort of if/then statement that goes through the data sets. I'm relatively new to R, so any code would be helpful. Thanks!

Are you only interested in adding NO2 and losing H, or are you looking for all possible changes, or perhaps the minimal changes (based on what metric?)? — Martin Gal, Jul 26 '21 at 21:18
Does the transformation always end in `NO2` or `H`? Which molecules can be lost or added? — TarJae, Jul 26 '21 at 22:48
Yes, I'm only interested in the addition of NO2 and the loss of H. The C number should be the same, and that is what dictates the matches. The only difference between the strings is that they lose H and have NO2 added. — David, Jul 27 '21 at 16:01

0 Answers