1

So I have a column called chemical formula for like 40,000 entries, and what I want to be able to do is count up how many elements are contained in the chemical formula. So for example:-

EXACT_MASS  FORMULA
626.491026  C40H66O5
275.173274  C13H25NO5

For this, I need some kind of formula that will return with the result of

C  H  O
40 66 5
13 25 5

all as separate columns for the different elements and in rows for the different entries. Is there a formula that can do this?

user2062207
  • 955
  • 4
  • 18
  • 34
  • Are the string fixed, I mean, does every entry has an alphabet then two number, an alphabet then two numbers, an alphabet and the rest of numbers? – Vasim Oct 14 '14 at 10:20
  • So every entry has an alphabet, but the numbers can vary. So for example you could get C10H60 or you could get CH4 which should count the C as 1 or even CH3O which should count the C and O as 1 and the H as 3 – user2062207 Oct 14 '14 at 10:27
  • K, but everything is standard, as in always starts with C followed by H and then O. Right? – Vasim Oct 14 '14 at 10:29
  • So basically it has to count C, H, O or any chemical element that may be present in the chemical formula. For example, H2CO3 would come out as H - 2, C - 1 and O - 3 – user2062207 Oct 14 '14 at 11:22
  • 1
    When working on this problem, be careful with case-sensitivity. CO2 (carbon dioxide) is C-1, O-2. Co2 (cobalt) is Co-2. – Rick Hitchcock Oct 14 '14 at 12:47

3 Answers3

4

You could make your own formula.

Open the VBA editor with ALT and F11 and insert a new module.

Add a reference to Microsoft VBScript Regular Expressions 5.5 by clicking Tools, then references.

Now add the following code:

Public Function FormulaSplit(theFormula As String, theLetter As String) As String

    Dim RE As Object
    Set RE = CreateObject("VBScript.RegExp")
    With RE
        .Global = True
        .MultiLine = False
        .IgnoreCase = False
        .Pattern = "[A-Z]{1}[a-z]?"
    End With

    Dim Matches As Object
    Set Matches = RE.Execute(theFormula)

    Dim TheCollection As Collection
    Set TheCollection = New Collection

    Dim i As Integer
    Dim Match As Object
    For i = (Matches.Count - 1) To 0 Step -1
        Set Match = Matches.Item(i)
        TheCollection.Add Mid(theFormula, Match.FirstIndex + (Len(Match.Value) + 1)), UCase(Trim(Match.Value))
        theFormula = Left(theFormula, Match.FirstIndex)
    Next

    FormulaSplit = "Not found"
    On Error Resume Next
    FormulaSplit = TheCollection.Item(UCase(Trim(theLetter)))
    On Error GoTo 0

    If FormulaSplit = "" Then
        FormulaSplit = "1"
    End If

    Set RE = Nothing
    Set Matches = Nothing
    Set Match = Nothing
    Set TheCollection = Nothing

End Function

Usage:

  • FormulaSplit("C40H66O5", "H") would return 66.
  • FormulaSplit("C40H66O5", "O") would return 5.
  • FormulaSplit("C40H66O5", "blah") would return "Not found".

You can use this formula directly in your workbook. Example of FormulaSplit usage in Workbook

Jamie Dunstan
  • 3,725
  • 2
  • 24
  • 38
  • Thank you for your help! I just gave it a go now, basically copied and pasted it in but I keep getting the following reply:- User-defined type is not defined – user2062207 Oct 14 '14 at 11:40
  • @user2062207 - did you add the reference to the `Microsoft VBScript Regular Expressions 5.5` library? In the VBA editor, click tools, then references. Then find `Microsoft VBScript Regular Expressions 5.5` in the list. Tick it and then press OK. – Jamie Dunstan Oct 14 '14 at 11:49
  • 1
    @user2062207 I have updated Jamie's Code to have late binding. I believe when Jamie approve it, you should be able to see the updated code and then you can check that. – Jur Pertin Oct 14 '14 at 12:02
  • @JurPertin - thanks for the help. I couldn't directly accept your edit but I have taken your idea on board and edited the answer. Thanks! – Jamie Dunstan Oct 14 '14 at 12:29
  • apologies for the late reply, the internet went down across our labs. Shall do now! – user2062207 Oct 14 '14 at 14:15
  • So I should of probably mentioned this before, its my fault! But there's some elements that will contain the same letters but are completely different. So for example there's C - carbon, but Cl - chlorine, and so yeah when I come across formulas such as C22H45ClO8S2 it messes up. Is there a way of somehow factoring that in? – user2062207 Oct 14 '14 at 14:39
  • I guess the best way of differentiating between the characters would be that every new element in the formula begins with a captial, so C22H44Cl2O8S2 has a lowercase l as Cl stands for chlorine, but every other individual element is captialised – user2062207 Oct 14 '14 at 14:41
3

I've had a stab at doing this in a formula nad come up with the following:

=IFERROR((MID($C18,FIND(D17,$C18)+1,2))*1,IFERROR((MID($C18,FIND(D17,$C18)+1,1))*1,IFERROR(IF(FIND(D17,$C18)>0,1),0)))

It's not very neat and would have to be expanded further if any of your elements are going to appear more than 99 times - I also used a random placement on my worksheet so the titles H,C and O are in row 17. I would personally go with Jamie's answer but just wanted to try this to see if I could do it in a formula possible and figured it was worth sharing just as another perspective.

Dave
  • 1,643
  • 1
  • 9
  • 9
  • Thank you nonetheless for helping! I'll give it a go to and see how it goes – user2062207 Oct 14 '14 at 14:15
  • This works perfectly for the two examples, so it gets my vote. Note that it will not work if an element has two characters. For CO2, it correctly returns 1 for carbon and 2 for oxygen. But for Co2, it incorrectly returns 1 for carbon and 1 for cobalt. I don't think you can do this with a function without using an array formula. – Rick Hitchcock Oct 17 '14 at 14:58
1

Even though this has an excellent (and accepted) VBA solution, I couldn't resist the challenge to do this without using VBA.

I posted a solution earlier, which wouldn't work in all cases. This new code should always work:

=MAX(
   IFERROR(IF(FIND(C$1&ROW($1:$99),$B2),ROW($1:$99),0),0),
   IFERROR(IF(FIND(C$1&CHAR(ROW($65:$90)),$B2&"Z"),1,0),0)
 )

Enter as an array formula: Ctrl + Shift + Enter

Output:

enter image description here

The formula outputs 0 when not found, and I simply used conditional formatting to turn zeroes gray.

How it works

This part of the formula looks for the element, followed by a number between 1 and 99. If found, the number of atoms is returned. Otherwise, 0 is returned. The results are stored in an array:

IFERROR(IF(FIND(C$1&ROW($1:$99),$B2),ROW($1:$99),0),0)

In the case of C13H25NO5, a search for "C" returns this array:

{1,0,0,0,0,0,0,0,0,0,0,0,13,0,0,0,...,0}

1 is the first array element, because C1 is a match. 13 is the thirteenth array element, and that's what we're interested in.

The next part of the formula looks for the element, followed by an uppercase letter, which indicates a new element. (The letters A through Z are characters 65 through 90.) If found, the number 1 is returned. Otherwise, 0 is returned. The results are stored in an array:

IFERROR(IF(FIND(C$1&CHAR(ROW($65:$90)),$B2&"Z"),1,0),0)

"Z" is appended to the chemical formula, so that a match will be found when its last element has no number. (For example, "H2O".) There is no element "Z" in the Periodic Table, so this won't cause a problem.

In the case of C13H25NO5, a search for "N" returns this array:

{0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0}

1 is the 15th element in the array. That's because it found the letters "NO", and O is the 15th letter of the alphabet.

Taking the maximum value from each array gives us the number of atoms as desired.

Rick Hitchcock
  • 35,202
  • 5
  • 48
  • 79