3

So I'm trying to code something to tell me the number of elements in any given compound. I'm not even sure where to start: I tried coding something but then realized that it only worked for simple compounds (or did not work at all). Here's an example of what I want:

>>> function_input : 'NaMg3Al6(BO3)3Si6O18(OH)4', 'O'

>>> function_return : 31

I've come so far in my mess of a code (IT DOESN'T WORK, it just illustrates my rough thought process):

def get_pos_from_count(string: str, letter: str):
    count = string.count(letter)
    lens = [-1]
    for i in range(count):
        lens += [string[lens[i] + 1:].index(letter) + lens[i] + 1]
    return lens[1:]



def find_number(string, letter):
    if string.count(letter) == 0: return 0
    numbers = '1234567890'
    try:
        mul1 = int(string[0])
    except ValueError:
        mul1 = 0
    mul2 = []
    sub_ = 1
    list_of_positions = get_pos_from_count(string, letter)
    for i in list_of_positions:
        try:
            sub_ += int(string[i + 1]) if string[i + 1] in numbers else 0
        except IndexError: pass
        if string[i + 1:].count(')') > string[i + 1].count('('):
            try:
                mul2 += int(string[string[i + 1:].count(')') + 1])
            except (IndexError, ValueError): pass
    return mul1 * sub_ * mul2

The approach I was trying to implement was:

  1. Find the number of occurrences of said element in compound.
  2. Find each subscript, multiplying by subscript outside bracket if said element is in bracket.
  3. Sum up all subscripts, multiply by number of compounds (first character in string)
  4. Return said number to user

But then I realized my code would either be extremely long or require recursion, which I don't know how to apply here.

If possible, I'd like a semi-working function, but a quick tip on how to approach this is also helpful!

And I don't want to use external libraries if possible.

tl;dr: This question for atomicity of elements, without external libraries (if possible).

EDIT: Yes, the question I linked does have hints on how to do this, but when I tried to make any code work for only one element, and set it's weight to 1, I ran into a host of issues I don't know to solve.

Codeman
  • 477
  • 3
  • 13
  • Don't the answers to the question you've linked already contain semi-working functions or quick tips on how to approach this? – mkrieger1 Jan 31 '22 at 11:05
  • *"But then I realized my code would either be extremely long or require recursion, which I don't know how to apply here."* Recursion is useful to handle brackets. In fact, it is very easy to handle brackets with recursion. I suggest dividing your task into two parts. First, assume there are no brackets, and write a non-recursive function that gives the correct answers for compounds that don't have brackets. Once this function is working properly, and only then, try to modify it to handle brackets using recursion. – Stef Jan 31 '22 at 11:27
  • Alternatively, parse the string first, to transform it into a nested list. `'NaMg3Al6(BO3)3Si6O18(OH)4'` --> `['Na', 'Mg', 3, 'Al', 6, ['B', 'O', 3], 3, 'Si', 6, 'O', 18, ['O', 'H'],4]`. Then discard any element you're not interested in: --> `[['O', 3], 3, 'O', 18, ['O'],4]` – Stef Jan 31 '22 at 11:31
  • You should create a [mre] to ask about the specific problem you are encountering with your current attempt. – mkrieger1 Jan 31 '22 at 11:45
  • Running your code snippet, I get `NameError: name 'sub' is not defined. Did you mean: 'sub_'?`. I assume `return mul1 * sub * mul2` should be `return mul1 * sub_ * mul2`? – Stef Jan 31 '22 at 15:49
  • @Stef yep, and thanks for the help! mkrieger1 that's the thing: I don't know how to approach the problem at all – Codeman Feb 02 '22 at 02:45

2 Answers2

2

Let us divide the task into three parts:

  • Tokenize the string into a list of elements, numbers, and brackets;
  • Parse the bracket to have a nested list with sublists;
  • Count the elements in the nested list.

Introducing my tools:

from more_itertools import split_when, pairwise
from itertools import chain
from collections import Counter

def nest_brackets(tokens, i = 0):
    l = []
    while i < len(tokens):
        if tokens[i] == ')':
            return i,l
        elif tokens[i] == '(':
            i,subl = nest_brackets(tokens, i+1)
            l.append(subl)
        else:
            l.append(tokens[i])
        i += 1
    return i,l

def parse_compound(s):
    tokens = [''.join(t) for t in split_when(s, lambda a,b: b.isupper() or b in '()' or (b.isdigit() and not a.isdigit()))]
    tokens = [(int(t) if t.isdigit() else t) for t in tokens]
    i, l = nest_brackets(tokens)
    assert(i == len(tokens)) # crash if unmatched ')'
    return l

def count_elems(parsed_compound):
    c = Counter()
    for a,b in pairwise(chain(parsed_compound, (1,))):
        if not isinstance(a, int):
            subcounter = count_elems(a) if isinstance(a, list) else {a: 1}
            n = b if isinstance(b, int) else 1
            for elem,k in subcounter.items():
                c[elem] += k * n
    return c

s = 'NaMg3Al6(B(CO2)3)3Si6O18(OH)4'

l = parse_compound(s)
print(l)
# ['Na', 'Mg', 3, 'Al', 6, ['B', ['C', 'O', 2], 3], 3, 'Si', 6, 'O', 18, ['O', 'H'], 4]


c = count_elems(l)
print(c)
# Counter({'O': 40, 'C': 9, 'Al': 6, 'Si': 6, 'Mg': 3, 'B': 3, 'Na': 1})

print(c['O'])
# 40
Stef
  • 13,242
  • 2
  • 17
  • 28
  • `print(count_elems(parse_compound('H2O'))['O'])` returns `0`, where does this problem originate? – Codeman Feb 02 '22 at 03:18
  • Alright, `H2O1` returns the right amount, I'm making a function that puts a `1` behind every element without a subscript. It would be awesome if you could integrate it though! – Codeman Feb 02 '22 at 05:09
  • @Codeman Ooooops you're right. The problem originates from `pairwise`, which ignores the last element. So the issue is not with all elements that don't have a subscript, but with the last element of the string if it doesn't have a subscript (and the last element of each parenthesized block). I'll try to find a better way to fix it. – Stef Feb 02 '22 at 09:21
  • @Codeman The most immediate solution I can think of is to replace `pairwise(parsed_compound)` with `pairwise(chain(parsed_compound,(1,)))`. – Stef Feb 02 '22 at 09:33
  • Aight, accepted answer. What does `chain(parsed_compound,(1,))` do, though? I'm not familiar with any of the itertools modules except `split_when`. – Codeman Feb 02 '22 at 10:03
  • @Codeman It chains iterables. `(1,)` is a tuple with one element. `chain((1,2,3), (4,), (5,6,7))` would be the same thing as `iter((1,2,3,4,5,6,7))`. Equivalently I could have written `pairwise(parsed_compound + [1])`, but that would have constructed the list `parsed_compound + [1]`, which is somewhat inefficient. – Stef Feb 02 '22 at 12:17
0

Try with this recursive function:

import re

def find_number(s, e):
    if s == '':
        return 0
    x = re.search('\(([A-Za-z0-9]*)\)([0-9]*)', s)
    if x is not None:
        return (find_number(s[:x.start()] + s[x.end():], e) +
                find_number(x.groups()[0], e) * int(x.groups()[1] or '1'))
    return sum(int(x.groups()[0] or '1')
               for x in re.finditer(f'{e}([0-9]*)', s))

Example:

>>> find_number('NaMg3Al6(BO3)3Si6O18(OH)4', 'O')
31
Riccardo Bucco
  • 13,980
  • 4
  • 22
  • 50