Indexing of strings (molecule SMILES)

Question

Also, could someone advise me on how to handle second-order parentheses? The process is the same as described below, but it should count only atoms inside second-order parentheses (the code below handles first order). Ideally the order would be easy to adjust. By second order I mean (()), for example C(C(C))C: every atom is 0 except the C enclosed by two pairs of parentheses. The same conditions otherwise apply. Much appreciated.
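To make "order" concrete, here is a minimal sketch that reports the parenthesis nesting depth of each atom character. The function name atom_depths and the constant SKIPPED are hypothetical; the skipped characters are copied from the skip list in the code further down, so the sketch shares that code's assumption about which characters count as atoms.

```python
# Characters that are not counted as atoms (same skip list as the question's code).
SKIPPED = set("[]+-0123456789ie")

def atom_depths(smiles):
    """Return the parenthesis nesting depth of each counted atom character."""
    depth = 0
    depths = []
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch not in SKIPPED:
            depths.append(depth)
    return depths

print(atom_depths("C(C(C))C"))  # [0, 1, 2, 0]
```

An atom with depth 1 sits in a first-order parenthesis, depth 2 in a second-order one, and so on.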

Hope you are well. I have this code where I am trying to index atoms within parentheses. I want all atoms that are not in parentheses (i.e. not branched) to be 0. For example:

CCC(C)CC
[0 0 0 1 0 0]

CC(CCC)CC
[0 0 1 2 3 0 0]

CC(C)(C)C
[0 0 1 1 0]

CC(CCC)(C)C
[0 0 1 2 3 1 0]

As the examples above show, I am counting the atoms within each pair of parentheses. However, any nested parenthesis (a branched atom within a branch) is given the value of the atom immediately before the nested branch, i.e. the atom it branches off.

For example, C(C(C)C)C would give [0 1 1 2 0].

This code works for all cases except ones like the following. Below are my desired output, the incorrect output, and my code. Thanks.

Desired output

CCC(CC(C)CC)(C)C  [0, 0, 0, 1, 2, 2, 3, 4, 1, 0]
             ^                             ^

Incorrect output

CCC(CC(C)CC)(C)C  [0, 0, 0, 1, 2, 2, 3, 4, 4, 0]
             ^                             ^

import pandas as pd
from rdkit import Chem

def smile_grouping(s):
    group_counter = 1
    res = []
    open_brackets = 0
    branch_start_index = None
    last_non_nested_group = None

    for i, letter in enumerate(s):
        if letter == '(':
            open_brackets += 1
            if open_brackets == 1:
                branch_start_index = i
                if last_non_nested_group is not None:
                    group_counter = last_non_nested_group + 1
        elif letter == ')':
            if open_brackets == 1:
                last_non_nested_group = None
            open_brackets -= 1
            if open_brackets == 0:
                branch_start_index = None
        elif letter not in ['[', ']', '+', '-', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'i', 'e']:
            if open_brackets == 1:
                if branch_start_index is not None and branch_start_index + 1 != i:
                    group_counter += 1
                res.append(group_counter)
                last_non_nested_group = group_counter
            elif open_brackets == 0:
                res.append(0)
            elif open_brackets > 1:
                res.append(last_non_nested_group)

    mol = Chem.MolFromSmiles(s)
    num_atoms = mol.GetNumAtoms()
    while len(res) < num_atoms:
        res.append(0)

    return res

df = pd.DataFrame(
    {'SMILES': [
        "CCC(CC(C)CC)(C)C",
        "CCC[I+](C)(C)C",
        "CCC(CCC(C))C"
    ]})
df['Indexed SMILES'] = df['SMILES'].apply(smile_grouping)
print(df)

I recommend that you learn about recursion. This is one way to solve the problem. — Code-Apprentice, Apr 03 '23 at 02:00
For `CCC(CC(C)...` Why does the nested `(C)` output a 2 and not a 3? — Code-Apprentice, Apr 03 '23 at 02:02
Because that nested (C) is bound to the second C in the parenthesis. In this (CC(C).. The first C in the bracket is 1, the second 2. The nested C is branched off the second C, so therefore 2 also. — YZman, Apr 03 '23 at 02:07
Ok, I think that makes sense, but seems to be a different rule from the first parentheses where `C(C...` would be `0 1 ...` rather than `0 0 ...`. — Code-Apprentice, Apr 03 '23 at 02:10
Yeah, nested branches have different rules. As a bit of context, I did this for the whole group, e.g. CC(C)C(C)C with the output being [1 2 2 3 3 4]. For a case with nests, I did CC(C(C)C)C with the output being [1 2 2 2 2 3]. This puts the nested atom on the same branch as the parent atom. I am now working within the parentheses to repeat this process. I will keep filtering until I have no more nested atoms in my list — YZman, Apr 03 '23 at 02:15
This is something I am doing specifically when I am filtering the main chain and the branch. I am removing the main chain, and now doing a branch count. An easy way to understand this is to imagine a tree. I am getting lengths along the tree and assigning the length along the trunk where branches are. Then I am selecting the branches and measuring along each branch for branches off that branch. I keep repeating this until there are no more branches. The issue I am currently having is that it is detecting the index incorrectly, as can be seen in my post. I also need a way to go deeper in branching — YZman, Apr 03 '23 at 02:20
which is why there are 0s. It is separating the main trunk from my branch (parenthesis) — YZman, Apr 03 '23 at 02:21
Since you are picturing this as a tree, you definitely should read about recursion. Trees and recursion usually go hand in hand. — Code-Apprentice, Apr 03 '23 at 02:22
Thanks will do. In the meantime, is there a possibility you can spot a mistake in my code? If not it's all good. — YZman, Apr 03 '23 at 02:24
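As an illustration of the recursion suggested in the comments, here is a minimal sketch that turns the parentheses of a SMILES string into nested lists, one list per branch. It is not the asker's indexing scheme, only the tree view it builds on. The name parse_branches is hypothetical, and the sketch assumes balanced parentheses and single-character atoms with no ring digits or bracket atoms.

```python
def parse_branches(smiles):
    """Parse parentheses into nested lists (assumes single-character atoms)."""
    def parse(pos):
        items = []
        while pos < len(smiles):
            ch = smiles[pos]
            if ch == "(":
                # Recurse into the branch; continue after its closing bracket.
                sub, pos = parse(pos + 1)
                items.append(sub)
            elif ch == ")":
                # End of the current branch: hand control back to the caller.
                return items, pos + 1
            else:
                items.append(ch)
                pos += 1
        return items, pos

    tree, _ = parse(0)
    return tree

print(parse_branches("C(C(C)C)C"))  # ['C', ['C', ['C'], 'C'], 'C']
```

Each level of list nesting corresponds to one order of parenthesis, so a per-order count can be written as a walk over this structure.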

score 1 · Accepted Answer · answered Apr 03 '23 at 03:00

You need a way to reset the group counter once the last closing bracket of a branch has been passed. I added a reset to your open-bracket logic:

        if open_brackets == 1:
            branch_start_index = i
            if last_non_nested_group is not None:
                group_counter = last_non_nested_group + 1
            else:
                group_counter = 1

This resets the group counter when a new open bracket is entered after the previous branch's final closing bracket.
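For reference, a sketch of the question's function with this change applied. The RDKit padding step is left out (an assumption, made only so the example runs without RDKit installed), and the constant name SKIPPED is introduced for illustration; the rest follows the original logic.

```python
# Characters that are not counted as atoms (same skip list as the question's code).
SKIPPED = set("[]+-0123456789ie")

def smile_grouping(s):
    group_counter = 1
    res = []
    open_brackets = 0
    branch_start_index = None
    last_non_nested_group = None

    for i, letter in enumerate(s):
        if letter == '(':
            open_brackets += 1
            if open_brackets == 1:
                branch_start_index = i
                if last_non_nested_group is not None:
                    group_counter = last_non_nested_group + 1
                else:
                    # The fix: start counting from 1 again for a new branch.
                    group_counter = 1
        elif letter == ')':
            if open_brackets == 1:
                last_non_nested_group = None
            open_brackets -= 1
            if open_brackets == 0:
                branch_start_index = None
        elif letter not in SKIPPED:
            if open_brackets == 1:
                # Every atom after the first one in the branch advances the counter.
                if branch_start_index is not None and branch_start_index + 1 != i:
                    group_counter += 1
                res.append(group_counter)
                last_non_nested_group = group_counter
            elif open_brackets == 0:
                res.append(0)
            else:
                # Nested atoms inherit the value of the atom they branch off.
                res.append(last_non_nested_group)
    return res

print(smile_grouping("CCC(CC(C)CC)(C)C"))  # [0, 0, 0, 1, 2, 2, 3, 4, 1, 0]
```

The second-to-last value is now 1 rather than 4, which matches the desired output in the question.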

beh aaron

Thank you so much. Also, do you have an idea of how to repeat this process for higher-order parentheses while the lower orders are 0s? The same logic, but for higher-order parentheses. So for example CC(C(C(C)C)C)CC [0 0 0 1 1 2 0 0 0]. Do you know a way to easily change it to account for this? – YZman Apr 03 '23 at 03:15

score -1 · Answer 2 · answered Apr 03 '23 at 03:59

Here is my solution. It works for different orders (except order 0, for which I have separate code).

import pandas as pd
from rdkit import Chem

def smile_grouping(s, order):
    group_counter = 1
    res = []
    open_brackets = 0
    branch_start_index = None
    last_non_nested_group = None

    for i, letter in enumerate(s):
        if letter == '(':
            open_brackets += 1
            if open_brackets == order:
                branch_start_index = i
                if last_non_nested_group is not None:
                    group_counter = last_non_nested_group + 1
        elif letter == ')':
            if open_brackets == order:
                last_non_nested_group = None
            open_brackets -= 1
            if open_brackets == order - 1:
                branch_start_index = None
                group_counter = 1
                last_non_nested_group = None
            continue
        elif letter not in ['[', ']', '+', '-', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'i', 'e']:
            if open_brackets == order:
                if branch_start_index is not None and branch_start_index + 1 != i:
                    group_counter += 1
                res.append(group_counter)
                last_non_nested_group = group_counter
            elif open_brackets < order:
                res.append(0)
            elif open_brackets > order:
                res.append(last_non_nested_group)

    mol = Chem.MolFromSmiles(s)
    num_atoms = mol.GetNumAtoms()
    while len(res) < num_atoms:
        res.append(0)

    return res

df = pd.DataFrame(
    {'SMILES': [
        "CCC(CC(C)CC)(C)C",
        "CCC[I+](C)(C)C",
        "CCC(CCC(C))C"
    ]})
df['Indexed SMILES'] = df['SMILES'].apply(lambda x: smile_grouping(x, order=1))
print(df)
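To try other orders without RDKit or pandas, the same logic can be run with the padding step removed. This is a sketch under that assumption; the constant name SKIPPED is introduced only for illustration, and the examples are the ones given in the question and its comments.

```python
# Characters that are not counted as atoms (same skip list as the code above).
SKIPPED = set("[]+-0123456789ie")

def smile_grouping(s, order):
    group_counter = 1
    res = []
    open_brackets = 0
    branch_start_index = None
    last_non_nested_group = None

    for i, letter in enumerate(s):
        if letter == '(':
            open_brackets += 1
            if open_brackets == order:
                branch_start_index = i
                if last_non_nested_group is not None:
                    group_counter = last_non_nested_group + 1
        elif letter == ')':
            if open_brackets == order:
                last_non_nested_group = None
            open_brackets -= 1
            if open_brackets == order - 1:
                # Leaving a branch of the requested order: reset all state.
                branch_start_index = None
                group_counter = 1
                last_non_nested_group = None
        elif letter not in SKIPPED:
            if open_brackets == order:
                if branch_start_index is not None and branch_start_index + 1 != i:
                    group_counter += 1
                res.append(group_counter)
                last_non_nested_group = group_counter
            elif open_brackets < order:
                # Atoms in shallower parentheses (or none) are 0.
                res.append(0)
            else:
                # Deeper atoms inherit the value of the atom they branch off.
                res.append(last_non_nested_group)
    return res

print(smile_grouping("CCC(CC(C)CC)(C)C", order=1))  # [0, 0, 0, 1, 2, 2, 3, 4, 1, 0]
print(smile_grouping("CC(C(C(C)C)C)CC", order=2))   # [0, 0, 0, 1, 1, 2, 0, 0, 0]
print(smile_grouping("C(C(C))C", order=2))          # [0, 0, 1, 0]
```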
