
Hi, I want to group identical molecular structures using their SMILES codes.

However, even when the structures are the same, they are difficult to group because the dummy atoms are labelled differently.

I'm using RDKit (version 2022.3.4) and have tried changing several options, but I haven't found a solution yet. I would appreciate your help.

Example SMILES (same structure but different SMILES codes -> desired format):

  1. [1*]C(=O)OC, [13*]C(=O)OC -> *C(=O)OC
  2. [31*]C1=CC=CC2=C1C=CC=N2, [5*]C1=CC=CC2=C1C=CC=N2 -> *C1=CC=CC2=C1C=CC=N2
  3. [45*]C(N)=O, [5*]C(N)=O, [19*]C(N)=O, [16*]C(N)=O -> *C(N)=O
Park

1 Answer

It sounds a little weird, but you can replace AnyAtom with AnyAtom: each labelled dummy atom is swapped for an unlabelled one.

You can use ReplaceSubstructs() for this.

from rdkit import Chem

smiles = ['[1*]C(=O)OC', '[13*]C(=O)OC',
          '[31*]C1=CC=CC2=C1C=CC=N2', '[5*]C1=CC=CC2=C1C=CC=N2',
          '[45*]C(N)=O', '[5*]C(N)=O', '[19*]C(N)=O', '[16*]C(N)=O']

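search_patt = Chem.MolFromSmiles('*')  # matches dummy atoms, with or without a number label
sub_patt = Chem.MolFromSmiles('*')     # replacement: dummy atom without a number label

for s in smiles:
    # sanitize=False keeps the input as written (no aromaticity perception)
    m = Chem.MolFromSmiles(s, sanitize=False)
    # ReplaceSubstructs returns a tuple of molecules; take the first one
    new_m = Chem.ReplaceSubstructs(m, search_patt, sub_patt, replaceAll=True)
    print(s, '-->', Chem.MolToSmiles(new_m[0], kekuleSmiles=True))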

Output:

[1*]C(=O)OC --> *C(=O)OC
[13*]C(=O)OC --> *C(=O)OC
[31*]C1=CC=CC2=C1C=CC=N2 --> *C1=CC=CC2=C1C=CC=N2
[5*]C1=CC=CC2=C1C=CC=N2 --> *C1=CC=CC2=C1C=CC=N2
[45*]C(N)=O --> *C(N)=O
[5*]C(N)=O --> *C(N)=O
[19*]C(N)=O --> *C(N)=O
[16*]C(N)=O --> *C(N)=O
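
For the grouping step itself, here is a minimal sketch that uses only the standard library. It is a different, string-level technique from ReplaceSubstructs(): it assumes every dummy atom is written exactly as [<number>*] and that the SMILES strings are otherwise identical, as in the examples above. The names strip_dummy_labels and group_by_unlabelled are made up for illustration. If the strings can differ in other ways (atom order, aromatic vs. Kekulé form), run them through RDKit first and group on its output instead.

```python
import re
from collections import defaultdict

# Assumption: a labelled dummy atom is written exactly as [<digits>*],
# e.g. [13*]. Other bracket forms are left untouched.
LABELLED_DUMMY = re.compile(r"\[\d+\*\]")


def strip_dummy_labels(smiles):
    """Replace every [<number>*] with a bare * (hypothetical helper)."""
    return LABELLED_DUMMY.sub("*", smiles)


def group_by_unlabelled(smiles_list):
    """Map each label-free SMILES string to the original strings behind it."""
    groups = defaultdict(list)
    for s in smiles_list:
        groups[strip_dummy_labels(s)].append(s)
    return dict(groups)


smiles = ['[1*]C(=O)OC', '[13*]C(=O)OC',
          '[31*]C1=CC=CC2=C1C=CC=N2', '[5*]C1=CC=CC2=C1C=CC=N2',
          '[45*]C(N)=O', '[5*]C(N)=O', '[19*]C(N)=O', '[16*]C(N)=O']

groups = group_by_unlabelled(smiles)
for key, members in groups.items():
    print(key, '<--', members)
```

With the eight example strings this gives three groups, one per desired code format in the question.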
rapelpy
  • Thank you for your help. It seems the numbers are there because the SMILES codes were created by removing parts of an existing full molecule. – Park Dec 23 '22 at 01:15