I'm trying to convert chemical structures to ECFP data, but I have a problem with the folding step.

I understand the whole process of generating ECFP data from D. Rogers and M. Hahn's paper (J. Chem. Inf. Model., Vol. 50, No. 5, 2010).

I used the pinky module in Python to compute the ECFP of each molecule (https://github.com/ubccr/pinky/blob/master/pinky/fingerprints/ecfp.py).

The output of this function is as follows:

{6456320269923861509: 1,
 -3040533427843102467: 2,
 -7329542376511023568: 1,
 -5821485132112031149: 1,
 -643847807504931861: 1,
 3054809300354049582: 1,
 -3679727481768249355: 1,
 -2240115528993944325: 1,
 5159885938473603439: 1,
 1268207003089618622: 1,
 267156486644197995: 1,
 6401915128722912935: 1,
 -8944122298402911035: 1,
 -7116035920000285502: 1}

I know what it is and what it means, but I don't know how to convert this data into binary form.

On this website (https://docs.chemaxon.com/display/docs/extended-connectivity-fingerprint-ecfp.md), the above identifiers are converted to a fixed-length bit string (the folding process).

How can I convert the above atom identifiers to a fixed-length bit string?

Also, can anyone suggest an appropriate hash function for the ECFP method?

1 Answer

I don't believe you need a hash function here, as the keys in the dictionary you have shown already appear to be the hashes of the atomic neighbourhoods. I believe representing this as a fixed-length bit vector is as simple as bit_index = hash % n_bits. The following assumes you are using only standard modules and that the variable hash_dict holds the output you have shown:

n_bits = 1024  # Number of bits in fixed-length fingerprint
fp = [0 for _ in range(n_bits)]  # The fingerprint as a python list

# I ignore the counts here for a binary output
for nbrhood_hash in hash_dict.keys():
    bit = nbrhood_hash % n_bits
    fp[bit] = 1

# Take a look at non-zero indexes
indexes = [ix for ix, bit in enumerate(fp) if bit > 0]
indexes

>>> [5, 194, 197, 251, 253, 367, 558, 560, 595, 619, 679, 702, 1003, 1013]
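
Since the question asks for a bit string, the same modulo folding can be wrapped in small helpers. This is only a minimal sketch in plain Python: the names fold_to_bits and bits_to_string and the toy dictionary are made up for illustration and are not part of pinky or RDKit. The toy input also shows that two different identifiers can fold onto the same bit (a collision), and that Python's % returns a non-negative index even for negative identifiers, which is not true of the remainder operator in every language.

```python
def fold_to_bits(hash_dict, n_bits=1024):
    """Fold sparse identifiers into a fixed-length list of 0/1 bits."""
    bits = [0] * n_bits
    for identifier in hash_dict:
        # With a positive modulus, Python's % always lands in [0, n_bits),
        # even when the identifier is negative.
        bits[identifier % n_bits] = 1
    return bits


def bits_to_string(bits):
    """Render the bit list as a string of '0' and '1' characters."""
    return "".join(str(b) for b in bits)


# Toy identifiers (made up for illustration, not real ECFP output)
toy = {5: 1, 1029: 2, -1: 1}

toy_bits = fold_to_bits(toy, n_bits=1024)
on_bits = [i for i, b in enumerate(toy_bits) if b]
print(on_bits)  # [5, 1023]  (5 and 1029 collide on bit 5)

toy_string = bits_to_string(fold_to_bits(toy, n_bits=8))
print(toy_string)  # 00000101
```

A smaller n_bits gives a shorter fingerprint but makes collisions more likely, so the length is a trade-off rather than a fixed rule.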

I believe this way is equivalent(ish) to the RDKit package:

from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles('CC(C)Oc1ccc(-c2nc(-c3cccc4c3CC[C@H]4NCCC(=O)O)no2)cc1C#N')

# Sparse ECFP
fp_sparse = AllChem.GetMorganFingerprint(mol, 2)

# BitVector ECFP (fixed length)
fp_bitvect = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=n_bits)

# Convert hashes from sparse fingerprint into fixed-length indices
bit_set = set()
for nbrhood_hash in fp_sparse.GetNonzeroElements().keys():
    bit = nbrhood_hash % n_bits  # Same as before
    bit_set.add(bit)

# Check these are equivalent to the rdkit fixed-length fingerprint
set(fp_bitvect.GetOnBits()) == bit_set

>>> True
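
If the occurrence counts stored as dictionary values matter, the same modulo folding can accumulate counts instead of setting bits. Again, this is only a sketch in plain Python under that assumption: fold_counts and the toy dictionary are hypothetical names and data, and no claim is made that any particular package folds counts exactly this way.

```python
from collections import Counter


def fold_counts(hash_dict, n_bits=1024):
    """Fold {identifier: count} into a fixed-length list of counts."""
    folded = Counter()
    for identifier, count in hash_dict.items():
        # Identifiers that collide on the same index have their counts summed.
        folded[identifier % n_bits] += count
    # Indexes that were never hit read back as 0 from the Counter.
    return [folded[i] for i in range(n_bits)]


# Toy identifiers (made up for illustration, not real ECFP output)
toy = {5: 1, 1029: 2, -1: 1}
count_vector = fold_counts(toy, n_bits=8)
print(count_vector)  # [0, 0, 0, 0, 0, 3, 0, 1]
```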
Oliver Scott
  • +1 nice answer! Are you able to answer any of the Rdkit questions here: https://mattermodeling.stackexchange.com/questions/tagged/rdkit ? – Nike Apr 02 '21 at 23:02