
I would like to use RDKit to generate count-based Morgan fingerprints and feed them to a scikit-learn model (in Python). However, I don't know how to get the fingerprint as a NumPy array. When I use

from rdkit import Chem
from rdkit.Chem import AllChem
m = Chem.MolFromSmiles('c1cccnc1C')
fp = AllChem.GetMorganFingerprint(m, 2, useCounts=True)

I get a UIntSparseIntVect that I would need to convert. The only thing I found was cDataStructs (see: http://rdkit.org/docs/source/rdkit.DataStructs.cDataStructs.html), but this does not currently support UIntSparseIntVect.

denfromufa
evilolive
  • It seems you have to convert the counts by yourself. You can get the counts with fp.GetNonzeroElements() `{98513984: 2, 422715066: 1, 951226070: 1, 1100037548: 1, 1207774339: 1, 1235524787: 1, 1751362425: 1, 2041434490: 1, 2246728737: 1, 2614860224: 1, 3217380708: 1, 3218693969: 4, 3776905034: 1, 3999906991: 1, 4036277955: 1, 4048591891: 1}` – rapelpy Feb 21 '19 at 18:12
  • I saw that but how would I fold that to a fp of reasonable size, like 1024 digits? – evilolive Feb 21 '19 at 18:42
  • `fp = AllChem.GetHashedMorganFingerprint(m, 2, nBits=1024)` Or do you want the bits (0 and 1)? `fp = AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=1024)` This could be converted to an array with DataStructs. – rapelpy Feb 21 '19 at 19:11
  • Yes, I want to use counts. `GetHashedMorganFingerprint` looks very good; I could not find it before. However, it creates a UIntSparseIntVect again, so DataStructs does not work. Is there a better option than `fp_dict = fp.GetNonzeroElements()` and then looping over `fp_dict.items()` like so: `for key, val in fp_dict.items(): fp_arr[key] = val`? – evilolive Feb 22 '19 at 08:11
  • What about this `fp_arr = np.array(list(fp_dict.items()))` ? – rapelpy Feb 22 '19 at 18:20
  • No, given `dict = {1:2, 3:4}` and the fp would be 5 bit long, I would want `[0,2,0,4,0]` . Your solution gives `[[1,2],[3,4]]` (Sorry, I do not have rdkit installed on this machine.) Guess I will just go with the loop. May I write the answer or do you want to? – evilolive Feb 23 '19 at 12:51
  • Because the question is about creating an array, it's your answer. – rapelpy Feb 23 '19 at 16:36
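
The folding asked about in the comments can be sketched without RDKit. This is a minimal sketch, assuming that folding means taking each feature ID modulo the target length and summing the counts of any IDs that collide; `fold_counts` is a hypothetical helper name, not an RDKit function. The input dictionary is the one quoted in the first comment. For this molecule the folded positions coincide with the nonzero indices printed in the first answer, but that agreement has only been checked for this one example.

```python
import numpy as np

def fold_counts(sparse_counts, n_bits=1024):
    """Fold a {feature_id: count} dict into a dense count vector of length n_bits.

    Assumption: folding is plain modulo hashing, with colliding counts added up.
    """
    arr = np.zeros(n_bits, dtype=np.int32)
    for feature_id, count in sparse_counts.items():
        arr[feature_id % n_bits] += count
    return arr

# Counts copied from the fp.GetNonzeroElements() output quoted in the comments.
counts = {
    98513984: 2, 422715066: 1, 951226070: 1, 1100037548: 1,
    1207774339: 1, 1235524787: 1, 1751362425: 1, 2041434490: 1,
    2246728737: 1, 2614860224: 1, 3217380708: 1, 3218693969: 4,
    3776905034: 1, 3999906991: 1, 4036277955: 1, 4048591891: 1,
}

folded = fold_counts(counts, n_bits=1024)
print(folded.nonzero())
```

Using `+=` rather than `=` matters once two feature IDs land on the same position, because a plain assignment would silently drop one of the counts.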

2 Answers


Maybe a little late to answer, but these methods work for me.

If you want the bits (0 and 1):

import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles('c1cccnc1C')
fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=1024)
array = np.zeros((0, ), dtype=np.int8)
# ConvertToNumpyArray resizes the array and fills it in place
DataStructs.ConvertToNumpyArray(fp, array)

And back to a fingerprint:

bitstring = "".join(array.astype(str))
fp2 = DataStructs.cDataStructs.CreateFromBitString(bitstring)
assert list(fp.GetOnBits()) == list(fp2.GetOnBits())

If you want the counts:

fp3 = AllChem.GetHashedMorganFingerprint(mol, 2, nBits=1024)
array = np.zeros((0,), dtype=np.int8)
DataStructs.ConvertToNumpyArray(fp3, array)
print(array.nonzero())

Output:

(array([ 19,  33,  64, 131, 175, 179, 356, 378, 428, 448, 698, 707, 726,
   842, 849, 889]),)

And back to a fingerprint (Not sure this is the best way to do this):

def numpy_2_fp(array):
    fp = DataStructs.cDataStructs.UIntSparseIntVect(len(array))
    for ix, value in enumerate(array):
        fp[ix] = int(value)
    return fp

fp4 = numpy_2_fp(array)
assert fp3.GetNonzeroElements() == fp4.GetNonzeroElements()
Oliver Scott
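
Each array above describes a single molecule, while scikit-learn estimators expect a 2-D input of shape (n_samples, n_features). A minimal sketch of the stacking step follows, assuming every fingerprint was generated with the same length; `stack_fingerprints` is a hypothetical helper, and the two vectors are made-up placeholders rather than real RDKit output.

```python
import numpy as np

def stack_fingerprints(arrays):
    """Stack equal-length 1-D fingerprint arrays into an (n_samples, n_features) matrix."""
    lengths = {len(a) for a in arrays}
    if len(lengths) != 1:
        raise ValueError("all fingerprints must have the same length")
    return np.vstack(arrays)

# Placeholder count vectors standing in for per-molecule fingerprints (made-up values).
fp_a = np.zeros(1024, dtype=np.int32)
fp_a[[19, 64]] = [1, 2]
fp_b = np.zeros(1024, dtype=np.int32)
fp_b[849] = 4

X = stack_fingerprints([fp_a, fp_b])
print(X.shape)
```

The resulting `X` can then be passed as the feature matrix to an estimator's `fit` method, with one row per molecule.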

import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

m = Chem.MolFromSmiles('c1cccnc1C')
fp = AllChem.GetHashedMorganFingerprint(m, 2, nBits=1024)
fp_dict = fp.GetNonzeroElements()  # note the lowercase "z"
arr = np.zeros((1024,))
for key, val in fp_dict.items():
    arr[key] = val

It seems there is no direct way to get a NumPy array, so I build it from the dictionary of nonzero elements.
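
The dictionary-to-array step can also be written without an explicit Python loop. The sketch below assumes the keys are already valid positions in the target array (as with a hashed fingerprint of length `n_bits`); `counts_to_dense` is a hypothetical helper name, and the input is the small `{1: 2, 3: 4}` example from the comments on the question rather than real fingerprint output.

```python
import numpy as np

def counts_to_dense(nonzero, n_bits):
    """Turn a {position: count} dict into a dense 1-D count array of length n_bits."""
    arr = np.zeros(n_bits, dtype=np.int32)
    if nonzero:
        # Dict keys are unique, so a single fancy-indexed assignment is safe.
        idx = np.fromiter(nonzero.keys(), dtype=np.int64, count=len(nonzero))
        vals = np.fromiter(nonzero.values(), dtype=np.int32, count=len(nonzero))
        arr[idx] = vals
    return arr

dense = counts_to_dense({1: 2, 3: 4}, n_bits=5)
print(dense.tolist())  # [0, 2, 0, 4, 0]
```

For a 1024-element fingerprint the loop is already fast, so this is mainly a matter of taste.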

evilolive
  • AttributeError: 'UIntSparseIntVect' object has no attribute 'GetNonZeroElements' is what I get when I try to run this – jonalv Jan 12 '21 at 13:48