
I have a list of 2215 molecules encoded as 2048-bit vectors, and I am trying to create a 2D array from it. I am using the rdkit library to convert the vectors to numpy arrays. The code worked without a problem a few weeks ago, but now it raises a memory error and I can't figure out why. Can anyone provide a solution?

I tried making the list smaller and reduced it to two vectors. I thought that would help, but the error still appears after some processing time. Since even two vectors fail, I believe I do in fact have enough memory.

# red_fp is the list of bit vectors

def rdkit_numpy_convert(red_fp):
    output = []
    for f in fp:
        arr = np.zeros((1,))
        DataStructs.ConvertToNumpyArray(f, arr)
        output.append(arr)
    return np.asarray(output)

# this one line causes the problem
x = rdkit_numpy_convert(red_fp)

This is the error:

MemoryError  Traceback (most recent call last)
MemoryError: cannot allocate memory for array

The above exception was the direct cause of the following exception:

SystemError  Traceback (most recent call last)
<ipython-input-14-91594513666c> in <module>
----> 1 x = rdkit_numpy_convert(red_fp)

<ipython-input-13-78d1c9fdd07e> in rdkit_numpy_convert(red_fp)
      4     for f in fp:
      5         arr = np.zeros((1,))
----> 6         DataStructs.ConvertToNumpyArray(f, arr)
      7         output.append(arr)
      8     return np.asarray(output)

SystemError: <Boost.Python.function object at 0x55a2a5743520> returned a result with an error set
Alexander Rossa
Jozef
  • Can you share an example definition of the `red_fp` that you are using? Also, mentioning that you are using rdkit and tagging the question as such would attract the right people to help you with the question. – Alexander Rossa Jul 07 '19 at 16:12
  • I'm not sure what you mean by example definition. One fingerprint is just 2048 integers (either 0 or 1, mostly 0). fp is then a list of 2215 fingerprints, red_fp is just a small number of fingerprints to check if it'd worked with smaller amount of information – Jozef Jul 07 '19 at 16:37
  • The function is not using `red_fp` but `fp`. Where does `fp` come from? Is it in the global scope? Is this your intention? Perhaps you want the function to use `red_fp`: `for f in red_fp:` – RvdBerg Jul 07 '19 at 17:01
  • It was first written with fp. I didn't notice it, now it is fixed (for f in red_fp) but the error persists. – Jozef Jul 07 '19 at 17:25
  • For me your code works with 5979 molecules, RDKit 2019.03.3, numpy 1.16.4, Python 3.7.3, Windows 7 and 4GB RAM. – rapelpy Jul 08 '19 at 04:13

2 Answers


I believe that your problem is that the fingerprints you are using are not compatible with this method for converting to numpy arrays.

I am not sure what type of fingerprint you are using, but assuming they are Morgan fingerprints, I ran some quick experiments: the conversion hangs when the fingerprint comes from 'GetMorganFingerprint' but not when it comes from 'GetMorganFingerprintAsBitVect'. I first assumed this was because the former returns a UIntSparseIntVect rather than an ExplicitBitVect. However, a fingerprint produced by 'GetHashedMorganFingerprint', which also returns a UIntSparseIntVect, converts fine.

If you are using Morgan fingerprints, I suggest trying the 'GetMorganFingerprintAsBitVect' method.
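A minimal sketch of that suggestion, assuming the molecules are available as SMILES strings (the `smiles` list and the `fingerprints_to_array` helper are hypothetical names, not from the question). It builds 2048-bit ExplicitBitVect fingerprints and stacks them into a 2D array, using the zero-length int8 destination array seen in other rdkit answers. The type check is only an illustrative guard and would also reject hashed count fingerprints.

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem


def fingerprints_to_array(fps):
    """Stack RDKit ExplicitBitVect fingerprints into a 2D numpy array."""
    rows = []
    for fp in fps:
        # Illustrative guard: fail early instead of trying to convert
        # an unfolded sparse fingerprint.
        if not isinstance(fp, DataStructs.ExplicitBitVect):
            raise TypeError(f"expected ExplicitBitVect, got {type(fp).__name__}")
        arr = np.zeros((0,), dtype=np.int8)
        DataStructs.ConvertToNumpyArray(fp, arr)
        rows.append(arr)
    return np.asarray(rows)


# Hypothetical input: two small molecules given as SMILES strings.
smiles = ["c1ccccc1", "CCO"]
mols = [Chem.MolFromSmiles(s) for s in smiles]
fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048) for m in mols]

x = fingerprints_to_array(fps)
print(x.shape)
```

With 2215 molecules the same call should give one row per molecule and one column per bit.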

Edit:

I did a couple more experiments:

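from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles('c1ccccc1')

# Unfolded Morgan fingerprint (UIntSparseIntVect)
fp = AllChem.GetMorganFingerprint(mol, 2)
print(fp.GetLength())
# 4294967295

# Folded bit vector (ExplicitBitVect)
fp1 = AllChem.GetMorganFingerprintAsBitVect(mol, 2)
print(fp1.GetNumBits())
# 2048

# Hashed count fingerprint (UIntSparseIntVect)
fp2 = AllChem.GetHashedMorganFingerprint(mol, 2)
print(fp2.GetLength())
# 2048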

As you can see, the fingerprint from the first method is huge. My initial thought is that it is in an unfolded state, which is why a sparse data structure is used. This would explain the problems allocating memory for a fingerprint of this dimension.
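To put rough numbers on this, here is a back-of-the-envelope sketch. It assumes a dense array with one 8-byte float64 element per fingerprint position (the default dtype of `np.zeros`); whether RDKit actually attempts exactly this allocation is an assumption. `dense_bytes` is a hypothetical helper.

```python
def dense_bytes(n_elements, itemsize=8):
    """Bytes needed for a dense array of n_elements items of itemsize bytes."""
    return n_elements * itemsize


unfolded = dense_bytes(4294967295)   # one unfolded Morgan fingerprint
folded = dense_bytes(2048)           # one 2048-bit fingerprint
full_set = dense_bytes(2215 * 2048)  # all 2215 folded fingerprints

print(f"unfolded: {unfolded / 2**30:.1f} GiB")
print(f"folded:   {folded / 2**10:.1f} KiB")
print(f"full set: {full_set / 2**20:.1f} MiB")
```

Under these assumptions a single unfolded fingerprint needs about 32 GiB, while the whole folded data set fits in a few tens of MiB. That is consistent with the error appearing even when the list is reduced to two vectors.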

Oliver Scott

This is the first time I've heard of rdkit, but it looks like this is a Boost wrapper for C++ code.

From the docs, https://www.rdkit.org/docs/source/rdkit.DataStructs.cDataStructs.html

the second argument to ConvertToNumpyArray is destArray.

rdkit.DataStructs.cDataStructs.ConvertToNumpyArray((ExplicitBitVect)bv, 
    (AtomPairsParameters)destArray) → None

My guess is that this function tries to put the converted values into the destArray. It isn't trying to allocate new memory itself (as a conventional numpy constructor would), but rather just fill the array that it was given.

If that guess is right, then the error is in the

arr = np.zeros((1,))

That arr only has space for one float, 8 bytes. arr needs to be big enough (and the right dtype) to hold the result produced by Convert.

Is there any documentation or examples illustrating the use of this conversion? When asking questions about low traffic tags like [rdkit] it helps if you include some links to documentation and example code.


I glanced at other [rdkit] questions on SO.

How can I compute a Count Morgan fingerprint as numpy.array?

suggests that I'm wrong. The accepted answer uses

np.zeros((0,), dtype=np.int8)

which allocates 0 bytes to its data buffer.
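The buffer sizes mentioned here can be checked with plain numpy, as in the short sketch below. The last step only shows that a numpy array can be resized in place, which would be consistent with a zero-length destination array working. It is not a claim about what ConvertToNumpyArray does internally.

```python
import numpy as np

one_float = np.zeros((1,))             # default dtype is float64
empty = np.zeros((0,), dtype=np.int8)  # zero-length destination array

print(one_float.dtype, one_float.nbytes)
print(empty.dtype, empty.nbytes)

# In-place resize to 2048 elements; refcheck=False skips the
# reference-count check that can fail in interactive sessions.
empty.resize(2048, refcheck=False)
print(empty.shape, empty.nbytes)
```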

And another that uses np.zeros((1,))

ValueError when doing validation with random forests

hpaulj