
I am calculating the structure similarity profile between two molecules using RDKit. When I run the program in Google Colab (rdkit=2020.09.2, python=3.7), it works fine.

I get an error when I run it on my PC (rdkit=2021.03.2, python=3.8.5). The error is a bit strange: the dataframe contains 500 rows, but the code works only for the first 10 rows (0-9), and for later rows I get this error:

 s = DataStructs.BulkTanimotoSimilarity(fps_2[n], fps_2[n+1:]) 
    ValueError: BitVects must be same length

The block of code is given below

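  import os

  import pandas as pd
  from pandas import DataFrame
  from rdkit import Chem, DataStructs
  from rdkit.Chem.Fingerprints import FingerprintMols

  data = pd.read_csv(os.path.join(os.getcwd(), "dataset", "test_ssp.csv"), index_col=None)

  # Proof the SMILES (canonicalize) and make a list of ID and SMILES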
  c_smiles = []
  count = 0
  for index, row in data.iterrows():
    try:
      cs = Chem.CanonSmiles(row['SMILES'])
      c_smiles.append([row['ID_Name'], cs])
    except:
      count = count + 1
      print('Count Invalid SMILES:', count, row['ID_Name'], row['SMILES'])

  # make a list of id, smiles, and mols
  ms = []
  df = DataFrame(c_smiles,columns=['ID_Name','SMILES'])
  for index, row in df.iterrows():
    mol = Chem.MolFromSmiles(row['SMILES'])
    ms.append([row['ID_Name'], row['SMILES'], mol])

  # make a list of id, smiles, mols, and fingerprints (fp)
  fps = []
  df_fps = DataFrame(ms,columns=['ID_Name','SMILES', 'mol'])
  df_fps.head()

  for index, row in df_fps.iterrows():
    fps_cal = FingerprintMols.FingerprintMol(row['mol'])
    fps.append([row['ID_Name'], fps_cal])


  fps_2 = DataFrame(fps,columns=['ID_Name','fps'])
  fps_2 = fps_2[fps_2.columns[1]]
  fps_2 = fps_2.values.tolist()


  # compare all fp pairwise without duplicates
  # (c_smiles2 is not defined in this excerpt)
  qu, ta, sim = [], [], []
  for n in range(len(fps_2)):
      s = DataStructs.BulkTanimotoSimilarity(fps_2[n], fps_2[n+1:])
      for m in range(len(s)):
          qu.append(c_smiles2[n])
          ta.append(c_smiles2[n+1:][m])
          sim.append(s[m])

Can you tell me why I get this error on my PC while the same code works fine in Google Colab? How can I solve the issue? Is there any way to install rdkit=2020.09.2?

Reproducible Data

DB00607 [H][C@]12SC(C)(C)[C@@H](N1C(=O)[C@H]2NC(=O)C1=C(OCC)C=CC2=CC=CC=C12)C(O)=O
DB01059 CCN1C=C(C(O)=O)C(=O)C2=CC(F)=C(C=C12)N1CCNCC1
DB09128 O=C1NC2=CC(OCCCCN3CCN(CC3)C3=C4C=CSC4=CC=C3)=CC=C2C=C1
DB04908 FC(F)(F)C1=CC(=CC=C1)N1CCN(CCN2C(=O)NC3=CC=CC=C23)CC1
DB09083 COC1=C(OC)C=C2[C@@H](CN(C)CCCN3CCC4=CC(OC)=C(OC)C=C4CC3=O)CC2=C1
DB08820 CC(C)(C)C1=CC(=C(O)C=C1NC(=O)C1=CNC2=CC=CC=C2C1=O)C(C)(C)C
DB08815 [H][C@@]12[C@H]3CC[C@H](C3)[C@]1([H])C(=O)N(C[C@@H]1CCCC[C@H]1CN1CCN(CC1)C1=NSC3=CC=CC=C13)C2=O
DB09143 [H][C@]1(C)CN(C[C@@]([H])(C)O1)C1=CC=C(NC(=O)C2=CC=CC(=C2C)C2=CC=C(OC(F)(F)F)C=C2)C=N1
DB06237 COC1=C(Cl)C=C(CNC2=C(C=NC(=N2)N2CCC[C@H]2CO)C(=O)NCC2=NC=CC=N2)C=C1
DB01166 O=C1CCC2=C(N1)C=CC(OCCCCC1=NN=NN1C1CCCCC1)=C2
DB00813 CCC(=O)N(C1CCN(CCC2=CC=CC=C2)CC1)C1=CC=CC=C1
Opps_0
  • Can you add the shape of the bit vectors for which you're getting the error? – betelgeuse Jun 08 '21 at 07:18
  • @mnis thanks for your comment. I have checked the shape of the bit vectors in both (colab and my pc). They showed the same len for the bit vector – Opps_0 Jun 08 '21 at 14:33
  • @mnis first [1:20] len of the bit vectors `2048 2048 2048 2048 2048 2048 2048 2048 2048 1024 2048 2048 2048 2048 2048 2048 2048 1024 2048`. The same result from colab and PC. But colab is working fine but pc is showing the error – Opps_0 Jun 08 '21 at 14:34

2 Answers


To answer first how to install a specific version of RDKit, you can run this command:

conda install -c rdkit rdkit=2020.09.2

As for the original question, the error comes from this function:

FingerprintMols.FingerprintMol()

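It converts the first 10 SMILES to 2048-bit vectors but the 11th SMILES to a 1024-bit vector. As far as I can tell, this is because the function, when called without keyword arguments, folds sparse fingerprints toward a target bit density, so the length can differ from molecule to molecule. The older versions are able to handle this mismatch but newer versions can't. There are two options to fix this: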

  1. Downgrade RDKit to an older version using the command I mentioned above.
  2. Fix the length of the vector by passing it as an argument. Basically, replace the line
FingerprintMols.FingerprintMol(row['mol'])

with

FingerprintMols.FingerprintMol(row['mol'], minPath=1, maxPath=7, fpSize=2048,
                               bitsPerHash=2, useHs=True, tgtDensity=0.0,
                               minSize=128)

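In the replacement, fpSize is fixed to 2048 and tgtDensity=0.0 turns off folding, so every fingerprint comes out with the same length. The other values shown are, to my knowledge, the defaults of Chem.RDKFingerprint. Please note that you must pass all the arguments and not just fpSize.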
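To confirm which fingerprints have a different length before calling BulkTanimotoSimilarity, a small check along these lines may help. It is only a sketch: the helper name `group_indices_by_length` is made up for illustration, and it assumes `len()` works on the fingerprint objects (the lengths quoted in the comments on the question were obtained that way; `GetNumBits()` should report the same number).

```python
from collections import defaultdict


def group_indices_by_length(fps):
    """Map each fingerprint length to the list of positions that have it.

    `fps` is any sequence of objects supporting len(), for example the
    fps_2 list from the question (assumption: len() works on bit vectors).
    """
    groups = defaultdict(list)
    for i, fp in enumerate(fps):
        groups[len(fp)].append(i)
    return dict(groups)
```

Calling `group_indices_by_length(fps_2)` on the list from the question should return more than one key if the lengths are mixed, and the index lists show which rows are affected.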

betelgeuse
  • Thank you very much. It worked. I spent 2 days! Now, I would like to know 2 things, 1) This will not change the score (Tanimoto score). Right? Because, in previous, we had different bit vectors and now we have same bit vectors! 2) As I install rdkit using `pip install rdkit-pypi`. Is there anyway to downgrade the rdkit using `pip`. I don't have `conda`! – Opps_0 Jun 09 '21 at 13:56
  • 1
    It won't change the score to the best of my knowledge. You can install the two versions in two separate environments and then compare just to be sure. To install any specific version of rdkit, you can go to this [link](https://pypi.org/project/rdkit-pypi/#history) and then choose your version. In the top left side, you'll see the pip command to install that version. Something like `pip install rdkit-pypi==` – betelgeuse Jun 10 '21 at 04:34

Just to extend mnis's answer: since FingerprintMol defaults to RDKFingerprint, you may find it easier to use that directly. It is much more flexible, and you will not have to supply all the arguments. Tested on version 2021.03.3.

Chem.RDKFingerprint(row['mol'], fpSize=2048)
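Putting this together with the pairwise loop from the question, a minimal sketch could look like the following. The function names `fingerprint_records` and `pairwise_similarities` are hypothetical and not part of RDKit. The similarity function is passed in as an argument so the loop can be tried without RDKit installed; with RDKit you would pass `DataStructs.BulkTanimotoSimilarity`.

```python
def fingerprint_records(records, fp_size=2048):
    """Return (ids, fps) for every (id, smiles) pair that RDKit can parse.

    RDKit is imported inside the function so that the pairwise helper
    below can be used on its own.
    """
    from rdkit import Chem

    ids, fps = [], []
    for mol_id, smiles in records:
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:  # unparseable SMILES: skip this record
            continue
        ids.append(mol_id)
        # fpSize is passed explicitly, as in the line above
        fps.append(Chem.RDKFingerprint(mol, fpSize=fp_size))
    return ids, fps


def pairwise_similarities(ids, fps, bulk_similarity):
    """Compare each fingerprint with all later ones (no duplicate pairs).

    bulk_similarity(fp, list_of_fps) must return one score per list entry.
    Returns a list of (query_id, target_id, score) tuples.
    """
    rows = []
    for n in range(len(fps) - 1):
        scores = bulk_similarity(fps[n], fps[n + 1:])
        for offset, score in enumerate(scores, start=n + 1):
            rows.append((ids[n], ids[offset], score))
    return rows
```

With RDKit available, usage would be roughly `ids, fps = fingerprint_records(records)` followed by `pairwise_similarities(ids, fps, DataStructs.BulkTanimotoSimilarity)`, where `records` is a list of (ID_Name, SMILES) pairs.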
Oliver Scott