I am calculating the structural similarity profile between pairs of molecules using rdkit. When I run the program in Google Colab (rdkit=2020.09.2, python=3.7), it works fine.
When I run it on my PC (rdkit=2021.03.2, python=3.8.5), I get an error. The error is a bit strange: the dataframe contains 500 rows, and the code works only for the first 10 rows (0-9). For later rows I get this error:
s = DataStructs.BulkTanimotoSimilarity(fps_2[n], fps_2[n+1:])
ValueError: BitVects must be same length
The relevant block of code is given below:
import os

import pandas as pd
from pandas import DataFrame
from rdkit import Chem, DataStructs
from rdkit.Chem.Fingerprints import FingerprintMols

data = pd.read_csv(os.path.join(os.getcwd(), "dataset", "test_ssp.csv"), index_col=None)

# Proof the SMILES (canonicalize) and make a list of ids and SMILES
c_smiles = []
count = 0
for index, row in data.iterrows():
    try:
        cs = Chem.CanonSmiles(row['SMILES'])
        c_smiles.append([row['ID_Name'], cs])
    except Exception:
        count = count + 1
        print('Count Invalid SMILES:', count, row['ID_Name'], row['SMILES'])

# make a list of id, smiles, and mols
ms = []
df = DataFrame(c_smiles, columns=['ID_Name', 'SMILES'])
for index, row in df.iterrows():
    mol = Chem.MolFromSmiles(row['SMILES'])
    ms.append([row['ID_Name'], row['SMILES'], mol])

# make a list of id, smiles, mols, and fingerprints (fp)
fps = []
df_fps = DataFrame(ms, columns=['ID_Name', 'SMILES', 'mol'])
df_fps.head()
for index, row in df_fps.iterrows():
    fps_cal = FingerprintMols.FingerprintMol(row['mol'])
    fps.append([row['ID_Name'], fps_cal])

# keep only the fingerprint column as a plain list
fps_2 = DataFrame(fps, columns=['ID_Name', 'fps'])
fps_2 = fps_2[fps_2.columns[1]]
fps_2 = fps_2.values.tolist()

# result lists (their initialization is not shown in my original excerpt)
qu, ta, sim = [], [], []

# compare all fp pairwise without duplicates
# NOTE: c_smiles2 is defined elsewhere in my script and is not shown here
for n in range(len(fps_2)):
    s = DataStructs.BulkTanimotoSimilarity(fps_2[n], fps_2[n+1:])  # <-- error is raised here
    for m in range(len(s)):
        qu.append(c_smiles2[n])
        ta.append(c_smiles2[n+1:][m])
        sim.append(s[m])
Can you tell me why I am getting this error on my PC while the code works fine in Google Colab? How can I solve the issue? Is there any way to install rdkit=2020.09.2?
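Since the message says the bit vectors differ in length, one way to narrow this down is to check which rows produce a fingerprint of a different size. The following is only a minimal sketch: `group_by_length` is a hypothetical helper, and the lists below are stand-ins for the RDKit fingerprints with made-up lengths, not values measured from my data. For the real `fps_2` list, passing `size=lambda fp: fp.GetNumBits()` should give the bit-vector lengths.

```python
from collections import defaultdict


def group_by_length(fingerprints, size=len):
    """Map each fingerprint length to the row indices that have that length."""
    groups = defaultdict(list)
    for idx, fp in enumerate(fingerprints):
        groups[size(fp)].append(idx)
    return dict(groups)


# Stand-in data (assumption): plain lists instead of RDKit bit vectors.
# Ten "fingerprints" of one length followed by one of a different length.
fake_fps = [[0] * 2048 for _ in range(10)] + [[0] * 1024]

lengths = group_by_length(fake_fps)
for n_bits, rows in lengths.items():
    print(n_bits, "bits -> rows", rows)
```

If more than one length shows up for the real fingerprints, the rows listed under the minority length are the ones that break `BulkTanimotoSimilarity`.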
Reproducible Data
DB00607 [H][C@]12SC(C)(C)[C@@H](N1C(=O)[C@H]2NC(=O)C1=C(OCC)C=CC2=CC=CC=C12)C(O)=O
DB01059 CCN1C=C(C(O)=O)C(=O)C2=CC(F)=C(C=C12)N1CCNCC1
DB09128 O=C1NC2=CC(OCCCCN3CCN(CC3)C3=C4C=CSC4=CC=C3)=CC=C2C=C1
DB04908 FC(F)(F)C1=CC(=CC=C1)N1CCN(CCN2C(=O)NC3=CC=CC=C23)CC1
DB09083 COC1=C(OC)C=C2[C@@H](CN(C)CCCN3CCC4=CC(OC)=C(OC)C=C4CC3=O)CC2=C1
DB08820 CC(C)(C)C1=CC(=C(O)C=C1NC(=O)C1=CNC2=CC=CC=C2C1=O)C(C)(C)C
DB08815 [H][C@@]12[C@H]3CC[C@H](C3)[C@]1([H])C(=O)N(C[C@@H]1CCCC[C@H]1CN1CCN(CC1)C1=NSC3=CC=CC=C13)C2=O
DB09143 [H][C@]1(C)CN(C[C@@]([H])(C)O1)C1=CC=C(NC(=O)C2=CC=CC(=C2C)C2=CC=C(OC(F)(F)F)C=C2)C=N1
DB06237 COC1=C(Cl)C=C(CNC2=C(C=NC(=N2)N2CCC[C@H]2CO)C(=O)NCC2=NC=CC=N2)C=C1
DB01166 O=C1CCC2=C(N1)C=CC(OCCCCC1=NN=NN1C1CCCCC1)=C2
DB00813 CCC(=O)N(C1CCN(CCC2=CC=CC=C2)CC1)C1=CC=CC=C1
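To reproduce the problem, the rows above need to be turned into the CSV layout my code expects. The sketch below assumes each row is an ID and a SMILES string separated by whitespace, and uses the column names `ID_Name` and `SMILES` from the code. `rows_to_csv` is a hypothetical helper, and only two of the rows are pasted in to keep it short.

```python
import csv
import io

# Two rows copied from the data above; paste the remaining rows the same way.
raw = """\
DB01059 CCN1C=C(C(O)=O)C(=O)C2=CC(F)=C(C=C12)N1CCNCC1
DB00813 CCC(=O)N(C1CCN(CCC2=CC=CC=C2)CC1)C1=CC=CC=C1
"""


def rows_to_csv(text):
    """Convert 'ID SMILES' lines into CSV text with an ID_Name,SMILES header."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["ID_Name", "SMILES"])
    for line in text.splitlines():
        if line.strip():  # skip blank lines
            id_name, smiles = line.split(None, 1)
            writer.writerow([id_name, smiles.strip()])
    return out.getvalue()


csv_text = rows_to_csv(raw)
print(csv_text)
```

The resulting text can be saved as `dataset/test_ssp.csv`, or passed directly to pandas as `pd.read_csv(io.StringIO(csv_text))`.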