This wannabe bioinformatician needs your help. The code below finds the similarity of compounds' canonical SMILES, using RDKit. After some research I understand it must be O(n)! (or not?), because a small file of 944 entries took 20 minutes, while the largest one, with 330,000 entries, has been running for over 30 hours. Now, I know that one of its problems is that it doesn't compare each pair of elements only once, which is one factor that slows it down. I read here that you can use the itertools library to make the comparison fast, but more generally, how could this code be made better? Any help would be appreciated while I try to learn :)
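To sanity-check my complexity guess, here is a rough back-of-the-envelope sketch. It assumes the cost per comparison is constant, which may not hold in practice, and the helper names `ordered_pairs` and `unordered_pairs` are just ones I made up. If the nested loops really do n * n comparisons, the 20 minutes for 944 entries would extrapolate to something on the order of 40,000 hours for 330,000 entries, so this looks quadratic rather than linear.

```python
from itertools import combinations


def ordered_pairs(n):
    """Comparisons done by two nested loops over all n rows."""
    return n * n


def unordered_pairs(n):
    """Comparisons needed if each pair is visited exactly once."""
    return n * (n - 1) // 2


small, large = 944, 330_000

# Per-comparison cost implied by the small run (20 minutes in total).
seconds_per_pair = 20 * 60 / ordered_pairs(small)

# Extrapolation to the large file, assuming the same cost per comparison.
estimated_hours = seconds_per_pair * ordered_pairs(large) / 3600

# itertools.combinations yields each unordered pair exactly once.
demo_pairs = list(combinations(range(4), 2))

print(ordered_pairs(small), unordered_pairs(small))
print(round(estimated_hours))
print(demo_pairs)
```

If this is right, visiting each pair only once would halve the work at best, so it cannot be the whole answer for the large file. My actual script follows.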
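from rdkit import Chem
from rdkit import DataStructs
from rdkit.Chem import AllChem
import pandas as pd

l = []   # SMILES strings from column 1 (collected but not used below)
s1 = []  # row index of the first compound in each similar pair
s2 = []  # row index of the second compound in each similar pair
d1 = []  # ID (column 0) of the first compound, for pairs with different IDs
d2 = []  # ID (column 0) of the second compound, for pairs with different IDs

with open('input_file.csv', 'r') as f:
    df = pd.read_csv(f, delimiter=',', lineterminator='\n', header=0)

for i in range(0, df.shape[0]):
    l.append(df.iloc[i, 1])

# Compare every row with every row (both orders, and each row with itself).
for i in range(0, df.shape[0]):
    for j in range(0, df.shape[0]):
        # Both molecules are parsed and fingerprinted again for every pair.
        m1 = Chem.MolFromSmiles(df.iloc[i, 1])
        fp1 = AllChem.GetMorganFingerprint(m1, 2)
        m2 = Chem.MolFromSmiles(df.iloc[j, 1])
        fp2 = AllChem.GetMorganFingerprint(m2, 2)
        sim = DataStructs.DiceSimilarity(fp1, fp2)
        if sim >= 0.99:
            s1.append(i)
            s2.append(j)

# Keep only the similar pairs whose IDs differ.
for k in range(0, len(s1)):
    if df.iloc[s1[k], 0] != df.iloc[s2[k], 0]:
        d1.append(df.iloc[s1[k], 0])
        d2.append(df.iloc[s2[k], 0])

# Write one tab-separated pair of IDs per line.
if len(d1) != 0:
    with open('outputfile.tsv', 'a') as f2:
        for o in range(0, len(d1)):
            f2.write(str(d1[o]) + '\t' + str(d2[o]) + '\n')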
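For reference, this is roughly how I understand the itertools suggestion: build every fingerprint once, then let `itertools.combinations` visit each unordered pair a single time. It is only a sketch and not tested on my real data. The names `similar_pairs` and `toy_dice` are ones I made up, and the set-based Dice score on characters is just a stand-in so the example runs without RDKit. It is not how Morgan fingerprints work.

```python
from itertools import combinations


def toy_dice(a, b):
    """Dice coefficient of two sets: 2 * |a & b| / (|a| + |b|)."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def similar_pairs(fingerprints, similarity, threshold=0.99):
    """Return index pairs (i, j) with i < j whose similarity meets the threshold."""
    hits = []
    # combinations() yields each unordered pair once and never pairs an item with itself.
    for (i, fp_i), (j, fp_j) in combinations(enumerate(fingerprints), 2):
        if similarity(fp_i, fp_j) >= threshold:
            hits.append((i, j))
    return hits


# Toy "fingerprints": the set of characters in each string (an assumption
# made only so the sketch is runnable). They are built once, before the loop.
fingerprints = [frozenset("CCO"), frozenset("OCC"), frozenset("CCN")]

pairs = similar_pairs(fingerprints, toy_dice)
print(pairs)
```

With my data I would presumably build the list with `AllChem.GetMorganFingerprint(Chem.MolFromSmiles(smiles), 2)` once per row and pass `DataStructs.DiceSimilarity` as `similarity`. Each pair would then appear once instead of twice, so the output file would have half as many lines as my current script produces. This still compares every pair, so I assume it remains quadratic in the number of entries.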