Speeding up the implementation of the similarity of compounds' canonical smiles using rdkit

Question

This wannabe bioinformatician needs your help. The code below finds the similarity of compounds' canonical smiles, using rdkit. After some research I understand it must be O(n)! (or not?) because for a small file of 944 entries it took 20 minutes while for the largest one which is 330.000 entries has been running for over 30 hours. Now, I now that one of its problems is that it doesn't compare the elements only once so that is one factor which slows it down. I read here that you can use the itertools library to make a comparison fast, but generally how could this code be made better? Any help would be appreciated while I try to learn :)

from rdkit import Chem
from rdkit import DataStructs
from rdkit.Chem import AllChem
import pandas as pd


l =[]
s1 = []
s2 = []
d1 = []
d2 = []
with open('input_file.csv', 'r') as f:
    df = pd.read_csv(f, delimiter = ',', lineterminator = '\n', header = 0)
    for i in range(0, df.shape[0]):
        l.append(df.iloc[i, 1])


for i in range(0, df.shape[0]):
    for j in range(0, df.shape[0]):
        m1 = Chem.MolFromSmiles(df.iloc[i, 1])
        fp1 = AllChem.GetMorganFingerprint(m1,2)
        m2 = Chem.MolFromSmiles(df.iloc[j, 1])
        fp2 = AllChem.GetMorganFingerprint(m2,2)
        sim = DataStructs.DiceSimilarity(fp1,fp2)
        if sim >= 0.99:
            s1.append(i)
            s2.append(j)
for k in range(0, len(s1)):
    if df.iloc[s1[k], 0] != df.iloc[s2[k], 0]:
        d1.append(df.iloc[s1[k], 0])
        d2.append(df.iloc[s2[k], 0])
if len(d1) != 0:
    with open('outputfile.tsv', 'a') as f2:
        for o in range(0, len(d1)):
            f2.write(str(d1[o]) + '\t' + str(d2[0]) + '\n')

you retrieve `m1` and `fp1` in every `j`'th iteration although it can be on the outer loop alone and the result will be the same. for start..; then I think that you could drop the `k` loop and enter the logic to the 2 nested loops — David, Feb 15 '20 at 09:48
I notice that your similarity threshold is 0.99, so essentially you are searching for identical molecules. If your SMILES strings are truly canonical then identical compounds with have identical SMILES strings. Instead of computing a molecular graph and fingerprint you could just directly compare the SMILES — Oliver Scott, Feb 17 '20 at 17:33

Matus Dubrava · Answer 1 · 2020-02-15T15:18:47.490

I have no idea what the algorithm is supposed to do, therefore I am not going to comment on it. But, you are stating that:

After some research I understand it must be O(n)!

What does the n represent? If the algorithm's time complexity is linear with respect to the number of rows in your dataset then your implementation must be incorrect. You have two nested loops in your code, both with the length of n which means that your algorithm is in O(n^2) at best (not considering what the other functions inside of the loop are doing).

Here are some suggestions how to speed up the code to a certain degree (in general when working with pandas).

You should avoid doing iterations on your own and you should avoid turning pandas data structures into python lists. Here is an example:

for i in range(0, df.shape[0]):
        l.append(df.iloc[i, 1])

If you really need to store this in another variable then you can use

l = df.iloc[:, 1].copy()

This will be faster and it will not turn that series into a list (but I don't see l being used anywhere in your code so you should probably drop it entirely).

Another example is when you are computing those functions inside of the nested loop (again, I don't know what they are doing but it doesn't really matter).

for i ...
    for j ...
        m1 = Chem.MolFromSmiles(df.iloc[i, 1])
        fp1 = AllChem.GetMorganFingerprint(m1,2) 
        m2 = Chem.MolFromSmiles(df.iloc[j, 1])
        fp2 = AllChem.GetMorganFingerprint(m2,2)

First, you are computing the same values twice which can be time consuming and you are doing it within your custom loop which is not the best idea either.

Instead of those 4 lines (6 including the loop statements), you can create a new column of fp values:

df["fp"] = df.iloc[:, i].copy()
df["fp"].apply(lambda x: AllChem.GetMorganFingerprint(Chem.MolFromSmiles(x), 2))

This way, you don't have to compute the values twice and you don't have to write your own loop (at least for this part).

At this point, you will need to figure out how the mentioned O(n) algorithm works but I guess that it can be translated into pure vector operations which is probably going to be the most efficient implementation.

Speeding up the implementation of the similarity of compounds' canonical smiles using rdkit

1 Answers1