How to use RDKit to calculate molecular fingerprints and similarity for a list of SMILES structures?

Question

I'm using RDKit to calculate molecular similarity, based on the Tanimoto coefficient, between two lists of molecules given as SMILES strings. I can already extract the SMILES from two separate CSV files. How do I pass these structures to the fingerprint module in RDKit, and how do I calculate the pairwise similarity between the molecules of the two lists?

from rdkit import Chem
from rdkit import DataStructs
from rdkit.Chem.Fingerprints import FingerprintMols
ms = [Chem.MolFromSmiles('CCOC'), Chem.MolFromSmiles('CCO'), ... Chem.MolFromSmiles('COC')]
fps = [FingerprintMols.FingerprintMol(x) for x in ms]
DataStructs.FingerprintSimilarity(fps[0],fps[1])

I want to put all my SMILES structures (over 10,000) into the 'ms' list and get their fingerprints. Then I'll compare the similarity between each pair of molecules from the two lists. Is a for loop needed here?

Thanks in advance!

I used a pandas DataFrame to select and print the lists of structures, and saved them as list_1 and list_2. At the ms1 line, I get the following error:

TypeError: No registered converter was able to produce a C++ rvalue of type std::__cxx11::basic_string<wchar_t, 
std::char_traits<wchar_t>, std::allocator<wchar_t> > from this Python object of type float

Then I checked the files, and the smiles column contains only SMILES. But when I manually put some molecule structures into the lists for testing, I still get errors regarding

fpArgs['minSize'].

For example, the SMILES for gadodiamide is "O=C1[O-][Gd+3]234567[O]=C(C[N]2(CC[N]3(CC([O-]4)=O)CC[N]5(CC(=[O]6)NC)CC(=O)[O-]7)C1)NC", and the error raised at the fps line is as follows:

ArgumentError: Python argument types in
rdkit.Chem.rdmolops.RDKFingerprint(NoneType, int, int, int, int, int, float, int)
did not match C++ signature:
RDKFingerprint(RDKit::ROMol mol, unsigned int minPath=1, 
unsigned int maxPath=7, unsigned int fpSize=2048, unsigned int nBitsPerHash=2, 
bool useHs=True, double tgtDensity=0.0, unsigned int minSize=128, bool branchedPaths=True, 
bool useBondOrder=True, boost::python::api::object atomInvariants=0, boost::python::api::object fromAtoms=0, 
boost::python::api::object atomBits=None, boost::python::api::object bitInfo=None).

How can I include the molecule names in the output file along with the similarity values if the original CSV file looks like this:

names,smiles,value,value2
molecule1,CCOCN(C)(C),0.25,A
molecule2,CCO,1.12,B
molecule3,COC,2.25,C

I added this code to include the molecule names in the output file, and there is an array value error regarding the names (particularly for d2):

name_1 = df_1['id1']
name_2 = df_2['id2']
name_3 = pd.concat([name_1, name_2])
# create a list for the dataframe
d1, qu, d2, ta, sim = [], [], [], [], []
for n in range(len(fps)-1): 
    s = DataStructs.BulkTanimotoSimilarity(fps[n], fps[n+1:]) 
    #print(c_smiles[n], c_smiles[n+1:])
    for m in range(len(s)):
        qu.append(c_smiles[n])
        ta.append(c_smiles[n+1:][m])
        sim.append(s[m])
        d1.append(name_3[n])
        d2.append(name_3[n+1:][m])
    #print()
d = {'ID_1':d1, 'query':qu, 'ID_2':d2, 'target':ta, 'Similarity':sim}
df_final = pd.DataFrame(data=d)
df_final = df_final.sort_values('Similarity', ascending=False)
for index, row in df.iterrows():
    print (row["ID_1"], row["query"], row["ID_2"], row["target"], row["Similarity"])
print(df_final)
# save as csv
df_final.to_csv('RESULT_3.csv', index=False, sep=',')

rapelpy · Accepted Answer · 2018-08-21T17:23:48.257

11

Edited the answer to address all comments.

RDKit has a bulk function for similarity, so you can compare one fingerprint against a whole list of fingerprints in a single call. Just loop over the list of fingerprints.
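As a minimal sketch of that idea applied to two separate lists (the original question), the loop below scores every molecule of one list against all molecules of the other. The lists `smiles_1` and `smiles_2` are hypothetical stand-ins for the CSV columns, and `Chem.RDKFingerprint` is used with its default settings as an assumption, so the numbers need not match those produced by `FingerprintMols.FingerprintMol`.

```python
from rdkit import Chem
from rdkit import DataStructs

# Hypothetical example inputs; in practice these come from the two CSV files.
smiles_1 = ['CCO', 'COC']
smiles_2 = ['CCOCC', 'CCCO', 'CCCCO']

# Path-based RDKit fingerprints with default parameters (an assumption of this sketch).
fps_1 = [Chem.RDKFingerprint(Chem.MolFromSmiles(s)) for s in smiles_1]
fps_2 = [Chem.RDKFingerprint(Chem.MolFromSmiles(s)) for s in smiles_2]

rows = []
for query, fp in zip(smiles_1, fps_1):
    # One call scores the query fingerprint against every fingerprint in fps_2.
    scores = DataStructs.BulkTanimotoSimilarity(fp, fps_2)
    for target, score in zip(smiles_2, scores):
        rows.append((query, target, score))

for row in rows:
    print(row)
```

Each entry of `rows` is a (query, target, similarity) tuple, so the list can be handed to `pd.DataFrame` directly if a table is needed.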

If the CSVs look like this

First CSV, containing an invalid SMILES

smiles,value,value2
CCOCN(C)(C)C,0.25,A
CCO,1.12,B
COC,2.25,C

Second CSV, with valid SMILES only

smiles,value,value2
CCOCC,0.55,D
CCCO,2.58,E
CCCCO,5.01,F

This is how to read the SMILES, drop the invalid ones, compute the pairwise fingerprint similarity without duplicate pairs, and save the sorted values.

from rdkit import Chem
from rdkit import DataStructs
from rdkit.Chem.Fingerprints import FingerprintMols
import pandas as pd

# read and concatenate the CSVs
df_1 = pd.read_csv('first.csv')
df_2 = pd.read_csv('second.csv')
df_3 = pd.concat([df_1, df_2])

# validate the SMILES and build a list of canonical SMILES
df_smiles = df_3['smiles']
c_smiles = []
for ds in df_smiles:
    try:
        cs = Chem.CanonSmiles(ds)
        c_smiles.append(cs)
    except Exception:
        print('Invalid SMILES:', ds)
print()

# make a list of mols
ms = [Chem.MolFromSmiles(x) for x in c_smiles]

# make a list of fingerprints (fp)
fps = [FingerprintMols.FingerprintMol(x) for x in ms]

# the list for the dataframe
qu, ta, sim = [], [], []

# compare all fp pairwise without duplicates
for n in range(len(fps)-1): # -1 because the last fp has nothing left to be compared with
    s = DataStructs.BulkTanimotoSimilarity(fps[n], fps[n+1:]) # n+1: compare only with the fps after the current one
    print(c_smiles[n], c_smiles[n+1:]) # which mol is compared with which group
    # collect the SMILES and values
    for m in range(len(s)):
        qu.append(c_smiles[n])
        ta.append(c_smiles[n+1:][m])
        sim.append(s[m])
print()

# build the dataframe and sort it
d = {'query':qu, 'target':ta, 'Similarity':sim}
df_final = pd.DataFrame(data=d)
df_final = df_final.sort_values('Similarity', ascending=False)
print(df_final)

# save as csv
df_final.to_csv('third.csv', index=False, sep=',')

The printout:

Invalid SMILES: CCOCN(C)(C)C

CCO ['COC', 'CCOCC', 'CCCO', 'CCCCO']
COC ['CCOCC', 'CCCO', 'CCCCO']
CCOCC ['CCCO', 'CCCCO']
CCCO ['CCCCO']

   query target  Similarity
9   CCCO  CCCCO    0.769231
2    CCO   CCCO    0.600000
1    CCO  CCOCC    0.500000
7  CCOCC   CCCO    0.466667
3    CCO  CCCCO    0.461538
8  CCOCC  CCCCO    0.388889
4    COC  CCOCC    0.333333
5    COC   CCCO    0.272727
0    CCO    COC    0.250000
6    COC  CCCCO    0.214286
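To keep molecule names in the output, the names have to be filtered together with the SMILES so that both lists stay aligned after invalid rows are dropped. The following is only a sketch under assumptions: the in-memory CSV, the column names `names` and `smiles`, and the output column names are illustrative, and `Chem.RDKFingerprint` with default settings stands in for the fingerprint call used above.

```python
import io

import pandas as pd
from rdkit import Chem
from rdkit import DataStructs

# Hypothetical in-memory CSV standing in for the real files;
# the last SMILES is deliberately invalid (neutral nitrogen with four bonds).
csv_text = """names,smiles
molecule1,CCO
molecule2,COC
molecule3,CCCO
molecule4,CCOCN(C)(C)C
"""
df = pd.read_csv(io.StringIO(csv_text))

names, smiles, fps = [], [], []
for name, smi in zip(df['names'], df['smiles']):
    # Empty cells arrive as float NaN; MolFromSmiles returns None for unparsable strings.
    mol = Chem.MolFromSmiles(smi) if isinstance(smi, str) else None
    if mol is None:
        print('Skipping:', name, smi)
        continue
    # Append to all three lists together so their positions stay aligned.
    names.append(name)
    smiles.append(smi)
    fps.append(Chem.RDKFingerprint(mol))

records = []
for n in range(len(fps) - 1):
    scores = DataStructs.BulkTanimotoSimilarity(fps[n], fps[n + 1:])
    # offset is the position of the target molecule in the aligned lists.
    for offset, score in enumerate(scores, start=n + 1):
        records.append({'ID_1': names[n], 'query': smiles[n],
                        'ID_2': names[offset], 'target': smiles[offset],
                        'Similarity': score})

df_final = pd.DataFrame(records).sort_values('Similarity', ascending=False)
print(df_final)
```

Because plain Python lists are indexed by position, this avoids the label-based lookup problems that come from indexing a concatenated pandas object with `name_3[n]`.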

edited Aug 21 '18 at 17:23

answered Aug 04 '18 at 06:58

rapelpy


Thanks for your answer! Your code works well. So how can I import my structures from the CSV file into the two lists? – Anna Zhou Aug 06 '18 at 21:11
In your question you wrote that you are able to extract the SMILES from a csv. Didn't you put them in a list? What did you do? – rapelpy Aug 07 '18 at 04:15
I used a pandas DataFrame to select and print the lists of structures, and saved them as list_1 and list_2. At the ms1 line, I get the following error: TypeError: No registered converter was able to produce a C++ rvalue of type std::__cxx11::basic_string<wchar_t, std::char_traits<wchar_t>, std::allocator<wchar_t> > from this Python object of type float – Anna Zhou Aug 07 '18 at 21:19
Your lists contain float numbers, not SMILES. I added a pandas/CSV example to my answer. – rapelpy Aug 08 '18 at 14:32
Thank you. I tried and it returned the same error code at the same point. – Anna Zhou Aug 09 '18 at 19:10
Check if the CSV is correct. Maybe there are some numbers instead of SMILES. – rapelpy Aug 11 '18 at 05:21
I edited my questions to include the error codes. I think the csv files are just fine with only smiles structures. Thank you! – Anna Zhou Aug 14 '18 at 21:13
There are also several blanks in the smiles column (the drug molecules with no SMILES structures), when I print these elements from the lists, the output shows "nan". Will it affect the overall results? – Anna Zhou Aug 14 '18 at 22:36
If there is a "nan" you get the "float error". Delete the row or put in a SMILES by hand. But be careful with the SMILES, because RDKit is very strict about them. Your gadodiamide SMILES will not work with RDKit, but when I use the SMILES from PubChem "CNC(=O)CN(CCN(CCN(CC(=O)NC)CC(=O)[O-])CC(=O)[O-])CC(=O)[O-].[Gd+3]" it works. It is always good to validate the SMILES before use, but that's another question. – rapelpy Aug 15 '18 at 16:28
Yes I've removed the "nan" values, and it returns the same error. Do you know how to write it as a for loop to check if there's some problematic structures? – Anna Zhou Aug 16 '18 at 18:24
I always convert my SMILES to canonical SMILES, so I can check whether they are valid and later check for duplicates. I edited my answer to include this check. – rapelpy Aug 16 '18 at 19:55
After replacing some of the structures with canonical SMILES, I finally got my matrix out! THANK YOU! and one more question, since the output is a huge matrix with only numbers, how can I pull out the values into a csv file, for example, with their names, so that I can know which molecule is compared to which ones, and can get a ranking of all the similarity values of it? – Anna Zhou Aug 17 '18 at 00:22
Concatenate cs1 and cs2 instead of ms1 and ms2. Slice the values and SMILES during the similarity loop and append them to lists. Transform the lists into a dataframe, sort the dataframe and save it as a CSV. These are Python and pandas basics you should learn because you will need them often. – rapelpy Aug 18 '18 at 06:51
I've tried cs=cs1+cs2 and there's still an error, and I cannot get the molecule names out along with the similarity values. Can you show me some sample code on how to do this? I'm really new to Python and am learning the basics slowly. – Anna Zhou Aug 21 '18 at 02:57
Did a complete edit of the answer for a more complete solution. Try it with my sample csv's to see how it works. – rapelpy Aug 21 '18 at 17:24
Thank you! I finally got a really big output file, but many of the SMILES structures seem to have changed in the new file. When I copy a certain structure from the new file and search for it in the original file, I cannot find it. For example, in the original file, the structure of N-Methyltryptophan is C[NH2+][C@@H](Cc1c[nH]c2ccccc12)C([O-])=O, and it changes to C[NH2+][C@@H](Cc1c[nH]c2ccccc12)C(=O)[O-] in the new file. So it's hard to match the molecules. – Anna Zhou Aug 24 '18 at 00:33
That's because all your SMILES were replaced with canonical SMILES. If you want the original SMILES only to be checked, but not replaced, change 'cs = Chem.CanonSmiles(ds)' --> 'Chem.CanonSmiles(ds)' and 'c_smiles.append(cs)' --> 'c_smiles.append(ds)'. – rapelpy Aug 24 '18 at 15:41
Thank you! So the output file contains the smiles and similarity values in it. In order to compare the results easily, how to include the molecule names in the output file as well? I've edited the question to show the example csv file. Thanks. – Anna Zhou Aug 31 '18 at 21:10
You get the molecule names the same way you got the SMILES in the output. To get SMILES and names in parallel, use pandas iterrows. Search for 'for index, row in df.iterrows()' and you can find explanations and examples. – rapelpy Sep 02 '18 at 09:00
Thank you! I've learned the codes and applied to my scripts. I edited my question to include those codes. The new error here is about the array value, for line "d2.append(name_3[n+1:][m])". Do you know how to solve this problem? – Anna Zhou Sep 05 '18 at 09:04
'c_smiles' and 's' are lists, but name_3 is a pandas Series. You have to put the names into a list: 'na3 = [na for na in name_3]'. Now you could append from na3 instead of name_3, but this will not work, because the names of the invalid SMILES are still in the list, so you have to make a clean list of names just as you first made a clean list of SMILES. I mentioned above that you can do it the same way you made the list of SMILES. – rapelpy Sep 05 '18 at 17:24
@rapelpy, I have tried your answer in a slightly different way. You made 1D lists for `ms` and `fps`, but I need the drug name in my output file, so I made a 2D list. More precisely, each row of my list holds four pieces of information (drug name, smiles, ms, and fps). Now, could you tell me how I can get the `score` using `DataStructs.BulkTanimotoSimilarity`? My list has become a little complex. I am trying to write it this way: `s = DataStructs.BulkTanimotoSimilarity(fps[n][3], fps[n+1:][3])`. Please note that I am using `[3]` because my `fps` is in the `4th column of the 2D list`. – mostafiz67 Oct 25 '20 at 17:08
@mostafiz67 I can not open your notebook (it is not made public) and please start a new question. – rapelpy Oct 25 '20 at 19:14
@rapelpy would you mind to check my question (https://stackoverflow.com/questions/67878866/valueerror-bitvects-must-be-same-length-rdkit?noredirect=1#comment119985854_67878866). I would be grateful. – Opps_0 Jun 08 '21 at 19:00
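On the canonical SMILES point raised in the comments above: two different strings can describe the same molecule, and canonicalizing both is one way to match them. The sketch below uses the two N-Methyltryptophan strings quoted in the thread; the dictionary `canonical_to_original` is a hypothetical helper for mapping output rows back to the original file, not part of RDKit.

```python
from rdkit import Chem

# The two spellings of N-Methyltryptophan quoted in the comments.
original = 'C[NH2+][C@@H](Cc1c[nH]c2ccccc12)C([O-])=O'
rewritten = 'C[NH2+][C@@H](Cc1c[nH]c2ccccc12)C(=O)[O-]'

# Canonicalize both strings so they can be compared as plain text.
canon_original = Chem.CanonSmiles(original)
canon_rewritten = Chem.CanonSmiles(rewritten)
print(canon_original == canon_rewritten)

# Hypothetical lookup table: canonical SMILES -> SMILES as written in the source file.
canonical_to_original = {canon_original: original}
print(canonical_to_original[Chem.CanonSmiles(rewritten)])
```

Building such a table while reading the CSV makes it possible to report either form in the output without losing the link to the original rows.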
