Calculate Tanimoto coefficient for dataframe

Question

I have a table that looks like this:

and I want to calculate the Tanimoto coefficient (a molecular similarity measure) with RDKit in Python to get the result below:

but my attempt fails.

My data:

{'name': ['16β-hydro-ent-kauran-17-oic acid ',
  '16α-hydro-entkauran-17-oic acid ',
  'ent-kaur-16-en-19-oic acid',
  '16β,17-dihydroxy-ent-kauran-19-oic acid ',
  'annomontacin'],
 'canonical_smile': ['CC1(CCCC2(C1CCC34C2CCC(C3)C(C4)C(=O)O)C)C',
  'CC1(CCCC2(C1CCC34C2CCC(C3)C(C4)C(=O)O)C)C',
  'CC12CCCC(C1CCC34C2CCC(C3)C(=C)C4)(C)C(=O)O',
  'CC12CCCC(C1CCC34C2CCC(C3)C(C4)(CO)O)(C)C(=O)O',
  'CCCCCCCCCCCCC(C1CCC(O1)C(CCCCCCC(CCCCCC(CC2=CC(OC2=O)C)O)O)O)O']}

Here is my code:

import pandas as pd
import itertools
import matplotlib.pyplot as plt
from rdkit import Chem, DataStructs
from rdkit.Chem import (
    PandasTools,
    Draw,
    Descriptors,
    MACCSkeys,
    rdFingerprintGenerator)

# Build every unordered pair of unique SMILES strings as two columns.
df3 = pd.DataFrame(list(itertools.combinations(df['canonical_smile'].unique(), 2)),
                   columns=['canonical_smile1', 'canonical_smile2']).dropna()
# Add an ROMol column for each SMILES column.
PandasTools.AddMoleculeColumnToFrame(df3, 'canonical_smile1', 'ROMol1', includeFingerprints=True)
PandasTools.AddMoleculeColumnToFrame(df3, 'canonical_smile2', 'ROMol2', includeFingerprints=True)
# Calculate the circular Morgan fingerprints for both ROMol columns.
df3["morgan1"] = rdFingerprintGenerator.GetFPs(df3["ROMol1"].tolist())
df3["morgan2"] = rdFingerprintGenerator.GetFPs(df3["ROMol2"].tolist())
# Add the Tanimoto similarities between the Morgan fingerprints (this line raises the error below).
df3["tanimoto_morgan"] = DataStructs.BulkTanimotoSimilarity(df3["morgan1"], df3["morgan2"])

and this is my error:

    ArgumentError: Python argument types in
    rdkit.DataStructs.cDataStructs.BulkTanimotoSimilarity(Series, Series)
did not match C++ signature:
    BulkTanimotoSimilarity(class RDKit::SparseIntVect<unsigned __int64> v1, class boost::python::list v2, bool returnDistance=False)
    BulkTanimotoSimilarity(class RDKit::SparseIntVect<unsigned int> v1, class boost::python::list v2, bool returnDistance=False)
    BulkTanimotoSimilarity(class RDKit::SparseIntVect<__int64> v1, class boost::python::list v2, bool returnDistance=False)
    BulkTanimotoSimilarity(class RDKit::SparseIntVect<int> v1, class boost::python::list v2, bool returnDistance=False)
    BulkTanimotoSimilarity(class ExplicitBitVect const * __ptr64 bv1, class boost::python::api::object bvList, bool returnDistance=0)
    BulkTanimotoSimilarity(class SparseBitVect const * __ptr64 bv1, class boost::python::api::object bvList, bool returnDistance=0)

can you include a sample of your dataframe as formatted text so we can more easily reproduce your error? instead of the screenshots, you can copy and paste the output from `df1.head().to_dict()` into your question, thanks! — Derek O, Feb 13 '23 at 05:17

Derek O · Answer 1 · 2023-02-13T19:03:40.953

0

Disclaimer: I don't have much chemistry background, but my understanding is BulkTanimotoSimilarity is a similarity metric between a query fingerprint and a list of target fingerprints (based on this article).

From the error message, you are passing two pd.Series objects to BulkTanimotoSimilarity, whereas this function expects a single fingerprint (for example an ExplicitBitVect or a SparseIntVect) followed by a list (or list-like) of fingerprints.

So if we take each bit vector in column morgan1 to be your query fingerprint, and take the entire column morgan2 to be your list of target fingerprints, we can do something like the following:

df3["tanimoto_morgan"] = df3['morgan1'].map(lambda morgan1: DataStructs.BulkTanimotoSimilarity(morgan1, df3['morgan2']))

This runs and adds the following column to df3, where each row holds a list of similarities against every fingerprint in morgan2 rather than a single value:

>>> df3['tanimoto_morgan']
0    [0.42592592592592593, 0.4107142857142857, 0.07...
1    [0.42592592592592593, 0.4107142857142857, 0.07...
2    [0.42592592592592593, 0.4107142857142857, 0.07...
3    [1.0, 0.5272727272727272, 0.0875, 0.5272727272...
4    [1.0, 0.5272727272727272, 0.0875, 0.5272727272...
5    [0.5272727272727272, 1.0, 0.08536585365853659,...
Name: tanimoto_morgan, dtype: object
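
To make the difference between a one-to-one and a one-to-many comparison concrete, here is a minimal pure-Python sketch. It models a fingerprint as a set of on-bit indices; `tanimoto` and `bulk_tanimoto` are hypothetical helper names for illustration, not RDKit functions, and returning 0.0 for two empty fingerprints is an assumption of this sketch.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto coefficient of two sets of on-bit indices: |A & B| / |A | B|."""
    union = len(a | b)
    # Assumption of this sketch: two empty fingerprints score 0.0.
    return len(a & b) / union if union else 0.0


def bulk_tanimoto(query: set, targets: list) -> list:
    """One query fingerprint against many targets: one score per target."""
    return [tanimoto(query, target) for target in targets]


query = {1, 4, 7, 9}
targets = [{1, 4, 7, 9}, {1, 4, 8}, {2, 3}]

# One-to-one: a single number (2 shared bits out of 5 distinct bits).
pairwise = tanimoto(query, targets[1])

# One-to-many: a list with one number per target.
one_to_many = bulk_tanimoto(query, targets)

print(pairwise)      # 0.4
print(one_to_many)   # [1.0, 0.4, 0.0]
```

Mapping a one-to-many function over a column therefore yields a list per row, which matches the shape of the output above.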

edited Feb 13 '23 at 19:03

answered Feb 13 '23 at 16:17

Derek O


I have got an error when I use your code. Anyway, thanks a lot – jacobdavis Feb 14 '23 at 02:21
@jacobdavis what's the error you're getting? – Derek O Feb 14 '23 at 02:22
I'm tired, and I talked with ChatGPT, and the outcome is quite surprising. He fixed it for me, and I can now run smoothly. `df3["tanimoto_morgan"] = [DataStructs.TanimotoSimilarity(fp1, fp2) for fp1, fp2 in zip(df3["morgan1"], df3["morgan2"])]` – jacobdavis Feb 14 '23 at 02:28
that is surprising because i'm pretty certain i tried this myself and it didn't work, but it seems to work now. and my code runs too but gives you a list for each row instead of an individual value. but anyway, i'm glad you got the right answer – you might consider adding the chatgpt generated answer as an answer when you have time just to help others in the future – Derek O Feb 14 '23 at 02:36

score 0 · Answer 2 · answered Feb 14 '23 at 07:44

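I think the problem is this line, which passes two pandas Series to BulkTanimotoSimilarity even though that function compares a single fingerprint against a list of fingerprints: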

df3["tanimoto_morgan"] = DataStructs.BulkTanimotoSimilarity(df3["morgan1"], df3["morgan2"])

I fixed it by calling TanimotoSimilarity on each pair of fingerprints, which gives one value per row, and it now runs normally:

df3["tanimoto_morgan"] = [DataStructs.TanimotoSimilarity(fp1, fp2) for fp1, fp2 in zip(df3["morgan1"], df3["morgan2"])]
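
For readers who want to reproduce this end to end, here is a small self-contained sketch using three of the SMILES strings from the question. It builds fingerprints with `rdFingerprintGenerator.GetMorganGenerator` instead of `GetFPs`; the `radius=2` and `fpSize=2048` settings and the `pairs` and `fps` names are assumptions chosen for illustration, so the numbers may differ from those shown elsewhere on this page.

```python
import itertools

import pandas as pd
from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator

smiles = [
    "CC1(CCCC2(C1CCC34C2CCC(C3)C(C4)C(=O)O)C)C",
    "CC12CCCC(C1CCC34C2CCC(C3)C(=C)C4)(C)C(=O)O",
    "CC12CCCC(C1CCC34C2CCC(C3)C(C4)(CO)O)(C)C(=O)O",
]

# Assumed settings: Morgan fingerprints, radius 2, 2048 bits.
gen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
fps = {smi: gen.GetFingerprint(Chem.MolFromSmiles(smi)) for smi in smiles}

# One row per unordered pair of molecules.
pairs = pd.DataFrame(
    list(itertools.combinations(smiles, 2)),
    columns=["canonical_smile1", "canonical_smile2"],
)

# One-to-one: TanimotoSimilarity returns a single float per pair.
pairs["tanimoto_morgan"] = [
    DataStructs.TanimotoSimilarity(fps[a], fps[b])
    for a, b in zip(pairs["canonical_smile1"], pairs["canonical_smile2"])
]

# One-to-many: BulkTanimotoSimilarity takes ONE fingerprint and a plain list.
one_to_many = DataStructs.BulkTanimotoSimilarity(
    fps[smiles[0]], [fps[s] for s in smiles]
)

print(pairs[["tanimoto_morgan"]])
print(one_to_many)
```

The pairwise column has one float per row, while `one_to_many` has one float per target fingerprint, starting with the molecule compared against itself.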
