I'm using RDKit to calculate molecular similarity, based on the Tanimoto coefficient, between two lists of molecules given as SMILES structures. I can already extract the SMILES structures from two separate csv files. How do I pass these structures to the fingerprint module in RDKit, and how do I calculate the similarity pairwise, one by one, between the two lists of molecules?
from rdkit import Chem
from rdkit import DataStructs
from rdkit.Chem.Fingerprints import FingerprintMols
ms = [Chem.MolFromSmiles('CCOC'), Chem.MolFromSmiles('CCO'), ... Chem.MolFromSmiles('COC')]
fps = [FingerprintMols.FingerprintMol(x) for x in ms]
DataStructs.FingerprintSimilarity(fps[0],fps[1])
I want to put all the SMILES structures I have (over 10,000) into the 'ms' list and get their fingerprints. Then I'll compare the similarity between each pair of molecules from the two lists. Maybe a for loop is needed here?
Thanks in advance!
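A rough sketch of the cross-list comparison described above, for illustration only. It uses plain Python sets of on-bit indices as stand-ins for RDKit fingerprints, so `tanimoto` and `cross_similarity` are hypothetical helper names, not RDKit functions; with real fingerprints, `DataStructs.FingerprintSimilarity` from the snippet above would take the place of `tanimoto`.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two sets of on-bit indices: |A & B| / |A | B|."""
    union = len(bits_a | bits_b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(bits_a & bits_b) / union


def cross_similarity(fps_1, fps_2, sim=tanimoto):
    """Compare every fingerprint in fps_1 with every fingerprint in fps_2.

    Returns a list of (i, j, similarity) tuples, where i indexes fps_1
    and j indexes fps_2.
    """
    return [(i, j, sim(a, b))
            for i, a in enumerate(fps_1)
            for j, b in enumerate(fps_2)]


# Toy fingerprints (made-up bit indices, not derived from real molecules).
fps_1 = [{1, 2, 3}, {4, 5}]
fps_2 = [{2, 3, 4}, {4, 5}, {9}]
results = cross_similarity(fps_1, fps_2)
```

With two lists of 10,000 molecules each, this yields 10^8 pairs, so writing results out incrementally may be preferable to holding them all in memory.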
I used a pandas DataFrame to select and print the lists of structures, and saved them as list_1 and list_2. When execution reaches the ms1 line, I get the following error:
TypeError: No registered converter was able to produce a C++ rvalue of type std::__cxx11::basic_string<wchar_t,
std::char_traits<wchar_t>, std::allocator<wchar_t> > from this Python object of type float
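This error says the parser received a Python float where it expected a string. One possible cause, offered as an assumption rather than a diagnosis: pandas represents empty cells as NaN, which is a float, so a blank row in the SMILES column would reach `Chem.MolFromSmiles` as a float. A minimal sketch of filtering such entries out first; `clean_smiles` is a hypothetical helper name:

```python
def clean_smiles(values):
    """Keep only non-empty strings; drop NaN floats, None, and blank cells."""
    return [v.strip() for v in values if isinstance(v, str) and v.strip()]


# float('nan') mimics what pandas stores for an empty CSV cell.
list_1 = ['CCOC', float('nan'), 'CCO', '', None, ' COC ']
smiles_1 = clean_smiles(list_1)
```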
I then checked the files, and the smiles column contains only SMILES strings. But when I manually put some molecule structures into the lists for testing, I still get errors regarding fpArgs['minSize'].
For example, the SMILES for gadodiamide is "O=C1[O-][Gd+3]234567[O]=C(C[N]2(CC[N]3(CC([O-]4)=O)CC[N]5(CC(=[O]6)NC)CC(=O)[O-]7)C1)NC", and the error is as follows (raised when running the fps line):
ArgumentError: Python argument types in
rdkit.Chem.rdmolops.RDKFingerprint(NoneType, int, int, int, int, int, float, int)
did not match C++ signature:
RDKFingerprint(RDKit::ROMol mol, unsigned int minPath=1,
unsigned int maxPath=7, unsigned int fpSize=2048, unsigned int nBitsPerHash=2,
bool useHs=True, double tgtDensity=0.0, unsigned int minSize=128, bool branchedPaths=True,
bool useBondOrder=True, boost::python::api::object atomInvariants=0, boost::python::api::object fromAtoms=0,
boost::python::api::object atomBits=None, boost::python::api::object bitInfo=None).
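The `NoneType` in the first position of this signature means a `None` was passed to the fingerprint function instead of a molecule. `Chem.MolFromSmiles` returns `None` when it cannot parse a SMILES string, so one way to avoid the error is to separate parsed molecules from failures before fingerprinting. A minimal sketch follows; `split_parsed` is a hypothetical helper, and the parser is passed in as an argument so the example runs without RDKit (with RDKit, `Chem.MolFromSmiles` would be passed as `parse`).

```python
def split_parsed(smiles_list, parse):
    """Parse each SMILES; return (kept, failed).

    kept   -- list of (smiles, mol) pairs where parse() returned a molecule
    failed -- list of SMILES strings for which parse() returned None
    """
    kept, failed = [], []
    for smi in smiles_list:
        mol = parse(smi)
        if mol is None:
            failed.append(smi)
        else:
            kept.append((smi, mol))
    return kept, failed


def fake_parse(smi):
    """Stand-in parser for the demo: rejects strings containing '?'."""
    return None if '?' in smi else ('mol', smi)


kept, failed = split_parsed(['CCO', 'C?C', 'COC'], fake_parse)
```

Keeping the failed strings in their own list makes it easy to inspect which structures need fixing, rather than silently dropping them.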
How can I include the molecule names in the output file along with the similarity values, if the original csv file looks like this:
names,smiles,value,value2
molecule1,CCOCN(C)(C),0.25,A
molecule2,CCO,1.12,B
molecule3,COC,2.25,C
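One way to keep names attached to structures is to read both columns together. A small sketch using the standard `csv` module on the sample rows above (pandas would work as well; `load_named_smiles` is a hypothetical helper name):

```python
import csv
import io

SAMPLE = """names,smiles,value,value2
molecule1,CCOCN(C)(C),0.25,A
molecule2,CCO,1.12,B
molecule3,COC,2.25,C
"""


def load_named_smiles(handle):
    """Return a list of (name, smiles) tuples from a CSV with those columns."""
    reader = csv.DictReader(handle)
    return [(row['names'], row['smiles']) for row in reader]


# io.StringIO stands in for an open file so the example is self-contained.
records = load_named_smiles(io.StringIO(SAMPLE))
```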
I added the following code to include the molecule names in the output file, and there is an array value error regarding the names (particularly for d2):
name_1 = df_1['id1']
name_2 = df_2['id2']
name_3 = pd.concat([name_1, name_2])
# create a list for the dataframe
d1, qu, d2, ta, sim = [], [], [], [], []
for n in range(len(fps)-1):
    s = DataStructs.BulkTanimotoSimilarity(fps[n], fps[n+1:])
    #print(c_smiles[n], c_smiles[n+1:])
    for m in range(len(s)):
        qu.append(c_smiles[n])
        ta.append(c_smiles[n+1:][m])
        sim.append(s[m])
        d1.append(name_3[n])
        d2.append(name_3[n+1:][m])
    #print()
d = {'ID_1':d1, 'query':qu, 'ID_2':d2, 'target':ta, 'Similarity':sim}
df_final = pd.DataFrame(data=d)
df_final = df_final.sort_values('Similarity', ascending=False)
for index, row in df.iterrows():
    print(row["ID_1"], row["query"], row["ID_2"], row["target"], row["Similarity"])
print(df_final)
# save as csv
df_final.to_csv('RESULT_3.csv', index=False, sep=',')
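A possible explanation for the error on `d2`, stated as an assumption: `pd.concat` keeps each Series' original index unless `ignore_index=True` is passed, so `name_3` can contain duplicate labels and `name_3[n]` may return several values instead of one. Converting the names to a plain list sidesteps label lookups entirely. A sketch of the pairing loop with plain lists; `pair_rows` and the toy similarity function are hypothetical:

```python
from itertools import combinations


def pair_rows(names, smiles, fps, sim):
    """Build one row per unordered pair, sorted by descending similarity."""
    rows = []
    for i, j in combinations(range(len(fps)), 2):
        rows.append({
            'ID_1': names[i], 'query': smiles[i],
            'ID_2': names[j], 'target': smiles[j],
            'Similarity': sim(fps[i], fps[j]),
        })
    return sorted(rows, key=lambda r: r['Similarity'], reverse=True)


def toy_sim(a, b):
    """Toy Tanimoto on sets of bit indices, standing in for RDKit."""
    return len(a & b) / len(a | b)


names = ['molecule1', 'molecule2', 'molecule3']
smiles = ['CCOCN(C)(C)', 'CCO', 'COC']
fps = [{1, 2, 3, 4}, {1, 2}, {3, 4, 5}]  # made-up bits for illustration
rows = pair_rows(names, smiles, fps, toy_sim)
```

The resulting list of dicts can be passed to `pd.DataFrame(rows)` and written out with `to_csv` as in the code above.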