I'm new to RDKit. I need to do a cluster analysis of a database of compounds. I've downloaded 191K compounds from the ZINC database in 3D mol2 format, and now I need to compute fingerprints using RDKit. First, I don't know whether it's possible to compute fingerprints from mol2 files, or which kind of fingerprint is best for this type of analysis (I need to understand which chemotypes are in the database in order to eventually pick some representatives). Does anyone have suggestions? Practical suggestions are really appreciated, too. Thanks

Alice

1 Answer

RDKit supports loading mol2 files. You can use the MolFromMol2File function for that.

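from rdkit import Chem

# Placeholder paths; replace with the paths to your own mol2 files
mol2_paths = ['path1', 'path2', 'path3']  # ......

mols = []
for path in mol2_paths:
    mol = Chem.MolFromMol2File(path)
    if mol is not None:  # MolFromMol2File returns None when parsing fails
        mols.append(mol)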

The loop above loads each mol2 file and creates an RDKit molecule object for it. Note that MolFromMol2File returns None for a file it cannot parse, so check for that before using the result. Once you have a molecule object, you can use it to calculate any property or fingerprint, just as you would for a molecule built from a SMILES string.
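To address the fingerprint part of the question, here is a minimal sketch of computing fingerprints from molecule objects. It is only an illustration: the choice of Morgan (circular) fingerprints, the radius of 2, and the 2048-bit size are assumptions rather than a recommendation taken from this thread, and the molecules are built from a few stand-in SMILES strings so the snippet runs without any mol2 files.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator

# Stand-in molecules (hypothetical examples); in practice `mols` would be
# the list of molecules loaded with Chem.MolFromMol2File.
smiles = ['CCO', 'CCN', 'c1ccccc1', 'Cc1ccccc1']
mols = [Chem.MolFromSmiles(s) for s in smiles]

# Morgan fingerprint generator; radius and size are assumed settings.
gen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
fps = [gen.GetFingerprint(m) for m in mols]

# Tanimoto similarity between two fingerprints (1.0 means identical bits).
self_sim = DataStructs.TanimotoSimilarity(fps[2], fps[2])
cross_sim = DataStructs.TanimotoSimilarity(fps[2], fps[3])
print(self_sim, cross_sim)
```

The resulting bit vectors can be compared pairwise with Tanimoto similarity, which is a common starting point for grouping compounds by chemotype.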

Now, for clustering, RDKit provides a ClusterData function (in the rdkit.ML.Cluster.Butina module) that you can use. See the module here, an example of its usage here, and another example here. This presentation covers different methods of clustering in RDKit here, and an alternative way to cluster is described here.
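As a rough illustration of the clustering step, the sketch below runs Butina clustering on Tanimoto distances and takes the first member of each cluster (the centroid) as its representative. The fingerprint type, the 0.4 distance cutoff, and the stand-in SMILES are all assumptions made for the example. Also keep in mind that the pairwise distance list grows quadratically, so for 191K compounds (about 1.8e10 pairs) you would likely need to cluster in batches or use a different approach.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator
from rdkit.ML.Cluster import Butina

# Stand-in molecules (hypothetical examples) in place of the loaded mol2 files.
smiles = ['CCO', 'CCCO', 'CCN', 'c1ccccc1', 'Cc1ccccc1', 'CCc1ccccc1']
mols = [Chem.MolFromSmiles(s) for s in smiles]

gen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
fps = [gen.GetFingerprint(m) for m in mols]

# Build the flattened lower-triangle distance list (1 - Tanimoto similarity)
# in the order that Butina.ClusterData expects.
dists = []
for i in range(1, len(fps)):
    sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i])
    dists.extend(1.0 - s for s in sims)

cutoff = 0.4  # assumed distance threshold; tune it for your data
clusters = Butina.ClusterData(dists, len(fps), cutoff, isDistData=True)

# Each cluster is a tuple of molecule indices; the first index is the centroid.
representatives = [smiles[c[0]] for c in clusters]
print(clusters)
print(representatives)
```

A smaller cutoff gives more, tighter clusters; a larger one merges more compounds into each cluster.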

I hope this is enough information to get you started.

betelgeuse