
I'm trying to build an ML model for chemistry. The input data set is fairly large (~1M molecules), so I can't compute the full list of available descriptors for every molecule. Instead, I take a sample and fit my model on it to get a list of the most important descriptors. How can I compute descriptors for molecules in mordred, given a list of descriptor names? I would also be glad to learn about other ways to generate molecular descriptors.

Here is the code:

import pandas as pd
import swifter  # registers the .swifter accessor on pandas objects
from rdkit import Chem
from mordred import Calculator, descriptors

res['mols'] = res['smiles'].swifter.apply(Chem.MolFromSmiles)

calc = Calculator(descriptors, ignore_3D=True)
desc = calc.pandas(res['mols'])

# The model implementation is omitted

# Names of the 100 most important descriptors
most_important = pd.DataFrame((desc.columns, model.feature_importances_)).T.sort_values(by=1, ascending=False).head(100)[0].values
marc_s
  • Hi Roma. Could you explain what kind of error you get? Is it a size limitation, or is it the shape of the dataset, e.g. 2x2, 3x2, or higher-dimensional? What does the mordred function expect as input? – TeilaRei Jul 23 '20 at 16:18
  • The library's logic is odd: each descriptor is a class. So the main question is how I can get the classes from the module, given a list of their names as strings. – roma ichenko Jul 23 '20 at 17:28
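
One possible way to go from a list of names back to descriptor objects is sketched below. It is not a confirmed answer and rests on two assumptions that should be checked against the installed mordred version: that the column names produced by `calc.pandas` equal `str(descriptor)` for each entry of `calc.descriptors`, and that `Calculator` accepts an iterable of descriptor instances. The helper names `select_by_name` and `build_reduced_calculator` are hypothetical, introduced only for this illustration.

```python
def select_by_name(candidates, wanted_names):
    """Return the candidates whose str() is in wanted_names, in original order."""
    wanted = set(wanted_names)
    return [c for c in candidates if str(c) in wanted]


def build_reduced_calculator(wanted_names):
    """Build a mordred Calculator restricted to the named descriptors.

    Assumption: str(descriptor) matches the column names of calc.pandas().
    """
    # Imported lazily so select_by_name stays usable without mordred installed.
    from mordred import Calculator, descriptors

    # Instantiate the full 2D descriptor set once, only to look names up.
    full = Calculator(descriptors, ignore_3D=True)
    selected = select_by_name(full.descriptors, wanted_names)

    # Fail loudly if a requested name does not correspond to any descriptor.
    missing = set(wanted_names) - {str(d) for d in selected}
    if missing:
        raise KeyError(f"no mordred descriptor named: {sorted(missing)}")

    # Assumption: Calculator accepts an iterable of descriptor instances.
    return Calculator(selected, ignore_3D=True)
```

With the variables from the question, usage would look like `calc_small = build_reduced_calculator(most_important)` followed by `desc_small = calc_small.pandas(res['mols'])`, so only the selected descriptors are computed for the full data set.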
