
I am trying to search through a large chemical database (ChEMBL, >1,000,000 entries), and am having problems executing the code on my work computer. Our focus is chemistry, so high-performance computers are not available.

My code is below; it runs quickly on smaller subsets (<5,000 entries). On the full dataset, my 4 GB of RAM fills up and the computer halts. Is there any way to complete this task more efficiently?

import pandas as pd
from rdkit.Chem import PandasTools
from rdkit.Chem import Descriptors

filename1 = "chembl_22_chemreps_1"
fg = pd.read_csv('data/%s.txt' % filename1, sep='\t')
#fg = fg.head(5000)
# Drop two columns that are not needed for the filtering.
fg.drop(fg.columns[[2, 3]], axis=1, inplace=True)
# Build an RDKit molecule object for every SMILES string.
PandasTools.AddMoleculeColumnToFrame(fg, smilesCol='canonical_smiles')
fg['MW'] = fg['ROMol'].map(Descriptors.MolWt)
fg['Aromatic'] = fg['ROMol'].map(Descriptors.NumAromaticRings)
fg['Aliphatic'] = fg['ROMol'].map(Descriptors.NumAliphaticRings)
# Keep only acyclic molecules with 50 < MW < 1000.
fg = fg[(fg['Aromatic'] == 0) &
        (fg['Aliphatic'] == 0) &
        (fg['MW'] < 1000) &
        (fg['MW'] > 50)]

The code loads the database, converts the SMILES strings into RDKit molecule objects, and then removes molecules that contain rings or whose molecular weight is below 50 or above 1000.
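One load-time saving I have seen suggested is passing `usecols` to `read_csv`, so the unwanted columns are never parsed at all, instead of loading everything and calling `drop()` afterwards. A minimal sketch (the ChEMBL column names here are my guess based on the columns being dropped above, and an in-memory `StringIO` stands in for the real file):

```python
import io
import pandas as pd

# Stand-in for the tab-separated ChEMBL file; the real script
# would pass 'data/chembl_22_chemreps_1.txt' instead.
data = io.StringIO(
    "chembl_id\tcanonical_smiles\tstandard_inchi\tstandard_inchi_key\n"
    "CHEMBL1\tCCO\tInChI=...\tKEY1\n"
)

# usecols skips the unwanted columns during parsing, so they are
# never materialized in memory, unlike load-then-drop.
fg = pd.read_csv(data, sep='\t', usecols=['chembl_id', 'canonical_smiles'])
```

This only trims the per-row cost, but on a million-row file every avoided column helps.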

Any tips?

MrJones
    You say it works for 5000 entries? Split your input file into 5000 line chunks and loop over those. – WombatPM Nov 16 '17 at 15:16
  • It works doing it that way. Is there anything inefficient about my code, or is it just a lot of data to work with? – MrJones Nov 16 '17 at 16:48
  • May be worth trying with pandas on ray (http://ray.readthedocs.io/en/latest/pandas_on_ray.html). Made to deal with large amounts of data using pandas – JoshuaBox May 02 '18 at 12:09
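Following WombatPM's suggestion, the chunked approach can be sketched like this: `read_csv` with `chunksize` yields small DataFrames one at a time, each chunk is filtered immediately, and only the filtered survivors are concatenated. In this sketch an in-memory `StringIO` with a precomputed MW column stands in for the ChEMBL file; in the real script the RDKit descriptor calls (`Descriptors.MolWt` etc.) would go inside `filter_chunk`, and the `ROMol` column would be dropped before returning so the molecule objects are freed with each chunk:

```python
import io
import pandas as pd

# In-memory stand-in for the ChEMBL file, with MW precomputed;
# the real script would compute MW and the ring counts with
# RDKit inside filter_chunk instead.
data = io.StringIO(
    "chembl_id\tcanonical_smiles\tMW\n"
    "C1\tC\t16.04\n"
    "C2\tCCCCCCCCCC\t142.28\n"
    "C3\tCCO\t46.07\n"
)

def filter_chunk(chunk):
    # Filter each chunk down straight away, so only the small
    # set of surviving rows is ever accumulated in memory.
    return chunk[(chunk['MW'] > 50) & (chunk['MW'] < 1000)]

# chunksize makes read_csv yield DataFrames of at most N rows
# instead of loading the whole file at once; 5000 would match
# the size that already ran comfortably.
reader = pd.read_csv(data, sep='\t', chunksize=2)
fg = pd.concat((filter_chunk(c) for c in reader), ignore_index=True)
```

Peak memory is then roughly one chunk's worth of molecules plus the filtered result, rather than a million RDKit objects at once.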

0 Answers