I am trying to search through a large chemical database (ChEMBL, >1,000,000 entries) and am having problems running my code on my work computer. Our focus is chemicals, so high-end computers are not available.
My code is below. It runs quickly on a small subset (for example, the first 5,000 entries), but on the full dataset my 4 GB of RAM fills up and the computer hangs. Is there a way to complete this task more efficiently?
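import pandas as pd
from rdkit.Chem import PandasTools
from rdkit.Chem import Descriptors
filename1 = "chembl_22_chemreps_1"
fg = pd.read_csv('data/%s.txt' % filename1, sep='\t')
# fg = fg.head(5000)  # uncomment to test on the first 5,000 entries
# Drop the third and fourth columns, which are not needed for the filter
fg.drop(fg.columns[[2, 3]], axis=1, inplace=True)
# Build an RDKit molecule (column 'ROMol') from each SMILES string
PandasTools.AddMoleculeColumnToFrame(fg, smilesCol='canonical_smiles')
# Compute the descriptors used by the filter
fg['MW'] = fg['ROMol'].map(Descriptors.MolWt)
fg['Aromatic'] = fg['ROMol'].map(Descriptors.NumAromaticRings)
fg['Aliphatic'] = fg['ROMol'].map(Descriptors.NumAliphaticRings)
# Keep only ring-free molecules with 50 < MW < 1000
fg = fg[(fg['Aromatic'] == 0) &
        (fg['Aliphatic'] == 0) &
        (fg['MW'] < 1000) &
        (fg['MW'] > 50)]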
The code loads the database, converts each SMILES string to an RDKit molecule object, and then removes every molecule that contains a ring (aromatic or aliphatic) or whose molecular weight is below 50 or above 1000.
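One possible direction, given that the memory is probably consumed by holding a million rows and their molecule objects at once, is to stream the file row by row and write only the rows that pass the filter. The following is a minimal sketch under assumptions, not a tested solution for the full ChEMBL file: the helper names `passes_filter` and `stream_filter` and the output path are hypothetical, it uses the standard-library `csv` module instead of pandas, and it treats SMILES strings that RDKit cannot parse as rejected.

```python
import csv


def passes_filter(smiles):
    """Return True for ring-free molecules with 50 < MW < 1000."""
    # RDKit is imported here so the streaming helper below can also be
    # used with a different predicate.
    from rdkit import Chem
    from rdkit.Chem import Descriptors

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # assumption: unparsable SMILES are simply skipped
        return False
    # Same criteria as the DataFrame filter above.
    if Descriptors.NumAromaticRings(mol) != 0:
        return False
    if Descriptors.NumAliphaticRings(mol) != 0:
        return False
    return 50 < Descriptors.MolWt(mol) < 1000


def stream_filter(in_path, out_path, keep=passes_filter,
                  smiles_col="canonical_smiles"):
    """Copy the rows whose SMILES satisfy `keep` from in_path to out_path.

    Only one row (and one molecule object) is held in memory at a time.
    Returns the number of rows written.
    """
    kept = 0
    with open(in_path, newline="") as src, \
            open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src, delimiter="\t")
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames,
                                delimiter="\t")
        writer.writeheader()
        for row in reader:
            if keep(row[smiles_col]):
                writer.writerow(row)
                kept += 1
    return kept
```

It could be called as `stream_filter("data/chembl_22_chemreps_1.txt", "data/filtered.txt")`, where the output file name is made up for illustration. The filtered file should be much smaller and could then be loaded with `pd.read_csv` as before.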
Any tips?