PySpark, read multiline file (.sdf)

Question

What is the most efficient way to read a collection of SDF files? SDF is a chemical table file format that contains both 3D information about molecules and properties of those molecules. All of this information is stored in a multiline (gzipped) ASCII file. What I am struggling with is defining a custom file reader function that can interpret the custom subsection of each molecular entry. At this point I doubt whether this is even the right approach.

<Molecular-ID>
  -OEChem-10272110393D
 Schrodinger Suite 2021-1.
 32 34  0     0  0  0  0  0  0999 V2000
   31.1383   33.3647   21.1400 C   0  0  0  0  0  0  0  0  0  0  0  0
   30.7977   33.9390   19.9173 C   0  0  0  0  0  0  0  0  0  0  0  0
....
M  END
> <ShapeTanimoto>
0.6969

> <ColorTanimoto>
0.7854

> <TanimotoCombo>
1.7854

$$$$

score 0 · Answer 1 · answered Feb 05 '22 at 12:47

In my opinion, the most 'efficient' way is to use someone else's code, that is, an existing library.

The CDK can read SDF files, and collections thereof. https://cdk.github.io/

The Rosetta Wiki gives examples of calling the CDK from Python. https://ctr.fandom.com/wiki/Chemistry_Toolkit_Rosetta_Wiki
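If a toolkit is not an option, the record layout in the question (a molblock ending in `M  END`, then `> <Tag>` data items, then a `$$$$` terminator) can be split by hand. The following is only a minimal sketch under that assumption, not a full SDF parser: the function names `split_records`, `parse_record` and `read_sdf_gz`, and the `SAMPLE` data, are made up for illustration, and property values are kept as strings.

```python
import gzip
from typing import Dict, Iterator, List

RECORD_DELIMITER = "$$$$"


def split_records(text: str) -> Iterator[str]:
    """Yield the raw text of each molecule, split on lines equal to $$$$."""
    current: List[str] = []
    for line in text.splitlines():
        if line.strip() == RECORD_DELIMITER:
            if any(part.strip() for part in current):
                yield "\n".join(current)
            current = []
        else:
            current.append(line)
    # Tolerate a last record that is missing its $$$$ terminator.
    if any(part.strip() for part in current):
        yield "\n".join(current)


def parse_record(record: str) -> Dict[str, object]:
    """Split one record into title, molblock text and a dict of data items."""
    lines = record.splitlines()
    title = lines[0].strip() if lines else ""
    # The molblock ends at the "M  END" line; everything after it is data items.
    end = next((i for i, l in enumerate(lines) if l.startswith("M  END")),
               len(lines) - 1)
    properties: Dict[str, str] = {}
    tag = None
    values: List[str] = []
    for line in lines[end + 1:]:
        if line.startswith(">"):
            # Data header such as "> <ShapeTanimoto>": take the text in <...>.
            start, stop = line.find("<"), line.rfind(">")
            tag = line[start + 1:stop] if 0 <= start < stop else None
            values = []
        elif not line.strip():
            # A blank line closes the current data item.
            if tag is not None:
                properties[tag] = "\n".join(values)
            tag = None
        elif tag is not None:
            values.append(line)
    if tag is not None:
        properties[tag] = "\n".join(values)
    return {
        "title": title,
        "molblock": "\n".join(lines[:end + 1]),
        "properties": properties,
    }


def read_sdf_gz(path: str) -> List[Dict[str, object]]:
    """Read one gzipped SDF file and parse every record in it."""
    with gzip.open(path, "rt", encoding="ascii", errors="replace") as handle:
        return [parse_record(r) for r in split_records(handle.read())]


# Invented two-record sample with empty molblocks, for demonstration only.
SAMPLE = "\n".join([
    "mol-1", "  -OEChem-10272110393D", "",
    "  0  0  0     0  0  0  0  0  0999 V2000", "M  END",
    "> <ShapeTanimoto>", "0.6969", "",
    "> <ColorTanimoto>", "0.7854", "",
    "$$$$",
    "mol-2", "  -OEChem-10272110393D", "",
    "  0  0  0     0  0  0  0  0  0999 V2000", "M  END",
    "> <TanimotoCombo>", "1.7854", "",
    "$$$$", "",
])

if __name__ == "__main__":
    for parsed in map(parse_record, split_records(SAMPLE)):
        print(parsed["title"], parsed["properties"])
```

On the Spark side, one option that may fit is to read each file whole (for example with `sc.wholeTextFiles`) and flat-map `split_records` and `parse_record` over the contents, or to set the Hadoop option `textinputformat.record.delimiter` to `$$$$` followed by a newline so that each RDD element is one record. Either way, gzipped files are not splittable, so each file is handled by a single task.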
