
5

0001 -417.031

C 1.04168, -0.05620, -0.07148 1.041682, -0.056200, -0.071481

H 2.15109, -0.05620, -0.07150 2.130894, -0.056202, -0.071496

H 0.67187, 0.17923, -1.09059 0.678598, 0.174941, -1.072044

H 0.67188, 0.70866, 0.64196 0.678613, 0.694746, 0.628980

H 0.67188, -1.05649, 0.23421 0.678614, -1.038285, 0.228641

8

0002 -711.117

C 0.99571, 0.01149, -0.09922 0.995914, 0.011511, -0.099221

C 2.51489, 0.01148, -0.09922 2.514686, 0.011466, -0.099226

H 0.61911, 0.74910, -0.83887 0.597259, 0.729877, -0.819596

H 0.61911, 0.28325, 0.90938 0.597259, 0.276170, 0.883106

H 0.61909, -0.99785, -0.36818 0.597278, -0.971531, -0.361167

H 2.89151, 1.02083, 0.16973 2.913322, 0.994509, 0.162719

H 2.89149, -0.26027, -1.10783 2.913341, -0.253192, -1.081553

H 2.89149, -0.72612, 0.64042 2.913341, -0.706900, 0.621148

These two data points are from the chemical database GDB-13, and I am trying to understand what the numbers represent. I know that 5 and 8 are atomic numbers, 0001 and 0002 are atom IDs, and -417.031 and -711.117 are atomization energies. However, I don't quite understand what the numbers below them mean. I am fairly sure they represent the geometry in three-dimensional space, but if so, why are there six numbers on each line? How should I read those six numbers?

I am also trying to convert the data to the BoB (bag of bonds) representation. Is there a way to do that other than hard coding it? I am using R; is R able to do that?

1 Answer


Have a look at the original paper in Int. J. Quantum Chem., 2015, 115, 1058-1073 (DOI).

The Extended XYZ format is explained in Fig. 7 of the article.

The first line of each entry gives the number of atoms k (5 and 8 here, a count rather than an atomic number), while the second line consists of an identifier for the molecule and its energy of atomization.

The next k lines contain two sets of Cartesian coordinates (in Ångström). The left block contains the x, y, z coordinates from a force-field calculation (UFF), while the coordinates on the right stem from a DFT calculation.
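To make the layout described above concrete, here is a minimal Python sketch of a reader for this format. It is not taken from the paper; the names `Molecule` and `parse_entries` are my own, and it assumes that the coordinates are separated by commas and/or whitespace exactly as in the sample from the question.

```python
from dataclasses import dataclass


@dataclass
class Molecule:
    ident: str      # identifier from the second line of the entry
    energy: float   # atomization energy from the second line
    symbols: list   # element symbols, one per atom
    uff: list       # left coordinate block, (x, y, z) tuples
    dft: list       # right coordinate block, (x, y, z) tuples


def parse_entries(text):
    """Parse concatenated entries; blank lines are ignored."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    molecules = []
    i = 0
    while i < len(lines):
        k = int(lines[i])                      # number of atoms
        ident, energy = lines[i + 1].split()   # identifier and energy
        symbols, uff, dft = [], [], []
        for ln in lines[i + 2 : i + 2 + k]:
            # Commas are treated as plain separators.
            fields = ln.replace(",", " ").split()
            values = [float(v) for v in fields[1:7]]
            symbols.append(fields[0])
            uff.append(tuple(values[:3]))      # left block
            dft.append(tuple(values[3:]))      # right block
        molecules.append(Molecule(ident, float(energy), symbols, uff, dft))
        i += 2 + k
    return molecules


SAMPLE = """5
0001 -417.031
C 1.04168, -0.05620, -0.07148 1.041682, -0.056200, -0.071481
H 2.15109, -0.05620, -0.07150 2.130894, -0.056202, -0.071496
H 0.67187, 0.17923, -1.09059 0.678598, 0.174941, -1.072044
H 0.67188, 0.70866, 0.64196 0.678613, 0.694746, 0.628980
H 0.67188, -1.05649, 0.23421 0.678614, -1.038285, 0.228641
"""

mols = parse_entries(SAMPLE)
print(mols[0].ident, mols[0].energy, mols[0].symbols)
```

The same split-on-whitespace logic should carry over to other languages; nothing here depends on a chemistry library.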

A common tool for reading and converting coordinate files in various formats is Open Babel. Have a look at the accompanying paper in J. Cheminformatics, 2011, 3:33 (DOI)

There are various language bindings for Open Babel, and apparently there is one for R too. Have a look.

I just ran a quick test on the first entry in the data from the supplement of the paper by Matthias Rupp using Open Babel 2.3.2:
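As for building a bag-of-bonds representation without hard coding it, the following is a rough Python sketch of how I understand the idea: Coulomb-type terms Z_i Z_j / r_ij are grouped into one "bag" per pair of element types and sorted within each bag. The function name `bag_of_bonds` and the small `ATOMIC_NUMBER` table are my own assumptions, distances are converted to Bohr on the assumption that atomic units are wanted, and the zero-padding to a fixed length across the whole data set as well as any single-atom terms are left out.

```python
import math
from collections import defaultdict
from itertools import combinations

# Assumption: only these elements occur; extend the table as needed.
ATOMIC_NUMBER = {"H": 1, "C": 6, "N": 7, "O": 8, "S": 16}
ANGSTROM_TO_BOHR = 1.0 / 0.529177  # assumption: atomic units are wanted


def bag_of_bonds(symbols, coords):
    """Group Z_i * Z_j / r_ij by element pair, sorted descending per bag."""
    bags = defaultdict(list)
    for i, j in combinations(range(len(symbols)), 2):
        r = math.dist(coords[i], coords[j]) * ANGSTROM_TO_BOHR
        key = tuple(sorted((symbols[i], symbols[j])))
        value = ATOMIC_NUMBER[symbols[i]] * ATOMIC_NUMBER[symbols[j]] / r
        bags[key].append(value)
    return {key: sorted(vals, reverse=True) for key, vals in bags.items()}


# Right-hand coordinate block of entry 0001 from the question.
symbols = ["C", "H", "H", "H", "H"]
coords = [
    (1.041682, -0.056200, -0.071481),
    (2.130894, -0.056202, -0.071496),
    (0.678598, 0.174941, -1.072044),
    (0.678613, 0.694746, 0.628980),
    (0.678614, -1.038285, 0.228641),
]

bags = bag_of_bonds(symbols, coords)
for key, vals in sorted(bags.items()):
    print(key, [round(v, 3) for v in vals])
```

For this five-atom entry, that gives one bag with the four C-H terms and one with the six H-H terms. To use the bags as a feature vector, each bag would still have to be padded with zeros to the largest size found in the data set and the bags concatenated in a fixed order.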

obabel -ixyz c1.xyz -oxyz -O c1a.xyz

Apparently, only the left coordinate block is read in! If you suspect that the coordinates from the UFF and DFT calculations differ significantly, you will probably have to extract the right block yourself. However, given that the file format is documented, this should not be a major problem.
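If the right-hand block is needed, one option is to rewrite it as a plain XYZ file that Open Babel can read. The Python sketch below does that for a single entry and also reports the largest per-atom displacement between the two blocks. The helper names (`split_blocks`, `to_xyz`, `max_displacement`) are hypothetical, and the displacement is computed without any alignment, so it is only meaningful if both geometries share the same frame, as they appear to in the sample.

```python
import math


def split_blocks(atom_lines):
    """Return (symbols, left, right) from atom lines with six coordinates."""
    symbols, left, right = [], [], []
    for ln in atom_lines:
        fields = ln.replace(",", " ").split()
        xyz = [float(v) for v in fields[1:7]]
        symbols.append(fields[0])
        left.append(xyz[:3])    # left block (UFF according to the answer)
        right.append(xyz[3:])   # right block (DFT according to the answer)
    return symbols, left, right


def to_xyz(symbols, coords, comment=""):
    """Format one coordinate block as standard XYZ text."""
    rows = [f"{s} {x:.6f} {y:.6f} {z:.6f}" for s, (x, y, z) in zip(symbols, coords)]
    return "\n".join([str(len(symbols)), comment] + rows) + "\n"


def max_displacement(a, b):
    """Largest distance between corresponding atoms, no alignment applied."""
    return max(math.dist(p, q) for p, q in zip(a, b))


atom_lines = [
    "C 1.04168, -0.05620, -0.07148 1.041682, -0.056200, -0.071481",
    "H 2.15109, -0.05620, -0.07150 2.130894, -0.056202, -0.071496",
    "H 0.67187, 0.17923, -1.09059 0.678598, 0.174941, -1.072044",
    "H 0.67188, 0.70866, 0.64196 0.678613, 0.694746, 0.628980",
    "H 0.67188, -1.05649, 0.23421 0.678614, -1.038285, 0.228641",
]

symbols, left, right = split_blocks(atom_lines)
xyz_text = to_xyz(symbols, right, comment="0001 -417.031")
print(xyz_text)
print("max displacement (Angstrom):", max_displacement(left, right))
```

The text in `xyz_text` can be saved to a file and passed to obabel in the same way as above.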


If you don't mind a remark, the title of your question is somewhat misleading. The data in question is only remotely related to GDB-13. To my knowledge, the GDB files from Jean-Louis Reymond do not contain any coordinates. They are large collections of SMILES strings, from which coordinates would have to be generated for each entry.

Klaus-Dieter Warzecha