Extract Columns from a Protein Data Bank (PDB) Text File

Question

I want to make a plot with Matplotlib in Python, and for that I need to read some data from a PDB (Protein Data Bank) file. I want to extract every column from the file and store each column in a separate vector. The PDB file consists of columns containing both text and floats. I'm very new to Matplotlib and have tried several of the suggested methods for extracting these columns, but nothing seems to work. What would be the best way to extract these columns? I'm going to load a lot of data at a later stage, so the method should not be too inefficient.

The PDB file looks something like this:

ATOM      1  CA  MET A   1      38.012   8.932  -1.253
ATOM      2  CA  GLU A   2      39.809   5.652  -1.702
ATOM      3  CA  ALA A   3      43.007   5.013   0.368
ATOM      4  CA  ALA A   4      41.646   7.577   2.820
ATOM      5  CA  HIS A   5      42.611   4.898   5.481
ATOM      6  CA  SER A   6      46.191   5.923   5.090
ATOM      7  CA  LYS A   7      45.664   9.815   5.134
ATOM      8  CA  SER A   8      45.898  12.022   8.181
ATOM      9  CA  THR A   9      42.528  13.075   9.570
ATOM     10  CA  GLU A  10      43.330  16.633   8.378
ATOM     11  CA  GLU A  11      44.171  15.729   4.757
ATOM     12  CA  CYS A  12      40.589  14.150   4.745
ATOM     13  CA  LEU A  13      38.984  17.314   6.105
ATOM     14  CA  ALA A  14      40.633  19.053   3.220
ATOM     15  CA  TYR A  15      39.740  16.682   0.505
ATOM     16  CA  PHE A  16      36.138  17.421   1.566
ATOM     17  CA  GLY A  17      36.536  20.854   2.826
ATOM     18  CA  VAL A  18      34.184  20.012   5.553
ATOM     19  CA  SER A  19      34.483  20.966   9.177

Looks like you'll be working with numeric data, in which case [`numpy`](http://www.numpy.org/) is the de facto module to use. That or [`pandas`](http://pandas.pydata.org/), which is built on top of `numpy`. Have a look at [`np.genfromtxt`](http://docs.scipy.org/doc/numpy/user/basics.io.genfromtxt.html), which eats such delimited files for breakfast. Also, if you mention "nothing seem to work", it's a good idea on StackOverflow to show what you have tried and what errors you get... — Oliver W., Mar 18 '15 at 22:01
There are lots of Python packages out there which already handle PDBs. Check out [BioPython](http://biopython.org/wiki/Main_Page), [OpenMM](https://simtk.org/home/openmm) or [OpenBabel](http://openbabel.org/wiki/Python). Alternatively, if you're sure that your PDBs are going to be in the correct format then you can use [the specification](http://www.rcsb.org/pdb/static.do?p=file_formats/pdb/index.html) and pick out the relevant bits of each line. — Kyle_S-C, Mar 18 '15 at 23:00
I should add that PDB files from the databank get complicated too (different chain IDs, B factors, multiple possible atom positions) and the packages listed above seem to have `numpy` support, which is the standard, as @OliverW. suggests. — Kyle_S-C, Mar 18 '15 at 23:07
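As a rough illustration of the "pick out the relevant bits of each line" idea from the comments, here is a minimal standard-library sketch. It assumes the file follows the fixed-width ATOM/HETATM column layout of the PDB format (serial in columns 7-11, atom name in 13-16, residue name in 18-20, chain in 22, residue number in 23-26, x/y/z in 31-38, 39-46, 47-54). The names `SAMPLE`, `parse_atom_line` and `read_columns` are made up for this example and come from no library.

```python
# Hypothetical helper names; the column ranges are the 1-based ATOM/HETATM
# columns of the PDB format converted to 0-based Python slices.
SAMPLE = """\
ATOM      1  CA  MET A   1      38.012   8.932  -1.253
ATOM      2  CA  GLU A   2      39.809   5.652  -1.702
ATOM      3  CA  ALA A   3      43.007   5.013   0.368
"""


def parse_atom_line(line):
    """Slice one ATOM/HETATM line into typed fields."""
    return {
        "serial": int(line[6:11]),
        "atom_name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain_id": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
    }


def read_columns(lines):
    """Return a dict mapping each field name to a list (one list per column)."""
    columns = {}
    for line in lines:
        if not line.startswith(("ATOM", "HETATM")):
            continue  # skip header, TER, END and other record types
        for key, value in parse_atom_line(line).items():
            columns.setdefault(key, []).append(value)
    return columns


columns = read_columns(SAMPLE.splitlines())
print(columns["res_name"])
print(columns["x"])
```

For a real file, something like `with open(path) as fh: columns = read_columns(fh)` should work the same way. Slicing by column position rather than splitting on whitespace matters for real PDB files, where adjacent fields can run together without a separating space.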

score 1 · Answer 1 · edited May 06 '15 at 09:51

The Protein Data Bank (PDB) file format is a textual file format describing the three-dimensional structures of molecules held in the Protein Data Bank. The PDB format accordingly provides for description and annotation of protein and nucleic acid structures, including atomic coordinates, observed sidechain rotamers, secondary structure assignments, and atomic connectivity. I found this on Google.

As for extracting the columns, you can also find the answer on Google or a wiki.

score 0 · Accepted Answer · answered Mar 19 '15 at 11:46

Going off of @Kyle_S-C's recommendation, here's a way to do it using Biopython.

First read your file into a Biopython Structure object:

import Bio.PDB
path = '/path/to/PDB/file' # your file path here
p = Bio.PDB.PDBParser()
structure = p.get_structure('myStructureName', path)

Then, for example, you can get a list of just the Atom ids like this:

ids = [a.get_id() for a in structure.get_atoms()]

See the Biopython Structural Bioinformatics FAQ for more, including the following methods for accessing the PDB columns for an Atom:

How do I extract information from an Atom object?

Using the following methods:

# a.get_name()           # atom name (spaces stripped, e.g. 'CA')
# a.get_id()             # id (equals atom name)
# a.get_coord()          # atomic coordinates
# a.get_vector()         # atomic coordinates as Vector object
# a.get_bfactor()        # isotropic B factor
# a.get_occupancy()      # occupancy
# a.get_altloc()         # alternative location specifier
# a.get_sigatm()         # std. dev. of atomic parameters
# a.get_siguij()         # std. dev. of anisotropic B factor
# a.get_anisou()         # anisotropic B factor
# a.get_fullname()       # atom name (with spaces, e.g. '.CA.')
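To connect this back to the original question (one vector per column), the accessors above can be gathered into NumPy arrays. The following is only a sketch under a few assumptions: Biopython and NumPy are installed, the file parses cleanly with `PDBParser`, and the helper names `atoms_to_columns` and `load_columns` are invented for this example rather than being part of Biopython.

```python
import numpy as np
from Bio.PDB import PDBParser


def atoms_to_columns(structure):
    """Collect per-atom fields from a Biopython Structure into column arrays."""
    serials, names, res_names, chains, res_seqs, coords = [], [], [], [], [], []
    for atom in structure.get_atoms():
        residue = atom.get_parent()   # Atom -> Residue
        chain = residue.get_parent()  # Residue -> Chain
        serials.append(atom.get_serial_number())
        names.append(atom.get_name())
        res_names.append(residue.get_resname())
        chains.append(chain.get_id())
        # Residue id is a tuple: (hetero flag, sequence number, insertion code)
        res_seqs.append(residue.get_id()[1])
        coords.append(atom.get_coord())
    return {
        "serial": np.array(serials),
        "name": np.array(names),
        "res_name": np.array(res_names),
        "chain": np.array(chains),
        "res_seq": np.array(res_seqs),
        "xyz": np.array(coords),  # shape (n_atoms, 3) when atoms are present
    }


def load_columns(path):
    """Parse a PDB file and return its atom columns (hypothetical helper)."""
    parser = PDBParser(QUIET=True)  # QUIET suppresses construction warnings
    structure = parser.get_structure("myStructureName", path)
    return atoms_to_columns(structure)
```

With that in place, `cols = load_columns(path)` gives arrays such as `cols["xyz"][:, 0]` for the x coordinates, which can be passed directly to Matplotlib.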

score 0 · Answer 3 · answered Nov 05 '20 at 02:42

This tutorial might help: https://py-packman.readthedocs.io/en/latest/tutorials/molecule.html#tutorials-molecule

from packman import molecule

Protein = molecule.load_structure('/path/to/PDB/file.pdb')
#molecule.download_structure('1prw','1prw.pdb') if you want to download PDB file 1prw.pdb


for i in Protein[0].get_atoms():
    #Iterating over atom objects (parent= residue)
    print(i.get_name(), i.get_id(), i.get_location(), i.get_parent().get_name())

The loop above shows how to get the name of each atom (i.e. i.get_name()), its id (i.e. i.get_id()), and so on.

It is possible to extract all the components of the PDB file. Please read the PACKMAN documentation for the details.

Disclosure: Author of the package py-packman
