
I am trying to parse PDB files.

Say, a PDB file has the following data:

ATOM     33  N  ATHR A   2       4.935 -11.632  15.046  0.74  2.95           N
ATOM     34  N  BTHR A   2       5.078 -11.406  15.180  0.31  2.78           N  
ATOM     35  CA ATHR A   2       5.757 -11.521  13.850  0.81  3.02           C  
ATOM     36  CA BTHR A   2       5.773 -11.153  13.921  0.20  2.67           C  
ATOM     37  C  ATHR A   2       7.070 -10.839  14.210  0.74  2.82           C  
ATOM     38  C  BTHR A   2       7.155 -10.559  14.193  0.29  1.80           C  
ATOM     39  O  ATHR A   2       7.152  -9.941  15.050  0.80  3.31           O  
ATOM     40  O  BTHR A   2       7.214  -9.641  15.012  0.25  2.41           O  
ATOM     41  CB ATHR A   2       4.976 -10.693  12.813  0.87  5.53           C  
ATOM     42  CB BTHR A   2       4.896 -10.354  12.941  0.25 12.07           C  
ATOM     43  OG1ATHR A   2       4.611  -9.432  13.388  1.00  6.88           O  
ATOM     44  OG1BTHR A   2       3.743 -11.083  12.501  0.25  9.57           O  
ATOM     45  CG2ATHR A   2       3.858 -11.584  12.293  0.75 10.03           C  
ATOM     46  CG2BTHR A   2       5.683  -9.885  11.726  0.27  5.90           C  
ATOM     47  H  ATHR A   2       4.547 -10.814  15.527  0.75  3.44           H  
ATOM     48  H  BTHR A   2       5.510 -10.211  15.754  0.25  2.90           H  
ATOM     49  HA ATHR A   2       5.962 -12.339  13.548  0.75  3.32           H  
ATOM     50  HA BTHR A   2       4.036  -9.929  13.477  0.25  2.86           H  
ATOM     51  HB ATHR A   2       5.648 -10.589  11.938  0.75  5.43           H  
ATOM     52  HB BTHR A   2       4.644  -9.326  13.574  0.25  5.67           H  
ATOM     53  HG1ATHR A   2       5.030  -9.344  14.216  0.75  8.74           H  
ATOM     54  HG1BTHR A   2       3.236 -11.198  13.399  0.25 10.21           H  
ATOM     55 HG21ATHR A   2       4.096 -12.441  11.924  0.75 10.92           H  
ATOM     56 HG21BTHR A   2       6.542  -9.278  12.024  0.25  9.66           H  
ATOM     57 HG22ATHR A   2       3.222 -10.974  11.650  0.75 10.92           H  
ATOM     58 HG22BTHR A   2       5.039  -9.142  11.179  0.25  9.66           H  
ATOM     59 HG23ATHR A   2       3.163 -11.738  13.200  0.75 10.92           H  
ATOM     60 HG23BTHR A   2       5.904 -10.639  11.169  0.25  9.66           H  

We see that there are many alternative atoms in the 2nd residue.

How should I choose the atoms?

Should I just randomly choose one atom from the alternatives? How can my parser learn to distinguish between two such atoms (say, 33 and 34)? From the parsing point of view, I see no indication that they are related.

How can I parse alternative atom information in a PDB file?

user366312
  • If you do not get an answer here quickly, try posting also on [Bioinformatics Stack Exchange](https://bioinformatics.stackexchange.com/) or [Biostars](https://www.biostars.org/). – Timur Shtatland Jun 27 '22 at 21:24
  • You can either use a library for parsing PDB files (biopython, BioJava, cctbx, gemmi, …) or read the format specification. – marcin Jun 27 '22 at 21:41

2 Answers


The column after the x, y, z coordinates is the occupancy: 0.74 for atom 33 and 0.31 for atom 34. It is the fraction of the molecules in the crystal that adopt that conformation. If the atom is found at that position in all of them, the value is 1.00.
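
To make the link between records such as 33 and 34 explicit, a parser can slice the fixed-width columns of each ATOM record: the alternate location indicator is the single character in column 17, directly before the residue name, and the occupancy sits in columns 55-60. Below is a minimal sketch in plain Python, assuming well-formed fixed-width records. The names `parse_atom_line` and `pick_highest_occupancy` are hypothetical helpers, and keeping the highest-occupancy copy of each atom is only one possible policy, not the only valid one.

```python
from collections import namedtuple

Atom = namedtuple(
    "Atom", "serial name altloc resname chain resseq x y z occupancy"
)

def parse_atom_line(line):
    """Slice one fixed-width ATOM/HETATM record (hypothetical helper)."""
    return Atom(
        serial=int(line[6:11]),        # columns 7-11
        name=line[12:16].strip(),      # columns 13-16
        altloc=line[16].strip(),       # column 17, "" if no alternative
        resname=line[17:20].strip(),   # columns 18-20
        chain=line[21],                # column 22
        resseq=int(line[22:26]),       # columns 23-26
        x=float(line[30:38]),          # columns 31-38
        y=float(line[38:46]),          # columns 39-46
        z=float(line[46:54]),          # columns 47-54
        occupancy=float(line[54:60]),  # columns 55-60
    )

def pick_highest_occupancy(atoms):
    """Keep one copy per (chain, residue, atom name): the most occupied."""
    best = {}
    for atom in atoms:
        key = (atom.chain, atom.resseq, atom.resname, atom.name)
        if key not in best or atom.occupancy > best[key].occupancy:
            best[key] = atom
    return list(best.values())

# First four records from the question.
SAMPLE = [
    "ATOM     33  N  ATHR A   2       4.935 -11.632  15.046  0.74  2.95           N",
    "ATOM     34  N  BTHR A   2       5.078 -11.406  15.180  0.31  2.78           N",
    "ATOM     35  CA ATHR A   2       5.757 -11.521  13.850  0.81  3.02           C",
    "ATOM     36  CA BTHR A   2       5.773 -11.153  13.921  0.20  2.67           C",
]

chosen = pick_highest_occupancy(parse_atom_line(l) for l in SAMPLE)
for atom in chosen:
    print(atom.serial, atom.name, atom.altloc, atom.occupancy)
```

Atoms 33 and 34 end up under the same key because they share chain, residue number, residue name and atom name and differ only in the altloc character. For these four records the sketch keeps atoms 33 and 35, the A copies.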

Edit:

I use cctbx in Python. I have to apologize: it was literally working two days ago, but when I went to write you an example the package was not recognized, and I have been having issues getting it to work again. I'm using Windows and may just move to Linux because of this. cctbx has filters with which you can separate those sets. The A, B, etc. in front of the residue name is the altloc identifier.

The basic workflow is: open the file, isolate the hierarchy, extract the atoms, create a truth table (boolean selection) with a search string, and select on that truth table.

I've not done it explicitly, but you should be able to search by altloc reference: http://cci.lbl.gov/docs/cctbx/doc_models_hierarchy/

from __future__ import absolute_import, division, print_function
from iotbx.data_manager import DataManager
import pandas as pd

dm = DataManager()                      # initialize the DataManager and call it dm
dm.set_overwrite(True)                  # tell the DataManager to overwrite files with the same name
model_filename = "./files/1vir.pdb"     # name of the model file
m = dm.get_model(model_filename)        # model object with the model info
pdb_hierarchy = m.get_hierarchy()       # get the hierarchy object
pdb_atoms = pdb_hierarchy.atoms()       # get the atoms
sites_cart = m.get_sites_cart()         # get the Cartesian coordinates of the atoms
sel_cache = pdb_hierarchy.atom_selection_cache()
# this next line may not be exactly right but is along these lines, see reference
altloc_a_sel = sel_cache.selection("altloc A")   # extract truth table
protein = sites_cart.select(altloc_a_sel)        # coordinates of the selected atoms
protein = pd.DataFrame(list(protein), columns=['x', 'y', 'z'])

# other examples that I know work
# you can use other search parameters to set to isolate other components like solvents, ligands, etc.
c_alpha_sel_non_protein = sel_cache.selection("hetero") # isolate all non protein atom (ligand, H2O, AcOH...) XXX not case sensitive!
c_alpha_sel_proteinplus_not_ligand = sel_cache.selection("not resname 2ow") # XXX not case sensitive!
c_alpha_sel_ligand = sel_cache.selection("resname 2ow") # extract ligand truth table object

Others have written their own Python versions. I found these just by browsing and have not used them:

selaltloc on Github

prody

If I get my packages up again soon, I'll try to put a proper example together, if you have not gotten it working by then.

  • Thanks for the response, could you elaborate with an example plz? – user366312 Sep 13 '22 at 17:52
  • Could you add python highlighting (**```python**) to your code? The automatic determination did not work. – Vovin Sep 18 '22 at 12:12
  • Please see my new response. Prody works better because it's easier to install. Even if cctbx installs correctly, it may not work. My version worked from an installer from the gov website which no longer contains the installer. – Social Idiot Sep 18 '22 at 19:39

Since my cctbx went fubar and the installation file is no longer available, I moved to ProDy, which does very much the same thing and installs super easily (pip install ProDy). An example of selection with ProDy is below:

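from prody import parsePDB, writePDB

pdb_file = './files/1zir.pdb'
pdb = parsePDB(pdb_file)
# '_' matches atoms with a blank altloc, 'A' matches the A conformers
altloc = pdb.select('protein and altloc _ A')
print(repr(altloc))
if altloc is not None:  # select() returns None when nothing matches
    writePDB('./files/_altlocs.pdb', altloc)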

If you just want the 'A' altloc, you would simply use .select('altloc A'). The '_' selects the atoms with an empty altloc, and protein selects the whole protein and excludes ligands. This is nice because you can also select things like 'ligand and not water' or 'resname 2OW'.
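
If ProDy is not available, the altloc part of that selection can be approximated on the raw text, since the indicator is always the character in column 17 of an ATOM or HETATM record. The following is only a rough sketch: `keep_altloc` is a hypothetical helper, the third sample record (atom 61) is made up for illustration, and unlike the ProDy selection it does not apply the protein filter.

```python
def keep_altloc(lines, wanted="A"):
    """Keep coordinate records whose altloc is blank or equals `wanted`.

    Lines that are not ATOM/HETATM records are passed through unchanged.
    """
    kept = []
    for line in lines:
        if line.startswith(("ATOM", "HETATM")):
            # Column 17 (index 16) holds the alternate location indicator.
            altloc = line[16] if len(line) > 16 else " "
            if altloc not in (" ", wanted):
                continue
        kept.append(line)
    return kept

RECORDS = [
    "ATOM     33  N  ATHR A   2       4.935 -11.632  15.046  0.74  2.95           N",
    "ATOM     34  N  BTHR A   2       5.078 -11.406  15.180  0.31  2.78           N",
    # Made-up record without an altloc, for illustration only.
    "ATOM     61  N   GLY A   3       8.000  -9.000  13.000  1.00  2.00           N",
    "END",
]

filtered = keep_altloc(RECORDS, wanted="A")
print("\n".join(filtered))
```

Passing `wanted="B"` instead would keep atom 34 and drop atom 33, while the record without an altloc survives either way.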

Hope this helps. I'm not sure what your exact goal is. The resulting files are then super easy to manipulate with pandas, which is what I'm doing.