How to calculate the average structure of a protein with multiple models/conformations

Question

I have a PDB file '1abz' (https://files.rcsb.org/view/1ABZ.pdb), containing the coordinates of a protein structure with 23 different models (numbered MODEL 1-23). Please ignore the header remarks, the interesting information starts at line 276 which says 'MODEL 1'.

I'd like to calculate the average structure of a protein. The protein's PDB file contains multiple conformations/models and I'd like to calculate the average coordinates for the individual atoms per residue, such that I end up with one conformation/model.

I couldn't figure out how to do it using Biopython so I tried to calculate the average coordinates using Pandas. I think I have managed to calculate the average but the problem now is that I have a csv file which is no longer in PDB format, so I can't load this file into PyMol.

My questions are, how can I convert my csv file to PDB format. Better still, how can I get the average coordinates in Biopython or in Python without compromising the original pdb file format?

Here is the code that I used to calculate the average coordinates in pandas.

#I first converted my pdb file to a csv file

import pandas as pd
import re

pdbfile = '1abz.pdb'
df = pd.DataFrame(columns=['Model','Residue','Seq','Atom','x','y','z']) #make dataframe object
i = 0 #counter

b = re.compile("MODEL\s+(\d+)")
regex1 = "([A-Z]+)\s+(\d+)\s+([^\s]+)\s+([A-Z]+)[+-]?\s+([A-Z]|)"
regex2 = "\s+(\d+)\s+([+-]?\d+\.\d+\s+[+-]?\d+\.\d+\s+[+-]?\d+\.\d+)"
reg = re.compile(regex1+regex2)

with open(pdbfile) as f:
    columns = ('label', 'ident', 'atomName', 'residue', 'chain', 'sequence', 'x', 'y', 'z', 'occ', 'temp', 'element')
    data = []
    for line in f:
        n = b.match(line)
        if n:
            modelNum = n.group(1)

        m = reg.match(line)
        if m:
            d = dict(zip(columns, line.split()))
            d['model'] = int(modelNum)
            data.append(d)

df = pd.DataFrame(data)
df.to_csv(pdbfile[:-3]+'csv', header=True, sep='\t', mode='w')

#Then I calculated the average coordinates

df = pd.read_csv('1abz.csv', delim_whitespace = True, usecols = [0,5,7,8,10,11,12]) 
df1 = df.groupby(['atomName','residue','sequence'],as_index=False)['x','y','z'].mean()
df1.to_csv('avg_coord.csv', header=True, sep='\t', mode='w')

also provide us table sample in the question (no one likes to go to external links) without noisy lines — Azat Ibrakov, Mar 19 '18 at 05:06

rodgdor · Accepted Answer · 2018-03-19T23:14:19.983

This is certainly doable in biopython. Let me help you with an example ignoring the HETRES in the pdb file:

First parse the pdb file with all your models:

import Bio.PDB
import numpy as np

parser = Bio.PDB.PDBParser(QUIET=True)  # Don't show me warnings
structure = parser.get_structure('1abz', '1abz.pdb')  # id of pdb file and location

Okay, now that we have the content of the file, and supposing you have the same atoms in all your models, get a list with a unique identifier for each atom (e.g: chain + residue pos + atom name):

atoms = [a.parent.parent.id + '-' + str(a.parent.id[1]) + '-' +  a.name for a in structure[0].get_atoms() if a.parent.id[0] == ' ']  # obtained from model '0'

Note that I am ignoring hetero residues with a.parent.id[0] == ' '. Now let's get the average for each atom:

atom_avgs = {}
for atom in atoms:
    atom_avgs[atom] = []
    for model in structure:
        atom_ = atom.split('-')
        coor = model[atom_[0]][int(atom_[1])][atom_[2]].coord
        atom_avgs[atom].append(coor)
    atom_avgs[atom] = sum(atom_avgs[atom]) / len(atom_avgs[atom])  # average

Now let's create the new pdb using the one model of the structure:

ns = Bio.PDB.StructureBuilder.Structure('id=1baz')  #  new structure
ns.add(structure[0])  # add model 0
for atom in ns[0].get_atoms():
    chain = atom.parent.parent
    res = atom.parent
    if res.id[0] != ' ':
        chain.detach_child(res)  # detach hetres
    else:
        coor = atom_avgs[chain.id + '-' + str(res.id[1]) + '-' + atom.name]
        atom.coord = coor

Now let's write the pdb

io = Bio.PDB.PDBIO()
io.set_structure(ns)
io.save('new_1abz.pdb')

A million thanks for your help AND for taking the time to explain your code! Just a very minor error that I noticed. I think it should be `structure = parser.get_structure('1abz', '1abz.pdb')`. I got a "TypeError: 'PDBParser' object is not callable" when I ran your code with `structure = parser('1abz', '1abz.pdb')`. In any case, thanks once again for your help! It's very much appreciated! — Cave, Mar 19 '18 at 22:43
Indeed you are right (I edited the code), otherwise the code worked? — rodgdor, Mar 19 '18 at 23:14

How to calculate the average structure of a protein with multiple models/conformations

1 Answers1