I have a PDB file '1abz' (https://files.rcsb.org/view/1ABZ.pdb), containing the coordinates of a protein structure with 23 different models (numbered MODEL 1-23). Please ignore the header remarks, the interesting information starts at line 276 which says 'MODEL 1'.
I'd like to calculate the average structure of a protein. The protein's PDB file contains multiple conformations/models and I'd like to calculate the average coordinates for the individual atoms per residue, such that I end up with one conformation/model.
I couldn't figure out how to do it using Biopython so I tried to calculate the average coordinates using Pandas. I think I have managed to calculate the average but the problem now is that I have a csv file which is no longer in PDB format, so I can't load this file into PyMol.
My questions are, how can I convert my csv file to PDB format. Better still, how can I get the average coordinates in Biopython or in Python without compromising the original pdb file format?
Here is the code that I used to calculate the average coordinates in pandas.
#I first converted my pdb file to a csv file
import pandas as pd
import re
pdbfile = '1abz.pdb'
df = pd.DataFrame(columns=['Model','Residue','Seq','Atom','x','y','z']) #make dataframe object
i = 0 #counter
b = re.compile("MODEL\s+(\d+)")
regex1 = "([A-Z]+)\s+(\d+)\s+([^\s]+)\s+([A-Z]+)[+-]?\s+([A-Z]|)"
regex2 = "\s+(\d+)\s+([+-]?\d+\.\d+\s+[+-]?\d+\.\d+\s+[+-]?\d+\.\d+)"
reg = re.compile(regex1+regex2)
with open(pdbfile) as f:
columns = ('label', 'ident', 'atomName', 'residue', 'chain', 'sequence', 'x', 'y', 'z', 'occ', 'temp', 'element')
data = []
for line in f:
n = b.match(line)
if n:
modelNum = n.group(1)
m = reg.match(line)
if m:
d = dict(zip(columns, line.split()))
d['model'] = int(modelNum)
data.append(d)
df = pd.DataFrame(data)
df.to_csv(pdbfile[:-3]+'csv', header=True, sep='\t', mode='w')
#Then I calculated the average coordinates
df = pd.read_csv('1abz.csv', delim_whitespace = True, usecols = [0,5,7,8,10,11,12])
df1 = df.groupby(['atomName','residue','sequence'],as_index=False)['x','y','z'].mean()
df1.to_csv('avg_coord.csv', header=True, sep='\t', mode='w')