Where does BioPython store information related to various chemical molecules?

Question

If we reconstruct a protein from a PDB file, is it enough to have a PDB file, or do we need more info external to the PDB?

Take, for example, the BioPython framework. If any info is needed external to the PDB files, where does this framework store them?

Can I open and check to see those files?

The biopython PDB importer has a related publication: Hamelryck, T., Manderick, B. (2003) PDB parser and structure class implemented in Python. Bioinformatics 19: 2308–2310. Full text is available for free here: https://www.researchgate.net/figure/UML-Fowler-1999-diagram-of-the-SMCRA-data-structure-used-to-represent-the-atomic-data_fig1_8996610 — Ghoti, Nov 23 '22 at 15:27
From a brief read of the paper, the parser does not reference any external files. It just reads a structured (PDB) file based on an expected framework. The main shortcoming is that the modules only deal with the atomic data, and not with the information in the PDB header (which contains, e.g., information on refinement, space group, protein, etc.). Presumably these drawbacks are better addressed by the new standard protein structure file format, mmCIF. — Ghoti, Nov 23 '22 at 15:31
@Ghoti, I want to read PDB files and create an abstract data type of protein that mimics real-world protein, excluding H-bonds. — user366312, Nov 23 '22 at 15:50
Cross-posted on [bioinfo SE](https://bioinformatics.stackexchange.com/questions/20065/where-does-biopython-store-information-related-to-various-chemical-molecules) and [biostars](https://www.biostars.org/p/9546177/) — Ram RS, Nov 23 '22 at 16:40

pippo1980 · Answer 1 · 2023-08-28T17:05:32.113

About Biopython: since all entities are objects, you can store new properties or values by adding them to the object itself. I believe those properties won't be saved by PDBIO/MMCIFIO, but for data persistence you could save the object with pickle.

That said, if you have a new value to add to an atom, just use:

atom.new_property = value

I still need to check whether this can be done after the object is instantiated (after loading a PDB file with PDBParser, at runtime, or when creating an atom before adding it to a residue).

[Not sure, since I am a newbie to Python, but new_property could be described as an instance attribute:

Adding attributes to a Python class is very straightforward: you just use the '.' operator after an instance of the class with whatever arbitrary name you want the attribute to be called, followed by its value.

Copied from Python — Dynamic Class Attributes

Class: classes are essentially templates for creating your objects.]
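A minimal sketch of the idea above. The `Atom` class here is a hypothetical stand-in, not Biopython's own class, so the example runs without any structure file. It shows that an attribute can be attached after the object exists and that it survives a pickle round trip. Whether Biopython's entity classes behave identically is an assumption worth checking; in general, a Python class that defines `__slots__` would reject new attributes.

```python
import pickle


class Atom:
    """Hypothetical stand-in for a parsed atom (not Bio.PDB.Atom)."""

    def __init__(self, name, element, coord):
        self.name = name
        self.element = element
        self.coord = coord


atom = Atom("CA", "C", (1.0, 2.0, 3.0))

# Attach a new attribute after the object has been created.
atom.new_property = 0.42

# Instance attributes live in the object's __dict__, so pickle keeps them.
restored = pickle.loads(pickle.dumps(atom))
print(restored.new_property)
print(sorted(vars(restored)))
```

The pickled bytes could equally be written to a file with `pickle.dump` and read back later with `pickle.load`.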

pippo1980 · Answer 2 · 2022-11-23T22:36:41.767

If we reconstruct a protein from a PDB file, is it enough to have a PDB file, or do we need more info external to the PDB?

The missing-hydrogens problem and incomplete protein sequences come to mind, but I could be missing more.

Copied from https://www.umass.edu/microbio/chime/pe_beta/pe/protexpl/help_hyd.htm

Hydrogens in PDB files.

X-ray crystallography cannot resolve hydrogen atoms in most protein crystals, so in most PDB files, hydrogen atoms are absent. Sometimes hydrogens are added by modeling. Hydrogens are always present in PDB files resulting from NMR analysis, and usually present in theoretical models. For a brief introduction to X-ray crystallography, resolution, and NMR, see Nature of 3D Structural Data.

In proteins, the average number of hydrogens per non-hydrogen atom, weighted to take into account the frequencies of amino acids, is 1.01. Thus, hydrogens are ~50% of all atoms in protein. Nucleic acids have fewer, ~35%. High resolution protein crystallography (1.2 Å or less) can assign some hydrogen positions from the electron density map. Thus, the X-ray model of a tyrosine kinase SH2 domain 1lkk at 1.0 Å resolution contains 902 hydrogens and 923 non-hydrogen protein atoms (ratio 0.98, 49%), so approximately all of the hydrogens actually present are assigned positions.

NMR methods also determine some hydrogen positions. Typically all hydrogens are modeled in before the molecule is folded to fit the NMR interatomic distance restraints; hence, all hydrogens are usually present in NMR models submitted to the PDB. The calmodulin ensemble of 25 NMR models 1cfc contains 1096 protein hydrogens and 1166 non-hydrogen protein atoms per model (ratio 0.94, 48.5%), thereby assigning positions for approximately all of the hydrogens actually present.

Most macromolecular crystals do not provide enough resolution to detect hydrogen positions. The X-ray model in PDB file 1hho for oxyhemoglobin (2.1 Å resolution) contains no hydrogens, while the X-ray file 1lfa (1.8 Å resolution; an integrin adhesion protein domain) contains 312 waters each with 2 hydrogens, and 645 protein hydrogens for 2,941 non-hydrogen protein atoms, accounting for only 22% of the hydrogens actually present in this protein. The protein hydrogens consist of one hydrogen on each backbone nitrogen (three hydrogens per amino-terminal nitrogen), and hydrogens on sidechain oxygens or nitrogens in Ser, Thr, Tyr, Lys, Arg, His, Asn, and Gln. (None of the hydrogens covalently bonded to carbons are present.) The hydrogens which are present are required for the molecular dynamics stages of refinement of the X-ray model in the popular crystallographic refinement program X-PLOR; some authors strip them out before submitting a PDB file and others leave them in. (The Protein Data Bank accepts X-ray models either way, according to the preference of the depositor.)
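The percentages quoted above can be reproduced with a few lines of arithmetic. This is only a sketch: the function names are made up for illustration, the atom counts are the ones given in the text, and "coverage" assumes the average of 1.01 hydrogens per non-hydrogen atom applies to each protein.

```python
H_PER_HEAVY_ATOM = 1.01  # average quoted above for proteins


def hydrogen_fraction(n_h, n_heavy):
    """Fraction of all atoms that are hydrogens."""
    return n_h / (n_h + n_heavy)


def coverage(n_h, n_heavy):
    """Share of the expected hydrogens that have coordinates."""
    return n_h / (H_PER_HEAVY_ATOM * n_heavy)


# (hydrogens, non-hydrogen protein atoms) as quoted in the text
examples = {"1lkk": (902, 923), "1cfc": (1096, 1166), "1lfa": (645, 2941)}

# 1.01 hydrogens per heavy atom means roughly half of all atoms are hydrogens.
print(f"expected H fraction: {hydrogen_fraction(H_PER_HEAVY_ATOM, 1.0):.1%}")

for pdb_id, (n_h, n_heavy) in examples.items():
    print(
        f"{pdb_id}: ratio={n_h / n_heavy:.2f}, "
        f"H fraction={hydrogen_fraction(n_h, n_heavy):.1%}, "
        f"coverage={coverage(n_h, n_heavy):.0%}"
    )
```

For 1lfa the coverage comes out at about 22%, which matches the figure in the text.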

Adding Hydrogens

If you wish to add hydrogens to a PDB file, see methods .........

References & Acknowledgements

Average protein hydrogens per non-hydrogen protein atom, weighted by average frequencies of amino acids, are based on 1,021 unrelated proteins of known sequence. Weights are tabulated on page 5 in Thomas E. Creighton's book "Proteins, Structures and Molecular Properties", 2nd ed. 1993, W. H. Freeman and Co.

Thanks to John Badger for contributing important information included in this document.
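To check whether a particular PDB file contains hydrogens at all, one option is to count atoms by element. The sketch below uses only the standard library and reads the element symbol from columns 77-78 of ATOM/HETATM records. The helper `atom_line` and the sample coordinates are invented for illustration, and the code assumes the element column is filled in, which may not hold for older files.

```python
def count_hydrogens(pdb_text):
    """Return (hydrogens, non_hydrogens) from ATOM/HETATM records."""
    n_h = n_heavy = 0
    for line in pdb_text.splitlines():
        if not line.startswith(("ATOM", "HETATM")):
            continue
        # Element symbol: columns 77-78 (0-based slice 76:78).
        element = line[76:78].strip().upper()
        if element in ("H", "D"):  # deuterium counted as hydrogen
            n_h += 1
        else:
            n_heavy += 1
    return n_h, n_heavy


def atom_line(serial, name, resname, chain, resseq, xyz, element):
    """Build one fixed-column ATOM record (illustrative helper)."""
    x, y, z = xyz
    return (
        f"ATOM  {serial:5d} {name:<4s} {resname:>3s} {chain}{resseq:4d}    "
        f"{x:8.3f}{y:8.3f}{z:8.3f}{1.0:6.2f}{0.0:6.2f}"
        + " " * 10
        + f"{element:>2s}"
    )


# Made-up coordinates for a single alanine fragment.
sample = "\n".join(
    [
        atom_line(1, " N", "ALA", "A", 1, (0.000, 0.000, 0.000), "N"),
        atom_line(2, " H", "ALA", "A", 1, (0.500, 0.800, 0.000), "H"),
        atom_line(3, " CA", "ALA", "A", 1, (1.458, 0.000, 0.000), "C"),
        atom_line(4, " HA", "ALA", "A", 1, (1.800, 1.000, 0.000), "H"),
    ]
)

print(count_hydrogens(sample))
```

For a real file, the text could be read with `open(path).read()` and passed to `count_hydrogens`; a result with zero hydrogens suggests they need to be added by modeling before any analysis that depends on them.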

For the missing parts of a protein, there are programs that can fill them in by comparing your protein to a similar, complete one (for example https://salilab.org/modeller/wiki/Missing_residues), and others that try to model missing loops (http://opig.stats.ox.ac.uk/webapps/fread/php/tutorial.php). I have no idea whether https://alphafold.ebi.ac.uk/ could be used in the same way, or whether you could simply take its predicted structure.

Better explained here:

https://pdb101.rcsb.org/learn/guide-to-understanding-pdb-data/missing-coordinates-and-biological-assemblies
