I need to extract single chains from a structure file in mmCIF format, as available from the PDB. I've read several related questions, such as this and this. The proposed solution works well if the chain ID is an integer or a single character. If applied to a structure such as 6KMW to extract chain aA, it raises the error TypeError: %c requires int or char. The full code used to reproduce the error, and its output, are included below.

from Bio.PDB import PDBList, PDBIO, FastMMCIFParser, Select

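class ChainSelect(Select):
    """Select filter that accepts only the chain with the given ID."""

    def __init__(self, chain):
        self.chain = chain

    def accept_chain(self, chain):
        # Keep a chain only if its ID matches the requested one.
        return chain.get_id() == self.chain
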
pdbl = PDBList()
io = PDBIO()
parser = FastMMCIFParser(QUIET=True)

pdbl.retrieve_pdb_file('6kmw', pdir='.', file_format='mmCif')
structure = parser.get_structure('6kmw', '6kmw.cif')
io.set_structure(structure)
io.save('6kmw_aA.pdb', ChainSelect('aA'))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-5-095b98a12800> in <module>
     18 structure = parser.get_structure('6kmw', '6kmw.cif')
     19 io.set_structure(structure)
---> 20 io.save('6kmw_aA.pdb', ChainSelect('aA'))

~/miniconda3/envs/lab2/lib/python3.8/site-packages/Bio/PDB/PDBIO.py in save(self, file, select, write_end, preserve_atom_numbering)
    368                                     )
    369 
--> 370                             s = get_atom_line(
    371                                 atom,
    372                                 hetfield,

~/miniconda3/envs/lab2/lib/python3.8/site-packages/Bio/PDB/PDBIO.py in _get_atom_line(self, atom, hetfield, segid, atom_number, resname, resseq, icode, chain_id, charge)
    227                 charge,
    228             )
--> 229             return _ATOM_FORMAT_STRING % args
    230 
    231         else:

TypeError: %c requires int or char

Is anyone aware of Biopython functionality that achieves this? Preferably something that doesn't rely on parsing the entire file with custom functions.

saiden

1 Answer

I think what you are trying to achieve is simply impossible. Effectively, you want to convert an mmCIF file to a PDB file; reducing the structure to a single chain in the process does not change that. The PDB format dates from the last century (I know how widely used it still is today). It is column-oriented and allows only one character for the chain ID. This is why you cannot download a PDB file for 6KMW. See the tooltip at https://www.rcsb.org/structure/6KMW: "PDB format files are not available for large structures". In your case, "large" means a structure with so many chains that their IDs need two characters.
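To see why the writer fails, here is a minimal sketch in plain Python (no Biopython needed). It assumes, as the traceback suggests, that the chain ID is the field formatted with %c; the helper name fits_pdb_chain_column is made up for illustration.

```python
# A PDB ATOM record has a one-character column for the chain ID, so a writer
# that formats that field with "%c" cannot accept a two-character ID.
def fits_pdb_chain_column(chain_id):
    """Return True if chain_id can be formatted as a single-character %c field."""
    try:
        "%c" % chain_id
    except TypeError:
        # Raised for any string that is not exactly one character long.
        return False
    return True


print(fits_pdb_chain_column("A"))   # True
print(fits_pdb_chain_column("aA"))  # False
```

A check like this could be run on a chain ID before deciding whether the PDB format is an option at all.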

You cannot store a two-character chain name in a PDB file. You have two options:

  • Rename the chain "aA" to a single character and save the file in PDB format
  • Don't use the PDB format as your file format; stick to mmCIF

This snippet renames the chain and stores the structure as a PDB file:

[...]
io.set_structure(structure)
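for model in structure:
    for chain in model:
        # Move an existing chain "A" out of the way so that ID becomes free.
        if chain.get_id() == "A":
            chain.id = "_"
            print("renamed chain A to _")
        # Give the target chain a one-character ID that fits the PDB format.
        if chain.get_id() == "aA":
            chain.id = "A"
            print("renamed chain aA to A")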

io.save('6kmw_aA.pdb', ChainSelect('A'))
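
If you would rather not hard-code the replacement ID, the idea could be generalised roughly as below. This is only a sketch in plain Python: pick_free_chain_id and PDB_CHAIN_ALPHABET are hypothetical names, not part of Biopython, and the choice of letters and digits as candidates is an assumption.

```python
import string

# Candidate one-character IDs: upper case, lower case, then digits (an assumption).
PDB_CHAIN_ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits


def pick_free_chain_id(used_ids, preferred="A"):
    """Return a one-character chain ID that does not collide with used_ids."""
    used = set(used_ids)
    # Use the preferred ID if it is a single character and still free.
    if len(preferred) == 1 and preferred not in used:
        return preferred
    # Otherwise fall back to the first unused candidate.
    for candidate in PDB_CHAIN_ALPHABET:
        if candidate not in used:
            return candidate
    raise ValueError("no free single-character chain ID left")


print(pick_free_chain_id(["aA", "aB", "bA"]))  # A
print(pick_free_chain_id(["A", "B", "aA"]))    # C
```

With Biopython, used_ids could be collected as [c.get_id() for c in model] and the result assigned to chain.id before saving. Picking an unused ID matters because Biopython may refuse to assign an ID that a sibling chain already holds.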

This snippet stores only chain 'aA' in mmCIF format:

from Bio.PDB.mmcifio import MMCIFIO

io = MMCIFIO()

io.set_structure(structure)
io.save("6kmw_aA.cif", ChainSelect('aA'))
Lydia van Dyke
  • Please let me add: whatever project you are currently working on, I wish you the best of luck and much success. Greetings from a former biochemist. :) – Lydia van Dyke Sep 23 '20 at 21:27
  • Thank you very much for the answer, now the issue is a little bit clearer to me. To your knowledge, is there any way to extract a chain with Biopython from a `cif` file and save it without ever using the `pdb` file format, keeping the original chain ID? I've also tried to import the cif file directly and subset the chain with `chain = model['aA']`, but it still raises the error. P.S. The issue with chain A being non-existent arises because many of these new cryoEM structures have double or even triple characters for all chain IDs, typically AAA or aA. – saiden Sep 24 '20 at 10:54
  • @saiden: please see my second code snippet that I added to the reply. – Lydia van Dyke Sep 24 '20 at 17:16
  • Thanks again. Your snippet works, but unfortunately the written file is different from the original (some atom columns are missing), which makes it unsuitable for `mkdssp`. – saiden Sep 25 '20 at 14:03
  • Which columns are missing? Sounds like a valid follow-up question on SO... Maybe it is worth a bug-report to Biopython? – Lydia van Dyke Sep 25 '20 at 20:32
  • I posted the follow-up here: https://bioinformatics.stackexchange.com/questions/14431/using-dssp-after-chain-extraction, because I thought it was more appropriate. I don't know if it's really a bug or an error on my part. – saiden Sep 26 '20 at 10:06
  • As your question got 3 upvotes there (including one from me) I would assume it is the appropriate location :) – Lydia van Dyke Sep 26 '20 at 18:12
  • Finally opened an issue on GitHub: https://github.com/biopython/biopython/issues/3438 – saiden Dec 09 '20 at 10:02