Extract multiple chains from the CIF file (based on specified selections) and save them as one PDB

Question

I have several CIF files, for each of them I'd like to extract 3 chains and save the three chains in one PDB file. Problem could be that some CIF structures can contain different chain-IDs of my desired chains: I solved it the way that I parsed RCSB webpage with Beautiful Soup and save the chain-IDs for each CIF file in corresponding yaml file. Then I open each CIF file as well as its yaml file, then I get correct chainIDs for each CIF from its yaml file.

Now I'd like to extract all three chains and save them. I tried to use something from similar questions this. However, the last line of my code (with io.save) returns error: IndexError: tuple index out of range

code:

cif_input = "AAA","BBB","CCC"
for ID in cif_input:
    cif_file = "{}.cif" .format(ID) #these are my CIF files
    
    yaml_file_protein = "{}.yaml" .format(ID) #these are yaml files for each of the CIF files, they contain information about chain IDs
    with open(yaml_file_protein, "r") as file_protein:
        proteins = yaml.load(file_protein, Loader=yaml.FullLoader) #loading the yaml file
        chain_1 = proteins["chain1"] #here I get information what ID has chain no.1, result in AAA.cif is for example: U
        chain_2 = proteins["chain2"] #here I get information what ID has chain no.2
        chain_3 = proteins["chain3"] #here I get information what ID has chain no.3
                
        structure = parser.get_structure("{}" .format(ID), "{}.cif" .format(ID))[0] #parsing corresponding structure (CIF file)
        
        #this is what I tried implement from similar questions: I need to extract all three chains and save them in one PDB, this have to be done for all CIF files
        class ChainSelect(Select):
            def accept_chain(self, chain):
                if chain.get_id()=='{}'.format(chain_1): #this is what I should select chainID of chain_1 based on what is specified above from the yaml file
                    return True
                if chain.get_id()=='{}'.format(chain_2):
                    return True
                if chain.get_id()=='{}'.format(chain_3):
                    return True
                else:
                    return False
        
        io = PDBIO()
        io.set_structure(structure)
        io.save("{}_selection.pdb" .format(ID), ChainSelect(chain))

I assume the problem will be somewhere in the "class ChainSelect..." but can't figure out what exactly could be causing such error. I'd be very helpful for any suggestions.

Edit: I also thought the problem could be in lines if chain.get_id()=='{}'.format(chain_X):, because it'd return e.g. U and not "U", thus I tried to edit it as if chain.get_id()==" '{}' ".format(chain_X): which return "U" , however the error is still the here. I tried this code with only one chain as well, but still the same error...

Your imports are missing, are you importing Select from.biopython ? — pippo1980, Aug 01 '22 at 14:32
Discard my comment above, see here https://biopython.org/wiki/The_Biopython_Structural_Bioinformatics_FAQ — pippo1980, Aug 01 '22 at 14:51
chain.get_id()=='{}'.format(chain_1) are the two things(!!!!) The same type() ???? — pippo1980, Aug 01 '22 at 14:58
ok think I got it ChainSelect(chain)) is wrong ChainSelect(Select)) subclassing BIO.Select() doesnt take any parameters unless you change its init method too. what does using: "io.save("{}_selection.pdb" .format(ID), ChainSelect())" result into ? — pippo1980, Aug 01 '22 at 15:32

pippo1980 · Accepted Answer · 2022-08-01T15:53:40.037

this code of mine works, reads '2ms2.pdb' , pdb that has Chains A,B,C

and saves "prova.pdb" with only Chains A,C. Without headers


from Bio.PDB import PDBParser, PDBIO, Select



parser=PDBParser(QUIET=True)


structure_1 = parser.get_structure('test', '2ms2.pdb')



for model in structure_1:
    
    print(model)
    
    for chain in model:
        
        print(chain, chain.get_id(), type(chain.get_id()))
        
        class ChainSelect(Select):
                    def accept_chain(self, chain):
                        if chain.get_id() == 'A':
                            return True
                    
                        if chain.get_id() == 'C':
                            return True
                        else:
                            return False
        
io = PDBIO()
io.set_structure(structure_1)
io.save("prova.pdb" , ChainSelect())

not sure where best place to put

class ChainSelect(Select):
            def accept_chain(self, chain):
                if chain.get_id() == 'A':
                    return True
            
                if chain.get_id() == 'C':
                    return True
                else:
                    return False

in this code though

Instruction here: https://biopython.org/wiki/The_Biopython_Structural_Bioinformatics_FAQ

Can I write PDB files?

Use the PDBIO class for this. It’s easy to write out specific parts of a structure too, of course.

Example: saving a structure

io = PDBIO()
io.set_structure(s)
io.save("out.pdb")

If you want to write out a part of the structure, make use of the Select class (also in PDBIO). Select has four methods:

accept_model(model)
accept_chain(chain)
accept_residue(residue)
accept_atom(atom)

By default, every method returns 1 (which means the model/chain/residue/atom is included in the output). By subclassing Select and returning 0 when appropriate you can exclude models, chains, etc. from the output. Cumbersome maybe, but very powerful. The following code only writes out glycine residues:

class GlySelect(Select):
    def accept_residue(self, residue):
        if residue.get_name() == "GLY":
            return 1
        else:
            return 0


io = PDBIO()
io.set_structure(s)
io.save("gly_only.pdb", GlySelect())

If this is all too complicated for you, the Dice module contains a handy extract function that writes out all residues in a chain between a start and end residue.

So, it seems that the problem was that I skipped the looping through model and chain (`for model in structure: ...for chain in model: ` ), when I added this into my code (as you did here), then the class ChainSelect() worked — HungryMolecule, Aug 01 '22 at 18:15
Maybe I shouldn't use `ChainSelect(chain) ` (in my last line) as well. Using just `ChainSelect() ` worked. Thanks for help. — HungryMolecule, Aug 01 '22 at 18:21
"looping through model and chain" is not needed, it was for me, dont use biopython often and PDB even less, just to remember how the parser works. have a look here: https://biopython.org/docs/1.76/api/Bio.PDB.PDBIO.html and how PDBIO.save works https://biopython.org/docs/1.76/api/Bio.PDB.PDBIO.html#Bio.PDB.PDBIO.PDBIO.save — pippo1980, Aug 01 '22 at 19:52
here if you want tofigure out more,its not easy to me to understand the logic but: https://github.com/biopython/biopython/blob/master/Bio/PDB/PDBIO.py Class Select is at row34 https://github.com/biopython/biopython/blob/master/Bio/PDB/PDBIO.py#L34 — pippo1980, Aug 01 '22 at 20:22
See here: https://www.rcsb.org/docs/general-help/structures-without-legacy-pdb-format-files :PDB entries without PDB-format files Several type of PDB entries are not offered in the legacy PDB format anymore: 1 Entries containing multiple character chain ids 2 Entries containing > 62 chains 3 Entries containing > 99999 ATOM coordinates 4 Entries that have complex beta sheet topology, see more details — pippo1980, Aug 08 '22 at 04:09

Extract multiple chains from the CIF file (based on specified selections) and save them as one PDB

1 Answers1