Biopython: return chain but with the new chain ID already

Question

I have a script that extracts selected chains from a structure into a new file. I run it on 400+ structures. Because the chainIDs of my selected chains can differ between structures, I parse .yaml files where I store the corresponding chainIDs. This script works fine, but the next step is to rename the chains so that they are the same in each file. I used edited code from this link. Basically it worked as well; however, the problem is that e.g. the new chainID of chain1 is the same as the original chainID of chain2, and this error occurs: "Cannot change id from U to T. The id T is already used for a sibling of this entity." This happened for many variables, and fixing it manually would be too complicated.

My idea is that this could be solved by renaming the chainIDs at the moment I extract the chains. Is that possible with Biopython? I couldn't find anything similar to my problem. Simplified code for one structure (the original has one more loop iterating over the 400+ structures and their .yaml files):

import yaml
from Bio.PDB import MMCIFIO, MMCIFParser, Select

parser = MMCIFParser()

with open(yaml_file, "r") as file:
    proteins = yaml.load(file, Loader=yaml.FullLoader)

# just for illustration that I have to parse the original chainIDs
chain1 = proteins["1_chain"].split(",")[0]
chain2 = proteins["2_chain"].split(",")[0]

# [0] keeps only the first model of the structure
structure = parser.get_structure("xxx", "xxx.cif")[0]


# defined once, outside any loop: MMCIFIO calls accept_chain for every chain itself
class ChainSelect(Select):
    def accept_chain(self, chain):
        if chain.get_id() == '{}'.format(chain1):
            return True  # I thought that somewhere in this part could be added command renaming the chain to "A"
        if chain.get_id() == '{}'.format(chain2):
            return True  # here I'd rename it "B"
        return False


io = MMCIFIO()
io.set_structure(structure)
io.save("new.cif", ChainSelect())

Is it possible to somehow extend the "return" statement so that it returns the chain with the desired chainID (e.g. A)? Note that the original chain ID can differ between structures (thus I have to use .format(chainX)).

I don't have any other idea how to get rid of the error saying that my desired chainID is already used by a sibling entity.
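One possible way to rename at extraction time, sketched here rather than taken from the thread: copy each selected chain, rename the copy while it has no parent, and add it to a fresh model. This relies on Biopython's Entity.copy() returning a detached copy, so the sibling check in the id setter has nothing to compare against. The function names copy_chains_with_new_ids and extract_renamed, the id_map argument, and the structure name "renamed" are hypothetical and only used for illustration.

```python
def copy_chains_with_new_ids(source_model, id_map, target_model):
    """Copy the chains listed in id_map into target_model under their new IDs.

    id_map maps original chain IDs to desired ones, e.g. {"U": "A", "T": "B"}.
    """
    for old_id, new_id in id_map.items():
        chain = source_model[old_id].copy()  # the copy is detached (no parent)
        chain.id = new_id                    # no siblings yet, so no ValueError
        target_model.add(chain)
    return target_model


def extract_renamed(cif_in, cif_out, id_map):
    """Read cif_in, keep only the chains in id_map, write them renamed to cif_out."""
    # Biopython is imported here so the helper above has no import-time dependency.
    from Bio.PDB import MMCIFIO, MMCIFParser
    from Bio.PDB.Model import Model
    from Bio.PDB.Structure import Structure

    source_model = MMCIFParser(QUIET=True).get_structure("src", cif_in)[0]

    # Assumed layout: one new structure holding a single model (id 0).
    new_structure = Structure("renamed")
    new_structure.add(copy_chains_with_new_ids(source_model, id_map, Model(0)))

    io = MMCIFIO()
    io.set_structure(new_structure)
    io.save(cif_out)
```

With the variables from the question, a call might look like extract_renamed("xxx.cif", "new.cif", {chain1: "A", chain2: "B"}). Because the original model is never modified, a new ID such as "T" can safely coincide with the original ID of another chain. No Select subclass is needed, since only the wanted chains are copied.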

Would creating a new structure entity for each extracted chain and then renaming it with the desired ID be a feasible way of preventing the error? See as an example: https://stackoverflow.com/questions/11685716/how-to-extract-chains-from-a-pdb-file - pippo1980, Aug 07 '22 at 14:20
First, during the extraction I had to rename the chains with unique numbers (e.g. chain 76) that are not used anywhere else. Then I renamed each of the unique numbers to my desired chain ID, e.g. each chain 76 is now chain A. It's not a very elegant way to solve this problem, but I couldn't come up with anything better at the time I needed to do this. - HungryMolecule, Aug 07 '22 at 17:15
In the link you posted, they suggest first changing A -> a, B -> b and so on, and then going from lowercase back to uppercase depending on your needs: a -> B, b -> A. Apparently chain.id is case sensitive. - pippo1980, Aug 07 '22 at 17:41
Does Biopython accept double-digit chain.id? See https://github.com/biopython/biopython/blob/master/Bio/PDB/PDBIO.py#L353 : if len(chain_id) > 1: e = f"Chain id ('{chain_id}') exceeds PDB format limit." raise PDBIOException(e) - pippo1980, Aug 07 '22 at 17:51
I think you'll need to change PDBIO.save, see: https://github.com/biopython/biopython/blob/master/Bio/PDB/PDBIO.py#L299 - pippo1980, Aug 08 '22 at 14:34
Actually the above won't work; the error comes from here: https://github.com/biopython/biopython/blob/master/Bio/PDB/Entity.py#L163 The check is in line 175, and the following line raises the same ValueError you mentioned. - pippo1980, Aug 08 '22 at 19:45
Yeah, renaming several chains at once using Biopython can be a bit difficult, especially when the structures are big (e.g. ribosomes) and have many chains. The more chains a structure has, the higher the chance that your desired ID is already used somewhere else (and it's even more difficult if you work with multiple structures and want to do it for all of them at once). I've got used to parsing the Protein Data Bank webpage for the chain IDs (and other information) of each structure, storing them in yaml/json, and then using them for further editing of my structures, such as renaming... - HungryMolecule, Aug 10 '22 at 08:48
PDBx should handle chain.id values longer than one character. I think Chimera was able to handle them well when I was reconstructing a full-sized bacterial icosahedral virus-like particle from a crystallographic asymmetric unit. Now with cryoEM they submit the entire capsid with all the chains assigned properly thanks to the PDBx format. - pippo1980, Aug 10 '22 at 14:03
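The temporary-ID workaround described in the comments above can be sketched as a two-pass, in-place rename. This is a minimal illustration, not code from the thread: the function name rename_chains_via_temp_ids and the "tmpN" naming scheme are hypothetical. It assumes the model behaves like a Bio.PDB Model (iterable over chains, indexable by chain ID) and that assigning chain.id raises ValueError when a sibling already uses that ID.

```python
def rename_chains_via_temp_ids(model, id_map):
    """Rename chains in place, e.g. id_map = {"U": "T", "T": "U"}.

    Pass 1 parks every chain to be renamed under an unused temporary ID;
    pass 2 assigns the final IDs, so no rename ever hits a sibling's ID.
    """
    existing = {chain.id for chain in model}
    targets = list(id_map.values())
    untouched = existing - set(id_map)

    # Final IDs must be unique and must not belong to a chain that stays as is.
    if len(set(targets)) != len(targets) or untouched & set(targets):
        raise ValueError("target chain IDs would collide")

    # Pass 1: move each selected chain to a temporary ID not used anywhere.
    parked = []
    counter = 0
    for old_id, new_id in id_map.items():
        temp_id = f"tmp{counter}"
        while temp_id in existing or temp_id in targets:
            counter += 1
            temp_id = f"tmp{counter}"
        counter += 1
        chain = model[old_id]
        chain.id = temp_id
        parked.append((chain, new_id))

    # Pass 2: every final ID is free now, so the sibling check passes.
    for chain, new_id in parked:
        chain.id = new_id
```

The multi-character temporary IDs exist only in memory between the two passes. Final IDs longer than one character would still be rejected by PDBIO when saving to PDB format, as the quoted check shows, so writing with MMCIFIO is the safer choice for such IDs.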

0 Answers