I need to extract specific chains from PDB files (sometimes more than one chain). The question "How to extract chains from a PDB file?" asks the same thing, and its accepted answer addresses my problem, but that code does not work in Python 3: it gives one error after another. Does anybody know how I can get this working in Python 3?

Or is there any other code that solves the same kind of problem?

Thank you in advance.

import os
from Bio import PDB


class ChainSplitter:
    def __init__(self, out_dir=None):
        """ Create parsing and writing objects, specify output directory. """
        self.parser = PDB.PDBParser()
        self.writer = PDB.PDBIO()
        if out_dir is None:
            out_dir = os.path.join(os.getcwd(), "chain_PDBs")
        self.out_dir = out_dir

    def make_pdb(self, pdb_path, chain_letters, overwrite=False, struct=None):
        """ Create a new PDB file containing only the specified chains.

        Returns the path to the created file.

        :param pdb_path: full path to the crystal structure
        :param chain_letters: iterable of chain characters (case insensitive)
        :param overwrite: write over the output file if it exists
        """
        chain_letters = [chain.upper() for chain in chain_letters]

        # Input/output files
        (pdb_dir, pdb_fn) = os.path.split(pdb_path)
        pdb_id = pdb_fn[3:7]
        out_name = "pdb%s_%s.ent" % (pdb_id, "".join(chain_letters))
        out_path = os.path.join(self.out_dir, out_name)
        print("OUT PATH:", out_path)
        plural = "s" if (len(chain_letters) > 1) else ""  # for printing

        # Skip PDB generation if the file already exists
        if (not overwrite) and (os.path.isfile(out_path)):
            print("Chain%s %s of '%s' already extracted to '%s'." %
                    (plural, ", ".join(chain_letters), pdb_id, out_name))
            return out_path

        print("Extracting chain%s %s from %s..." % (plural,
                ", ".join(chain_letters),  pdb_fn))

        # Get structure, write new file with only given chains
        if struct is None:
            struct = self.parser.get_structure(pdb_id, pdb_path)
        self.writer.set_structure(struct)
        self.writer.save(out_path, select=SelectChains(chain_letters))

        return out_path


class SelectChains(PDB.Select):
    """ Only accept the specified chains when saving. """
    def __init__(self, chain_letters):
        self.chain_letters = chain_letters

    def accept_chain(self, chain):
        return (chain.get_id() in self.chain_letters)


if __name__ == "__main__":
    # Parse PDB IDs and desired chains from a text file, then create new PDB structures.
    import sys
    if len(sys.argv) != 2:
        print("Usage: $ python %s 'pdb.txt'" % __file__)
        sys.exit()

    pdb_textfn = sys.argv[1]

    pdbList = PDB.PDBList()
    splitter = ChainSplitter("/home/patrick/Desktop/chain_splitting")

    with open(pdb_textfn) as pdb_textfile:
        for line in pdb_textfile:
            pdb_id = line[:4].lower()
            chain = line[4]
            pdb_fn = pdbList.retrieve_pdb_file(pdb_id)
            splitter.make_pdb(pdb_fn, chain)
Maximilian Peters
  • Can you add the code which gave you the errors? Preferably a [MCVE]. – Maximilian Peters Aug 18 '19 at 17:01
  • @MaximilianPeters I have added the full script. Actually, I'm getting errors in the Biopython module, not in this script. I'm kind of new to programming. –  Aug 19 '19 at 06:33
  • First I was getting this error: File "/home/patrick/anaconda3/lib/python3.7/site-packages/Bio/PDB/PDBParser.py", line 167, in _parse_coordinates resseq = int(line[22:26].split()[0]) # sequence identifier ValueError: invalid literal for int() with base 10: 'LYS'. I changed int to str, but then I got another error. –  Aug 19 '19 at 06:34
  • This was the second error: File "/home/sagara/anaconda3/lib/python3.7/site-packages/Bio/PDB/PDBParser.py", line 186, in _parse_coordinates % global_line_counter) Bio.PDB.PDBExceptions.PDBConstructionException: Invalid or missing coordinate(s) at line 1280. I'm confused about what to do. –  Aug 19 '19 at 06:36
  • Can you add the PDB file as well? It looks like a format issue in the file, not the code. – Maximilian Peters Aug 19 '19 at 06:37
  • 1A2K.pdb is the one I was using. I tried two more PDB files, but I still get the same error at a different line of the PDB file. Do I have to change anything in Biopython? In the original question, the author updated PDBList.py in Biopython, but doing that leads to many more errors. Thank you in advance. –  Aug 19 '19 at 06:40

1 Answer

retrieve_pdb_file has the optional parameter file_format. When it is not specified, Biopython downloads the entry in mmCIF format (a .cif file), whereas PDBParser expects a PDB-format file.

You can change the line to

pdb_fn = pdbList.retrieve_pdb_file(pdb_id, file_format='pdb')

so that you get a PDB file, and the rest of the code runs through.
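
One way to confirm which format was actually downloaded, before handing the file to PDBParser, is to look at its first meaningful line. The sketch below uses only the standard library. The helper name `looks_like_mmcif` is hypothetical, and the check is a heuristic: it assumes an mmCIF file opens with a `data_` block header, while a PDB file starts with fixed-width records such as HEADER or ATOM.

```python
def looks_like_mmcif(path):
    """Heuristic: return True if the file appears to be mmCIF rather than PDB.

    Assumption: the first non-blank, non-comment line of an mmCIF file is a
    data block header beginning with "data_".
    """
    with open(path) as handle:
        for line in handle:
            stripped = line.strip()
            if not stripped or stripped.startswith("#"):
                continue  # skip blank lines and CIF comments
            return stripped.lower().startswith("data_")
    return False  # empty file: nothing suggests mmCIF
```

If this returns True for the path that retrieve_pdb_file gave back, the file is probably not in the format PDBParser expects, which would fit the errors quoted in the comments.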

Maximilian Peters
  • Thank you very much, but now I'm getting this error: File "Chain_splitter.py", line 75, in <module> chain = line[4] IndexError: string index out of range. Chain_splitter.py is the file in which I saved the script. –  Aug 19 '19 at 07:13
  • I don't know how your script is intended to work, but I created a txt file with the PDB ID followed by the chain, i.e. `1A2KE`. – Maximilian Peters Aug 19 '19 at 07:15
  • I'm doing it the same way. I even tried a few other PDBs, but I'm getting the same error. Is it working for you? –  Aug 19 '19 at 07:24
  • How are you calling the script? – Maximilian Peters Aug 19 '19 at 07:29
  • I call it with "python Chain_splitter.py pdb.txt", and inside pdb.txt I have saved the PDB as 1A2KE. –  Aug 19 '19 at 07:32
  • It was some issue with the text file. I changed it and it worked. Thanks a lot for your help. –  Aug 19 '19 at 07:35
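
The IndexError from `line[4]` is what a blank or too-short line in the list file would produce, since such a line has no fifth character. A hedged sketch of a more forgiving reader follows. The function name `read_pdb_chain_list` is hypothetical, and it assumes each useful line holds a four-character PDB ID immediately followed by one or more chain letters, as in `1A2KE`.

```python
def read_pdb_chain_list(lines):
    """Yield (pdb_id, chain_letters) pairs from lines such as '1A2KE'.

    Lines that are blank or shorter than five characters are skipped
    instead of raising IndexError.
    """
    for raw in lines:
        entry = raw.strip()  # drop the trailing newline and stray spaces
        if len(entry) < 5:
            continue  # blank line, or an ID with no chain letter
        # First four characters: PDB ID; everything after: chain letters.
        yield entry[:4].lower(), entry[4:].upper()
```

Because make_pdb accepts any iterable of chain characters, a pair such as `("1a2k", "AB")` could be passed on as `splitter.make_pdb(pdb_fn, "AB")` to extract more than one chain from the same entry.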