4

I would like to extract chains from pdb files. I have a file named pdb.txt which contains pdb IDs as shown below. The first four characters represent PDB IDs and last character is the chain IDs.

1B68A 
1BZ4B
4FUTA

I would like to 1) read the file line by line 2) download the atomic coordinates of each chain from the corresponding PDB files.
3) save the output to a folder.

I used the following script to extract chains. But this code prints only A chains from pdb files.

for i in 1B68 1BZ4 4FUT
do 
wget -c "http://www.pdb.org/pdb/download/downloadFile.do?fileFormat=pdb&compression=NO&structureId="$i -O $i.pdb
grep  ATOM $i.pdb | grep 'A' > $i\_A.pdb
done
David Cain
  • 16,484
  • 14
  • 65
  • 75
user1545114
  • 75
  • 1
  • 1
  • 4

4 Answers4

11

The following BioPython code should suit your needs well.

It uses PDB.Select to only select the desired chains (in your case, one chain) and PDBIO() to create a structure containing just the chain.

import os
from Bio import PDB


class ChainSplitter:
    def __init__(self, out_dir=None):
        """ Create parsing and writing objects, specify output directory. """
        self.parser = PDB.PDBParser()
        self.writer = PDB.PDBIO()
        if out_dir is None:
            out_dir = os.path.join(os.getcwd(), "chain_PDBs")
        self.out_dir = out_dir

    def make_pdb(self, pdb_path, chain_letters, overwrite=False, struct=None):
        """ Create a new PDB file containing only the specified chains.

        Returns the path to the created file.

        :param pdb_path: full path to the crystal structure
        :param chain_letters: iterable of chain characters (case insensitive)
        :param overwrite: write over the output file if it exists
        """
        chain_letters = [chain.upper() for chain in chain_letters]

        # Input/output files
        (pdb_dir, pdb_fn) = os.path.split(pdb_path)
        pdb_id = pdb_fn[3:7]
        out_name = "pdb%s_%s.ent" % (pdb_id, "".join(chain_letters))
        out_path = os.path.join(self.out_dir, out_name)
        print "OUT PATH:",out_path
        plural = "s" if (len(chain_letters) > 1) else ""  # for printing

        # Skip PDB generation if the file already exists
        if (not overwrite) and (os.path.isfile(out_path)):
            print("Chain%s %s of '%s' already extracted to '%s'." %
                    (plural, ", ".join(chain_letters), pdb_id, out_name))
            return out_path

        print("Extracting chain%s %s from %s..." % (plural,
                ", ".join(chain_letters), pdb_fn))

        # Get structure, write new file with only given chains
        if struct is None:
            struct = self.parser.get_structure(pdb_id, pdb_path)
        self.writer.set_structure(struct)
        self.writer.save(out_path, select=SelectChains(chain_letters))

        return out_path


class SelectChains(PDB.Select):
    """ Only accept the specified chains when saving. """
    def __init__(self, chain_letters):
        self.chain_letters = chain_letters

    def accept_chain(self, chain):
        return (chain.get_id() in self.chain_letters)


if __name__ == "__main__":
    """ Parses PDB id's desired chains, and creates new PDB structures. """
    import sys
    if not len(sys.argv) == 2:
        print "Usage: $ python %s 'pdb.txt'" % __file__
        sys.exit()

    pdb_textfn = sys.argv[1]

    pdbList = PDB.PDBList()
    splitter = ChainSplitter("/home/steve/chain_pdbs")  # Change me.

    with open(pdb_textfn) as pdb_textfile:
        for line in pdb_textfile:
            pdb_id = line[:4].lower()
            chain = line[4]
            pdb_fn = pdbList.retrieve_pdb_file(pdb_id)
            splitter.make_pdb(pdb_fn, chain)

One final note: don't write your own parser for PDB files. The format specification is ugly (really ugly), and the amount of faulty PDB files out there is staggering. Use a tool like BioPython that will handle parsing for you!

Furthermore, instead of using wget, you should use tools that interact with the PDB database for you. They take FTP connection limitations into account, the changing nature of the PDB database, and more. I should know - I updated Bio.PDBList to account for changes in the database. =)

David Cain
  • 16,484
  • 14
  • 65
  • 75
  • Thank you very much for your code and explanation. But I don't know, how to run your code? Could you help me? – user1545114 Jul 27 '12 at 22:50
  • Sure thing. I added a method that's specifically suited to your goals. Change the line in the main body to the directory you want your files to go to. [Install Python](http://www.python.org/getit/) (you probably already have it, try `$ which python`), and install [BioPython](http://biopython.org/wiki/Download#Installation_Instructions). Save the above file with a `.py` extension (e.g. `extract.py`), then run `$python extract.py pdb.txt`. That's it! – David Cain Jul 28 '12 at 03:42
  • If you're doing more bioinformatics-related things, I highly suggest you learn Python. It's very popular in the field ([BioPython](http://biopython.org) and [PyMOL](http://pymol.org), are two good examples), and it's a good general language. The [Python docs](http://docs.python.org) and [Think Python](http://www.greenteapress.com/thinkpython/) are both good places to start. – David Cain Jul 28 '12 at 04:57
2

It is probably a little late for asnwering this question, but I will give my opinion. Biopython has some really handy features that would help you achieve such a think easily. You could use something like a custom selection class and then call it for each one of the chains you want to select inside a for loop with the original pdb file.

    from Bio.PDB import Select, PDBIO
    from Bio.PDB.PDBParser import PDBParser

    class ChainSelect(Select):
        def __init__(self, chain):
            self.chain = chain

        def accept_chain(self, chain):
            if chain.get_id() == self.chain:
                return 1
            else:          
                return 0

    chains = ['A','B','C']
    p = PDBParser(PERMISSIVE=1)       
    structure = p.get_structure(pdb_file, pdb_file)

    for chain in chains:
        pdb_chain_file = 'pdb_file_chain_{}.pdb'.format(chain)                                 
        io_w_no_h = PDBIO()               
        io_w_no_h.set_structure(structure)
        io_w_no_h.save('{}'.format(pdb_chain_file), ChainSelect(chain))
  • How would you keep the secondary structures in the chain files? This only gets the atoms from each chain and you loose the HELIX and SHEET annotations. – Brian Wiley Jul 26 '20 at 11:22
  • The only way I can think off is to extract that information from the *PDBParser* object using the **get_header()** method, and then add it to the file later on. This might help https://biopython.org/docs/1.75/api/Bio.PDB.PDBParser.html#Bio.PDB.PDBParser.PDBParser.get_header – Carlos Rodrigues Jul 23 '21 at 01:51
0

Lets say you have the following file pdb_structures

1B68A 
1BZ4B
4FUTA

Then have your code in load_pdb.sh

while read name
do
    chain=${name:4:1}
    name=${name:0:4}
    wget -c "http://www.pdb.org/pdb/download/downloadFile.do?fileFormat=pdb&compression=NO&structureId="$name -O $name.pdb
    awk -v chain=$chain '$0~/^ATOM/ && substr($0,20,1)==chain {print}' $name.pdb > $name\_$chain.pdb
    #   rm $name.pdb
done

uncomment the last line if you don't need the original pdb's.
execute

cat pdb_structures | ./load_pdb.sh
tzelleke
  • 15,023
  • 5
  • 33
  • 49
  • This is simple and effective, but it ignores [HETATM](http://deposit.rcsb.org/adit/docs/pdb_atom_format.html#HETATM) records, which are part of a chain. Why write your own parser for a complex file format when there are plenty of already existing parsers? – David Cain Jul 27 '12 at 11:52
  • Also, using a PDB-retrieval program is a much better option than using wget with a hand-retrieved URL. Compression is handled automatically, among other features. See the final paragraph of my answer. – David Cain Jul 27 '12 at 12:13
  • @zelleke, Thank you very much for your answer.I tried your code. I don't need all chains from pdb files. I have mentioned the chain Id in the text file. for example, I need only 'A' chain from '1B68'. – user1545114 Jul 27 '12 at 12:31
0

The accepted biopython solution seems to no longer work out of the box in 2023, but I found a simple R-based solution that worked for me and want to share for others who come across this post.

library(bio3d)
pdbsplit("pdb.txt", path = "split_chain")

This will split the file into one file per chain in the directory "split_chain". The bio3d package documentation includes more details.