1

I want to do pairwise alignment with uniprot and pdb sequences. I have an input file containing uniprot and pdb IDs like this.

pdb id  uniprot id

1dbh    Q07889
1e43    P00692
1f1s    Q53591

first, I need to read each line in an input file 2) retrieve the pdb and uniprot sequences from pdb.fasta and uniprot.fasta files 3) Do alignment and calculate sequence identity.

Usually, I use the following program for pairwise alignment and seq.identity calculation.

library("seqinr")
seq1 <- "MDEKRRAQHNEVERRRRDKINNWIVQLSKIIPDSSMESTKSGQSKGGILSKASDYIQELRQSNHR"
seq2<- "MKGQQKTAETEEGTVQIQEGAVATGEDPTSVAIASIQSAATFPDPNVKYVFRTENGGQVM"
library(Biostrings)
globalAlign<- pairwiseAlignment(seq1, seq2)
pid(globalAlign, type = "PID3")

I need to print the output like this

pdbid   uniprotid  seq.identity
1dbh    Q07889      99
1e43    P00692      80
1f1s    Q53591      56

How can I change the above code ? your help would be appreciated!

'

  • `Biostrings` is not in CRAN. BTW, I see no calls to functions from `seqinr` . Why did you load it? But, anyway, if you could provide a sample of the output from `pid` , perhaps someone can suggest how to rearrange the data into the format you request. – Carl Witthoft Feb 08 '13 at 12:32
  • 1
    Might try the no-subscription-required Bioconductor [mailing list](http://bioconductor.org/help/mailing-list/mailform) for help with Biostrings – Martin Morgan Feb 08 '13 at 13:42

2 Answers2

1

This code is hopefully what your looking for:

class test():

    def get_seq(self, pdb,fasta_file): # Get sequences
        from Bio.PDB.PDBParser import PDBParser
        from Bio import SeqIO
        aa = {'ARG':'R','HIS':'H','LYS':'K','ASP':'D','GLU':'E','SER':'S','THR':'T','ASN':'N','GLN':'Q','CYS':'C','SEC':'U','GLY':'G','PRO':'P','ALA':'A','ILE':'I','LEU':'L','MET':'M','PHE':'F','TRP':'W','TYR':'Y','VAL':'V'}
        p=PDBParser(PERMISSIVE=1)
        structure_id="%s" % pdb[:-4]
        structure=p.get_structure(structure_id, pdb)
        residues = structure.get_residues()
        seq_pdb = ''
        for res in residues:
            res = res.get_resname() 
            if res in aa:
                seq_pdb = seq_pdb+aa[res]           

        handle = open(fasta_file, "rU")
        for record in SeqIO.parse(handle, "fasta") :
            seq_fasta = record.seq
        handle.close()
        self.seq_aln(seq_pdb,seq_fasta)

    def seq_aln(self,seq1,seq2): # Align the sequences
        from Bio import pairwise2
        from Bio.SubsMat import MatrixInfo as matlist

        matrix = matlist.blosum62
        gap_open = -10
        gap_extend = -0.5

        alns = pairwise2.align.globalds(seq1, seq2, matrix, gap_open, gap_extend)
        top_aln = alns[0]
        aln_seq1, aln_seq2, score, begin, end = top_aln
        with open('aln.fasta', 'w') as outfile:
            outfile.write('> PDB_seq\n'+str(aln_seq1)+'\n> Uniprot_seq\n'+str(aln_seq2))
        print aln_seq1+'\n'+aln_seq2
        self.seq_id('aln.fasta')

    def seq_id(self,aln_fasta): # Get sequence ID
        import string
        from Bio import AlignIO

        input_handle = open("aln.fasta", "rU")
        alignment = AlignIO.read(input_handle, "fasta")
        j=0 # counts positions in first sequence
        i=0 # counts identity hits
        for record in alignment:
            #print record
            for amino_acid in record.seq:
                if amino_acid == '-':
                    pass
                else:
                    if amino_acid == alignment[0].seq[j]:
                        i += 1
                j += 1
            j = 0
            seq = str(record.seq)
            gap_strip = seq.replace('-', '')

            percent = 100*i/len(gap_strip)
            print record.id+' '+str(percent)
            i=0


a = test()
a.get_seq('1DBH.pdb','Q07889.fasta')

This outputs:

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------EQTYYDLVKAF-AEIRQYIRELNLIIKVFREPFVSNSKLFSANDVENIFSRIVDIHELSVKLLGHIEDTVE-TDEGSPHPLVGSCFEDLAEELAFDPYESYARDILRPGFHDRFLSQLSKPGAALYLQSIGEGFKEAVQYVLPRLLLAPVYHCLHYFELLKQLEEKSEDQEDKECLKQAITALLNVQSG-EKICSKSLAKRRLSESA-------------AIKK-NEIQKNIDGWEGKDIGQCCNEFI-EGTLTRVGAKHERHIFLFDGL-ICCKSNHGQPRLPGASNAEYRLKEKFF-RKVQINDKDDTNEYKHAFEIILKDENSVIFSAKSAEEKNNW-AALISLQYRSTL---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
MQAQQLPYEFFSEENAPKWRGLLVPALKKVQGQVHPTLESNDDALQYVEELILQLLNMLCQAQPRSASDVEERVQKSFPHPIDKWAIADAQSAIEKRKRRNPLSLPVEKIHPLLKEVLGYKIDHQVSVYIVAVLEYISADILKLVGNYVRNIRHYEITKQDIKVAMCADKVLMDMFHQDVEDINILSLTDEEPSTSGEQTYYDLVKAFMAEIRQYIRELNLIIKVFREPFVSNSKLFSANDVENIFSRIVDIHELSVKLLGHIEDTVEMTDEGSPHPLVGSCFEDLAEELAFDPYESYARDILRPGFHDRFLSQLSKPGAALYLQSIGEGFKEAVQYVLPRLLLAPVYHCLHYFELLKQLEEKSEDQEDKECLKQAITALLNVQSGMEKICSKSLAKRRLSESACRFYSQQMKGKQLAIKKMNEIQKNIDGWEGKDIGQCCNEFIMEGTLTRVGAKHERHIFLFDGLMICCKSNHGQPRLPGASNAEYRLKEKFFMRKVQINDKDDTNEYKHAFEIILKDENSVIFSAKSAEEKNNWMAALISLQYRSTLERMLDVTMLQEEKEEQMRLPSADVYRFAEPDSEENIIFEENMQPKAGIPIIKAGTVIKLIERLTYHMYADPNFVRTFLTTYRSFCKPQELLSLIIERFEIPEPEPTEADRIAIENGDQPLSAELKRFRKEYIQPVQLRVLNVCRHWVEHHFYDFERDAYLLQRMEEFIGTVRGKAMKKWVESITKIIQRKKIARDNGPGHNITFQSSPPTVEWHISRPGHIETFDLLTLHPIEIARQLTLLESDLYRAVQPSELVGSVWTKEDKEINSPNLLKMIRHTTNLTLWFEKCIVETENLEERVAVVSRIIEILQVFQELNNFNGVLEVVSAMNSSPVYRLDHTFEQIPSRQKKILEEAHELSEDHYKKYLAKLRSINPPCVPFFGIYLTNILKTEEGNPEVLKRHGKELINFSKRRKVAEITGEIQQYQNQPYCLRVESDIKRFFENLNPMGNSMEKEFTDYLFNKSLEIEPRNPKPLPRFPKKYSYPLKSPGVRPSNPRPGTMRHPTPLQQEPRKISYSRIPESETESTASAPNSPRTPLTPPPASGASSTTDVCSVFDSDHSSPFHSSNDTVFIQVTLPHGPRSASVSSISLTKGTDEVPVPPPVPPRRRPESAPAESSPSKIMSKHLDSPPAIPPRQPTSKAYSPRYSISDRTSISDPPESPPLLPPREPVRTPDVFSSSPLHLQPPPLGKKSDHGNAFFPNSPSPFTPPPPQTPSPHGTRRHLPSPPLTQEVDLHSIAGPPVPPRQSTSQHIPKLPPKTYKREHTHPSMHRDGPPLLENAHSS
PDB_seq 100 # pdb to itself would obviously have 100% identity
Uniprot_seq 24 # pdb sequence has 24% identity to the uniprot sequence

For this to work on you input file, you need to put my a.get_seq() in a for loop with the inputs from your text file.

EDIT:

Replace the seq_id function with this one:

def seq_id(self,aln_fasta):
    import string
    from Bio import AlignIO
    from Bio import SeqIO

    record_iterator = SeqIO.parse(aln_fasta, "fasta")
    first_record = record_iterator.next()
    print '%s has a length of %d' % (first_record.id, len(str(first_record.seq).replace('-','')))
    second_record = record_iterator.next()
    print '%s has a length of %d' % (second_record.id, len(str(second_record.seq).replace('-','')))

    lengths = [len(str(first_record.seq).replace('-','')), len(str(second_record.seq).replace('-',''))]
    if lengths.index(min(lengths)) == 0: # If both sequences have the same length the PDB sequence will be taken as the shortest
        print 'PDB sequence has the shortest length'
    else:
        print 'Uniport sequence has the shortes length'

    idenities = 0   
    for i,v in enumerate(first_record.seq):
        if v == '-':
            pass
            #print i,v, second_record.seq[i]
        if v == second_record.seq[i]:
            idenities +=1
            #print i,v, second_record.seq[i], idenities

    print 'Sequence Idenity = %.2f percent' % (100.0*(idenities/min(lengths)))

to pass the arguments to the class use:

with open('input_file.txt', 'r') as infile:
    next(infile)
    next(infile) # Going by your input file
    for line in infile:
        line = line.split()
        a.get_seq(segs[0]+'.pdb',segs[1]+'.fasta')
Harpal
  • 12,057
  • 18
  • 61
  • 74
  • Thank you very much for your code. I couldn't execute your code using for loop.Could you please help me? I need to use the following formula for percent identity PID= 100*(identical positions/shortest sequence). – user2053919 Feb 09 '13 at 01:18
  • Thank you very much for your code. When I execute your code, I got the following error. File "./ali.py", line 75 with open('input_file.txt', 'r') as infile: ^ IndentationError: unindent does not match any outer indentation level – user2053919 Feb 12 '13 at 10:02
  • Check the indentation is tab and not spaces, The class I wrote is indented with tabs, where as I think the last bit of my answer maybe indented with 4 spaces. – Harpal Feb 12 '13 at 11:03
0

It might be something like this; a repeatable example (e.g., with short files posted on-line) would help...

library(Biostrings)
pdb = readAAStringSet("pdb.fasta")
uniprot = readAAStringSet("uniprot.fasta")

to input all sequences into two objects. pairwiseAlignment accepts a vector as first (query) argument, so if you were wanting to align all pdb against all uniprot pre-allocate a result matrix

pids = matrix(numeric(), length(uniprot), length(pdb),
              dimnames=list(names(uniprot), names(pdb)))

and then do the calculations

for (i in seq_along(uniprot)) {
    globalAlignment = pairwiseAlignment(pdb, uniprot[i])
    pids[i,] = pid(globalAlignment)
} 
Martin Morgan
  • 45,935
  • 7
  • 84
  • 112