0

I have a sequence alignment as:

RefSeq     :MXKQRSLPLXQKRTKQAISFSASHRIYLQRKFSH .....

Templatepdb:-----------------ISFSASHR------FSHAQADFAG 

I am trying to write a code that re-number residues based on this alignment in PDB file as:

original pdb : RES ID= 1 1 1 1 1 1 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 5 ...

new pdb : RES ID = 18 18 18 19 19 19 19 19 20 20 20 21 21 22 23 24 25 31 31 31 31 32 32 33 34 35 36 ...

If alignment only has gaps at beginning of alignment, easy to figure out. Only count gaps("-") and add sum of gaps in to residue.id= " " "sum of gap" " "

However, I could not find a way if there are gaps in the middle of the sequence.

Do you have any suggestions?

azalea
  • 11,402
  • 3
  • 35
  • 46
ggokturk
  • 15
  • 6

1 Answers1

2

If I understand it correctly,

Your input is an alignment:

'-----------------ISFSASHR------FSHAQADFAG'

and a list of residue numbers:

[1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 8, 8, 8, 8, 9, 9, 9, 10, 11, 11, 11, 11, 12, 12, 12, 12, 12, 12, 13, 13, 13, 13, 13, 13, 13, 13, 14, 14, 14, 14, 14, 14, 14, 14, 14, 15, 15, 15, 15, 15, 15, 15, 15, 15, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 17, 18, 18, 18, 18]

And your output is the residue number shifted by the number of gaps before the residue:

[18, 18, 18, 18, 18, 18, 18, 18, 19, 19, 19, 19, 19, 19, 20, 20, 20, 20, 20, 20, 20, 20, 21, 21, 22, 22, 22, 22, 22, 22, 22, 23, 23, 23, 23, 23, 23, 23, 24, 24, 24, 24, 24, 25, 25, 25, 25, 32, 32, 32, 33, 34, 34, 34, 34, 35, 35, 35, 35, 35, 35, 36, 36, 36, 36, 36, 36, 36, 36, 37, 37, 37, 37, 37, 37, 37, 37, 37, 38, 38, 38, 38, 38, 38, 38, 38, 38, 39, 39, 39, 39, 39, 39, 39, 39, 39, 39, 40, 41, 41, 41, 41]

Below is the code to demonstrate it. There are numerous ways to calculate the output.

The way I do it is to keep a dictionary shift_dict with key as the original number and value as the shifted number.

import itertools
import random


def random_residue_number(sequence):
    nested = [[i + 1] * random.randint(1, 10) for i in range(len(sequence))]
    merged = list(itertools.chain.from_iterable(nested))
    return merged


def aligned_residue_number(alignment, original_number):
    gap_shift = 0
    residue_count = 0
    shift_dict = {}
    for residue in alignment:
        if residue == '-':
            gap_shift += 1
        else:
            residue_count += 1
            shift_dict[residue_count] = gap_shift + residue_count
    return [shift_dict[number] for number in original_number]


sequence = 'ISFSASHRFSHAQADFAG'
alignment = '-----------------ISFSASHR------FSHAQADFAG'
original_number = random_residue_number(sequence)
print(original_number)
# [1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 8, 8, 8, 8, 9, 9, 9, 10, 11, 11, 11, 11, 12, 12, 12, 12, 12, 12, 13, 13, 13, 13, 13, 13, 13, 13, 14, 14, 14, 14, 14, 14, 14, 14, 14, 15, 15, 15, 15, 15, 15, 15, 15, 15, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 17, 18, 18, 18, 18]
new_number = aligned_residue_number(alignment, original_number)
print(new_number)
# [18, 18, 18, 18, 18, 18, 18, 18, 19, 19, 19, 19, 19, 19, 20, 20, 20, 20, 20, 20, 20, 20, 21, 21, 22, 22, 22, 22, 22, 22, 22, 23, 23, 23, 23, 23, 23, 23, 24, 24, 24, 24, 24, 25, 25, 25, 25, 32, 32, 32, 33, 34, 34, 34, 34, 35, 35, 35, 35, 35, 35, 36, 36, 36, 36, 36, 36, 36, 36, 37, 37, 37, 37, 37, 37, 37, 37, 37, 38, 38, 38, 38, 38, 38, 38, 38, 38, 39, 39, 39, 39, 39, 39, 39, 39, 39, 39, 40, 41, 41, 41, 41]
azalea
  • 11,402
  • 3
  • 35
  • 46
  • Thanks a lot, after get this new_number list, do you have any suggestion to replace with residue numbers in .pdb file ? – ggokturk Jun 16 '17 at 09:24
  • Take a look at Biopython's [Bio.PDB](http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc149) module, which may help you. Also you can write your own parser to just parse the ATOM section according to [PDB specification](https://www.cgl.ucsf.edu/chimera/docs/UsersGuide/tutorials/pdbintro.html). – azalea Jun 17 '17 at 03:55