Python: build consensus sequence

Question

I want to build a consensus sequence from several sequences in Python, and I'm looking for the most efficient / most Pythonic way to do it.

I have a list of strings like this:

sequences = ["ACTAG", "-TTCG", "CTTAG"]

I furthermore have an alphabet like this:

alphabet = ["A", "C", "G", "T"]

and a position frequency matrix like this:

The character that occurs most often at a position is used for the consensus sequence.

Additionally, when two or more characters are tied for the highest count at a position, an ambiguity code is used instead (in this example, position 0 is A or C, which gives M; see IUPAC codes).

The expected consensus sequence for my example is therefore "MTTAG".

EDIT:

What is the most efficient / most pythonic way to get this consensus sequence based on the given alphabet and position frequency matrix?

I think your frequency matrix should have the 3 at the G column in the last row — Jan Wilamowski, Jun 04 '21 at 13:18
Most pythonic might be something like `"".join(pos_to_code(pos) for pos in zip(*sequences))` if you have a function to translate a single position into a code but that might not be the most efficient way. — Jan Wilamowski, Jun 04 '21 at 13:28
Hi @JanWilamowski , what would be the best approach when using the alphabet and the position frequency matrix? — Thomas Müller, Jun 04 '21 at 15:14

score 3 · Accepted Answer · answered Jun 06 '21 at 08:09

If you already have the position frequency matrix, you could process it as a pandas DataFrame. I chose to orient it such that the alphabet is the index (note the transpose call at the end):

import pandas as pd
freq = pd.DataFrame([[1, 1, 0, 0], [0, 1, 0, 2], [0, 0, 0, 3], [2, 1, 0, 0], [0, 0, 3, 0]], columns=['A', 'C', 'G', 'T']).transpose()

gives

   0  1  2  3  4
A  1  0  0  2  0
C  1  1  0  1  0
G  0  0  0  0  3
T  0  2  3  0  0
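If only the aligned sequences are available, the matrix can be derived from them first. The following is a minimal sketch using only the standard library; `build_pfm` is a hypothetical helper name, and it assumes all sequences have the same length and that characters outside the alphabet (such as the gap `-`) should simply be ignored:

```python
sequences = ["ACTAG", "-TTCG", "CTTAG"]
alphabet = ["A", "C", "G", "T"]


def build_pfm(sequences, alphabet):
    """Return {nucleotide: [count at position 0, count at position 1, ...]}.

    Assumes the sequences are aligned and of equal length; zip() would
    silently truncate to the shortest one otherwise. Characters that are
    not in the alphabet (e.g. the gap '-') are not counted.
    """
    columns = list(zip(*sequences))  # one tuple of characters per position
    return {nuc: [col.count(nuc) for col in columns] for nuc in alphabet}


pfm = build_pfm(sequences, alphabet)
for nuc, counts in pfm.items():
    print(nuc, counts)
```

Passing the result to `pd.DataFrame.from_dict(pfm, orient='index')` should give a frame with the same orientation as the one shown above.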

You'll want to look only at the most common nucleotides:

most_common = freq[freq == freq.max(axis=0)]

gives

     0    1    2    3    4
A  1.0  NaN  NaN  2.0  NaN
C  1.0  NaN  NaN  NaN  NaN
G  NaN  NaN  NaN  NaN  3.0
T  NaN  2.0  3.0  NaN  NaN

Then create a function that determines the consensus from a single column of the above matrix, based on IUPAC codes:

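# IUPAC nucleotide codes, keyed by the alphabetically sorted tied nucleotides
codes = {
    'A': 'A', 'C': 'C', 'G': 'G', 'T': 'T',
    'AG': 'R', 'CT': 'Y', 'CG': 'S', 'AT': 'W', 'GT': 'K', 'AC': 'M',
    'CGT': 'B', 'AGT': 'D', 'ACT': 'H', 'ACG': 'V',
    'ACGT': 'N'
}

def freq_to_code(pos):
    # The non-NaN rows of a masked column are its most common nucleotides
    top_nucs = pos.dropna().index
    # Sort so the key matches the dictionary regardless of index order
    key = ''.join(sorted(top_nucs))
    return codes[key]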

Apply that function to each column and form a string to get the final result:

consensus = most_common.apply(freq_to_code, axis=0)
print(''.join(consensus))

gives MTTAG
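If pandas is not otherwise needed, the same result can be computed in plain Python directly from the alphabet and the frequency matrix. This is a sketch rather than a tuned implementation; `consensus_from_pfm` and `IUPAC` are names made up for illustration, and the matrix is assumed to be a list with one row of counts per position, ordered like the alphabet:

```python
# IUPAC codes keyed by the alphabetically sorted tied nucleotides
# (same table as in the answer above).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M",
    "CGT": "B", "AGT": "D", "ACT": "H", "ACG": "V",
    "ACGT": "N",
}


def consensus_from_pfm(rows, alphabet):
    """Build the consensus from per-position count rows.

    rows: one list of counts per position, in the same order as `alphabet`.
    Note: a position where every count is 0 is a four-way tie and yields 'N'.
    """
    result = []
    for counts in rows:
        top = max(counts)
        # Collect every nucleotide that reaches the maximum count
        tied = sorted(nuc for nuc, n in zip(alphabet, counts) if n == top)
        result.append(IUPAC["".join(tied)])
    return "".join(result)


alphabet = ["A", "C", "G", "T"]
rows = [[1, 1, 0, 0], [0, 1, 0, 2], [0, 0, 0, 3], [2, 1, 0, 0], [0, 0, 3, 0]]
print(consensus_from_pfm(rows, alphabet))  # MTTAG
```

For short alignments this avoids the DataFrame overhead; for very long ones a vectorized approach may be faster, which would need measuring.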
