9

I would like to generate one hot encoding for a set of DNA sequences. For example the sequence ACGTCCA can be represented as below in a transpose manner. But the code below will generate the one hot encoding in horizontal way in which I would prefer it in vertical form. Can anyone help me?

ACGTCCA 
1000001 - A
0100110 - C 
0010000 - G
0001000 - T

Example code:

from sklearn.preprocessing import OneHotEncoder
import itertools

# two example sequences
seqs = ["ACGTCCA","CGGATTG"]


# split sequences to tokens
tokens_seqs = [seq.split("\\") for seq in seqs]

# convert list of of token-lists to one flat list of tokens
# and then create a dictionary that maps word to id of word,
# like {A: 1, B: 2} here
all_tokens = itertools.chain.from_iterable(tokens_seqs)
word_to_id = {token: idx for idx, token in enumerate(set(all_tokens))}

# convert token lists to token-id lists, e.g. [[1, 2], [2, 2]] here
token_ids = [[word_to_id[token] for token in tokens_seq] for tokens_seq in tokens_seqs]

# convert list of token-id lists to one-hot representation
vec = OneHotEncoder(n_values=len(word_to_id))
X = vec.fit_transform(token_ids)

print X.toarray()

However, the code gives me output:

[[ 0.  1.]
 [ 1.  0.]]

Expected output:

[[1. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 1. 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
[0. 0. 0. 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 1. 0. 0. 0. 0. 1. 1. 0.]]
Wai Ha Lee
  • 8,598
  • 83
  • 57
  • 92
Xiong89
  • 767
  • 2
  • 13
  • 24

3 Answers3

15
def one_hot_encode(seq):
    mapping = dict(zip("ACGT", range(4)))    
    seq2 = [mapping[i] for i in seq]
    return np.eye(4)[seq2]

one_hot_encode("AACGT")

## Output: 
array([[1., 0., 0., 0.],
   [1., 0., 0., 0.],
   [0., 1., 0., 0.],
   [0., 0., 1., 0.],
   [0., 0., 0., 1.]])
DrIDK
  • 7,642
  • 2
  • 14
  • 14
6

I suggest doing it a slightly more manual way:

import numpy as np

seqs = ["ACGTCCA","CGGATTG"]

CHARS = 'ACGT'
CHARS_COUNT = len(CHARS)

maxlen = max(map(len, seqs))
res = np.zeros((len(seqs), CHARS_COUNT * maxlen), dtype=np.uint8)

for si, seq in enumerate(seqs):
    seqlen = len(seq)
    arr = np.chararray((seqlen,), buffer=seq)
    for ii, char in enumerate(CHARS):
        res[si][ii*seqlen:(ii+1)*seqlen][arr == char] = 1

print res

This gives you your desired result:

[[1 0 0 0 0 0 1 0 1 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0]
 [0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 1 1 0]]
John Zwinck
  • 239,568
  • 38
  • 324
  • 436
  • Thanks. This solved perfectly for single bases A, T, C, G. May I know could it be possible to use on dinucleotide which have 16 possible combinations like AA, AT, AC, AG, etc. If it appeared as TGA, then TG will be given value 1 and GA too given the value 1. – Xiong89 Dec 14 '15 at 10:24
  • @Xiong89: I'm sure you can make it work, but if I'm going to write all your code for you I might ask an hourly rate! I think the main building block you need for the di case is this: `arr = np.chararray((3, 2), buffer='asdfds')` then `np.all(np.chararray((2,), buffer='as') == arr, axis=1)` - this gives you an array of `[True, False, False]`. That is, you need to build all 16 two-character combos as `chararray`s, then match against those. Give it a try. :) – John Zwinck Dec 14 '15 at 10:32
0
from keras.utils import to_categorical

def one_hot_encoding(seq):
    mp = dict(zip('ACGT', range(4)))
    seq_2_number = [mp[nucleotide] for nucleotide in seq]
    return to_categorical(seq_2_number, num_classes=4, dtype='int32').flatten()
Rajan saha Raju
  • 794
  • 7
  • 13