In my project, I need to one-hot encode millions of DNA sequences about 100 times (in total, billions of encodings of similar sequences), so an efficient method is very important to me.
Below is my code, which takes 4.5 s for 10K sequences.
import numpy as np
import os,sys,time
def dna2onehot(dnaSeq):
    seqLen = len(dnaSeq)
    dnaSeq = dnaSeq.upper()
    # initialize the matrix to seqlen x 4
    seqMatrix = np.zeros((seqLen,4))
    # change the value to matrix
    for i in range(0,seqLen):
        if dnaSeq[i] == 'A':
            seqMatrix[i,0] = 1
        if dnaSeq[i] == 'C':
            seqMatrix[i,1] = 1
        if dnaSeq[i] == 'G':
            seqMatrix[i,2] = 1
        if dnaSeq[i] == 'T':
            seqMatrix[i,3] = 1
    ret = np.array(seqMatrix.flat)
    return ret
#
sequence = "TCTGAGTCCCAATACACAAGAGGTTCCCTCACCTGTTCTGGTGTCAGACCCTCCCAGATGATCACCTCTCCTATGGCGGGGAAGGTGCCTGGATGTCTAAAGCCTGAAATGGGGATCTATCCCAGAAGCTGTGTAGCTTCTGCCTGTCCCAGAAGCTGTGTTGTTTCTGTATTCAGCTTGCTCACCCTCCGCAGTCCATTGATCTGCACAGACTGTTCTCAGATGGACTCGTGAGACAAGATGGCTCCTTCACCTGCTCTGGGGATCAGAACCCTCCCAGGTGGCCACCTCTCCTGTGGTGGGGAAGGTACCTGGAAGTCTTCAGCCCAAAACAGGGCCTGTCCCAGAAGCTGTGTCTCTTCTGCCTATCCCAGAAGCTGTATTGCTTCTGCTGTCCACTTGCTCACCCTCTGCAGTCTGCATGCTGATCTGCGCAGACTGTTCTCAGAGGGATCTGGCAGACAAGTTGGCTCCCTCACCTGCTCTGGGGCGGGGGGGGGGGGTTCAGAGCCCTCCTGGGCAGCCACCTCTCCTCTAGCAGAGAAGGTGCTGGGATGTCTTGAGCAGGAAACGGGGTATGTCCCAGAAGCTGTCTTGCTTCTGCAATCCACATGCTCAGCCTCTGCAGTCTGTGAGCTAATCTGGGCAGTCTGGTCTCAGGGGACTCTGGAGACAAGATGGCTCCCTCACCTGCTCTGGGGGTCAAAGCCCTCCTTGGCAGCCACCTTTTTCAGGCGGAGAAGGTGCCCGGATGTCTGGAGCCTGAAACAGGGGTATGTCCCAGACACTGTGTAGCTTCTGCCTGCCCCAGAAGATGTGTCACTTCCTCAGTCTGCTTGTTCACCCTCCACAGTCTGCAAGCTGATCTGCACAGACTGGTCTCAGAGGGACCTAGAAGACAAGATCAAGAAAAGTCTTATAGGTATAATGAATCAAGCAGAAAATGAAACATCAGAAGCTTAAGATAAAATACAGGATCTAGTCCAAATTAGCAAGAAGTA"
count = 10000
datalist = []
t1 = time.time()
for k in range(count):
    datalist.append(dna2onehot(sequence))
#
t2 = time.time()
print("time cost:",t2-t1)
Do you have any suggestions for reducing the run time in Python? (My whole project is based on Python.)
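One direction that might be worth trying is replacing the per-character Python loop with a NumPy lookup table. The sketch below is untested for speed on my data and makes some assumptions: the names `dna2onehot_vectorized` and `_LOOKUP` are made up for illustration, the input is assumed to be ASCII-only, and any character other than A/C/G/T (for example N) is encoded as an all-zero row, which matches what the loop above does.

```python
import numpy as np

# Lookup table indexed by byte value: row b is the one-hot vector for byte b.
# Rows for anything other than A/C/G/T stay all-zero (assumption: same
# behaviour as the loop version, which leaves unknown bases as zeros).
_LOOKUP = np.zeros((256, 4), dtype=np.float64)
for col, base in enumerate("ACGT"):
    _LOOKUP[ord(base), col] = 1.0
    _LOOKUP[ord(base.lower()), col] = 1.0  # lowercase handled without upper()


def dna2onehot_vectorized(dna_seq):
    """Hypothetical drop-in for dna2onehot; assumes dna_seq is ASCII-only."""
    # Reinterpret the encoded string as an array of byte codes (no Python loop).
    codes = np.frombuffer(dna_seq.encode("ascii"), dtype=np.uint8)
    # Integer-array indexing selects one table row per base, giving (len, 4);
    # ravel() flattens it to a 1-D vector of length 4 * len(dna_seq).
    return _LOOKUP[codes].ravel()
```

It would be called the same way as the original, e.g. `datalist.append(dna2onehot_vectorized(sequence))` inside the timing loop, so the two versions can be compared directly on the same 10K sequences.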