I am a newbie in the python world and bioinformatics. I am dealing with a almost 50GB structured file to write it out. So I would like to take some great tips from you.
The file goes like this. (it's actually called FASTQ_format)
@Machinename:~:Team1:atcatg 1st line.
atatgacatgacatgaca 2nd line.
+ 3rd line.
asldjfwe!@#$#%$ 4th line.
These four lines are repeated in order. Those 4 lines are like a team.
And I have nearly 30 candidates DNA sequences. e.g. atgcat
, tttagc
What I am doing is have each candidate DNA sequence going through the huge file to find whether a candidate sequence is similar to Team dna sequence, which means allowing one mismatch to each (e.g. taaaaa
= aaaata
) and if they are similar or same, I use dictionary to store them to write it out later. key for candidate DNA sequence. Value for (4 lines) in List to store them in order by line order
So what I have done is:
def myfunction(str1, str2): # to find if they are similar( allowed one mis match) if they are similar, it returns true
f = open('hugefile')
diction = {}
mylist = ['candidate dna sequences1','dna2','dna3','dna4'...]
while True:
line = f.readline()
if not line:
break
if "machine name" in line:
teamseq = line.split(':')[-1]
if my function(candidate dna, team dna) == True:
if not candidate dna in diction.keys():
diction[candidate dna] = []
diction[candidate dna].append(line)
diction[candidate dna].append(line)
diction[candidate dna].append(line)
diction[candidate dna].append(line)
else: # chances some same team dna are repeated.
diction[candidate dna].append(line)
diction[candidate dna].append(line)
diction[candidate dna].append(line)
diction[candidate dna].append(line)
f.close()
wf = open(hughfile+".out", 'w')
for i in candidate dna list: # dna 1 , dna2, dna3
wf.write(diction[i] + '\n')
wf.close()
My function doesn't use any global variables (I think I am happy with my function), whereas the dictionary variable is a global variable and takes all the data as well as making lots of list instances. The code is simple but so slow and such a big pain in the butt to the CPU and memory. I use pypy though.
So any tips write it out in order by line order?