I need to deal with super large txt input files, and I usually use .readlines() to first read the whole file, and turn it into a list.
I know it's really memory-cost and can be quite slow, but I also need to make use of LIST characteristics to manipulate the specific lines, like below:
#!/usr/bin/python
import os,sys
import glob
import commands
import gzip
path= '/home/xxx/scratch/'
fastqfiles1=glob.glob(path+'*_1.recal.fastq.gz')
for fastqfile1 in fastqfiles1:
filename = os.path.basename(fastqfile1)
job_id = filename.split('_')[0]
fastqfile2 = os.path.join(path+job_id+'_2.recal.fastq.gz')
newfastq1 = os.path.join(path+job_id+'_1.fastq.gz')
newfastq2 = os.path.join(path+job_id+'_2.fastq.gz')
l1= gzip.open(fastqfile1,'r').readlines()
l2= gzip.open(fastqfile2,'r').readlines()
f1=[]
f2=[]
for i in range(0,len(l1)):
if i % 4 == 3:
b1=[ord(x) for x in l1[i]]
ave1=sum(b1)/float(len(l1[i]))
b2=[ord(x) for x in str(l2[i])]
ave2=sum(b2)/float(len(l2[i]))
if (ave1 >= 20 and ave2>= 20):
f1.append(l1[i-3])
f1.append(l1[i-2])
f1.append(l1[i-1])
f1.append(l1[i])
f2.append(l2[i-3])
f2.append(l2[i-2])
f2.append(l2[i-1])
f2.append(l2[i])
output1=gzip.open(newfastq1,'w')
output1.writelines(f1)
output1.close()
output2=gzip.open(newfastq2,'w')
output2.writelines(f2)
output2.close()
In general, I'm trying to read every 4th line of the whole text, but if the 4th line meets the desired condition, I'll append these 4 lines into the text. So can I avoid readlines() to achieve this? thx
EDIT: Hi, actually I myself found a better way:
import commands
l1=commands.getoutput('zcat ' + fastqfile1).splitlines(True)
l2=commands.getoutput('zcat ' + fastqfile2).splitlines(True)
I think 'zcat' is super fast.... It took around 15min to readlines, while only 1 minute to just zcat...