I'm trying to choose unique regions for my Non-invasive Prenatal Testing (NIPT) project. I have done following steps:
- Create an artificial fasta file contains 50bp sequence. On each chromosome, the next sequence overlap 40bp from previous sequence
- Align and only chosen no mismatch sequence
I have a .sam file about 40gb, on the next step I try to merge all overlapping coordinates to one .bed file for using in samtools. This is my python script to do that:
import glob
import time
folder = glob.glob('analysys/*.sam')
core = 22
nst = [f'chr{x+1}' for x in range(22)] + ['chrX','chrY']
for file in folder:
now = time.time()
print(f'Analyzing {file}')
raw = []
treat1 = []
treat2 = []
with open(file) as samfile:
print('Importing coordinates')
for line in samfile:
coord_raw = line.split('\t')[0]
coord_chr = coord_raw.split('_')[0]
coord_loc = [int(r) for r in coord_raw.split('_')[1].split('-')] #0: begin, 1: end
raw.append(({coord_chr},{coord_loc[0]},{coord_loc[1]}))
print('Begin merging overlapping area...')
last_coord = () #0:chr, 1:begin, 2:end
for chrom ,begin , end in raw:
if last_coord == () or last_coord[0] != chrom:
last_coord = (chrom,begin,end)
else:
if last_coord[0] == chrom:
if begin > last_coord[1] and end < last_coord[2]:
last_coord = (chrom,last_coord[1],end)
else:
treat1.append(({last_coord[0]},{last_coord[1]},{last_coord[2]}))
last_coord = (chrom,begin,end)
else:
treat1.append(({last_coord[0]},{last_coord[1]},{last_coord[2]}))
last_coord = (chrom,begin,end)
print('Begin merging nearby area...')
last_coord = ()
for chrom ,begin , end in treat1:
if last_coord == ():
last_coord = (chrom,begin,end)
else:
if chrom == last_coord[0]:
if begin == last_coord[2] + 1:
last_coord = (last_coord[0],last_coord[1],end)
else:
treat2.append(f'{last_coord[0]}\t{last_coord[1]}\t{last_coord[2]}')
last_coord = (chrom,begin,end)
else:
treat2.append(f'{last_coord[0]}\t{last_coord[1]}\t{last_coord[2]}')
last_coord = (chrom,begin,end)
print('Outputting...')
with open('unique_coord.bed','w') as bedfile:
bedfile.write('\n'.join(treat2))
print(f'Program completed! Total time: {int(time.time() - now)} seconds')
However, after 30 minutes running, I found that the program consumed all of my memory and crashed. Is there any advice for me to reduce the amount of memory this script consumed? thank you very much