-1

I'm trying to choose unique regions for my Non-invasive Prenatal Testing (NIPT) project. I have done following steps:

  • Create an artificial fasta file contains 50bp sequence. On each chromosome, the next sequence overlap 40bp from previous sequence
  • Align and only chosen no mismatch sequence

I have a .sam file about 40gb, on the next step I try to merge all overlapping coordinates to one .bed file for using in samtools. This is my python script to do that:

import glob
import time

folder = glob.glob('analysys/*.sam')
core = 22
nst = [f'chr{x+1}' for x in range(22)] + ['chrX','chrY']
for file in folder:
    now = time.time()
    print(f'Analyzing {file}')
    raw = []
    treat1 = []
    treat2 = []
    with open(file) as samfile:
        print('Importing coordinates')
        for line in samfile:
            coord_raw = line.split('\t')[0]
            coord_chr = coord_raw.split('_')[0]
            coord_loc = [int(r) for r in coord_raw.split('_')[1].split('-')] #0: begin, 1: end
            raw.append(({coord_chr},{coord_loc[0]},{coord_loc[1]}))
        print('Begin merging overlapping area...')
        last_coord = () #0:chr, 1:begin, 2:end
        for chrom ,begin , end in raw:
            if last_coord == () or last_coord[0] != chrom:
                last_coord = (chrom,begin,end)
            else:
                if last_coord[0] == chrom:
                    if begin > last_coord[1] and end < last_coord[2]:
                        last_coord = (chrom,last_coord[1],end)
                    else:
                        treat1.append(({last_coord[0]},{last_coord[1]},{last_coord[2]}))
                        last_coord = (chrom,begin,end)
                else:
                    treat1.append(({last_coord[0]},{last_coord[1]},{last_coord[2]}))
                    last_coord = (chrom,begin,end)       
        print('Begin merging nearby area...')                         
        last_coord = ()
        for chrom ,begin , end in treat1:
            if last_coord == ():
                last_coord = (chrom,begin,end) 
            else:
                if chrom == last_coord[0]:
                    if begin == last_coord[2] + 1:
                        last_coord = (last_coord[0],last_coord[1],end)
                    else:
                        treat2.append(f'{last_coord[0]}\t{last_coord[1]}\t{last_coord[2]}')
                        last_coord = (chrom,begin,end)
                else:
                    treat2.append(f'{last_coord[0]}\t{last_coord[1]}\t{last_coord[2]}')
                    last_coord = (chrom,begin,end)
print('Outputting...')
with open('unique_coord.bed','w') as bedfile:
    bedfile.write('\n'.join(treat2))
print(f'Program completed! Total time: {int(time.time() - now)} seconds')

However, after 30 minutes running, I found that the program consumed all of my memory and crashed. Is there any advice for me to reduce the amount of memory this script consumed? thank you very much

user4157124
  • 2,809
  • 13
  • 27
  • 42
Rotomegax
  • 25
  • 5
  • Looking over your code quickly, it looks like you load the whole file in and create `raw`, then you have another loop to create `treat1`, then you have another loop to create `treat2` (which also stays in memory) and then you write `treat2` to a file. From what I can tell, there is no need to hold all of `raw` and `treat1` in memory. Is there a way to write the code so that after reading in one line of your input file, you go through all the steps to determine what if anything goes into `treat2` and then write it to the output file without storing everything in memory? – ramzeek Mar 14 '22 at 03:33
  • Its because I need to merge overlap area, I tried to remove the step load all to raw but its complicated with my skill. `treat1` contains all merged coordinate and `treat2` contains merging of nearby overlapping area. – Rotomegax Mar 14 '22 at 03:43
  • As far as I can tell, you only ever use one line of `raw` at a time, and only ever one line of `treat1` at a time, and then once you've created `treat2` you just write it to a file. Is that correct? Or am I missing something? – ramzeek Mar 14 '22 at 03:51
  • I load everything to raw, then process to merge overlapping areas to treat1. Then I found does those merging area next to each other or not, if yes, I merge all of them and finally export to treat2. – Rotomegax Mar 14 '22 at 03:56
  • I've tried to give you an answer as best I can. If this isn't sufficient (or even if it is) I think it would be very beneficial to create a small version of one of your files and what the output would look like and ask that as another, separate question. Based on your description I am pretty sure there is a more efficient way to program this, I just can't guess what it would be. – ramzeek Mar 14 '22 at 04:23

1 Answers1

1

Hopefully this does the trick. Hard to know since I'm working without having the input file or knowing what the output should look like and can't really run the code to check for errors, but here's what I've tried to do:

  1. Eliminate raw and go straight to creating treat1; and
  2. Eliminate treat2 and go straight to writing to bedfile. I opened bedfile right off the bat and everything in your program is done with that file open.

If this does what you need but still crashes, then perhaps you can read through each file twice to get last_coord at the end of the treat1 and then read through it again to "recreate" each line of treat1 individually and apply it to defining what needs to be written into the file.

Without really knowing the details of what you're doing (I do not work anywhere close to a field that would apply samtools).

import glob
import time

folder = glob.glob('analysys/*.sam')

# These two lines are not used
#core = 22
#nst = [f'chr{x+1}' for x in range(22)] + ['chrX','chrY']

# moved this to outside the for loop since the time check was 
# after the for loop
now = time.time() 

with open('unique_coord.bed','w') as bedfile:
    for file in folder:
        print(f'Analyzing {file}')
        raw = []
        treat1 = []
        with open(file) as samfile:
            print('Importing coordinates')
            last_coord = ()
            for line in samfile:
                chrom = line.split('\t')[0]
                begin = coord_raw.split('_')[0]
                end = [int(r) for r in coord_raw.split('_')[1].split('-')] #0: begin, 1: end
                if last_coord == () or last_coord[0] != chrom:
                    last_coord = (chrom,begin,end)
                else:
                    if last_coord[0] == chrom:
                        if begin > last_coord[1] and end < last_coord[2]:
                            last_coord = (chrom,last_coord[1],end)
                        else:
                            treat1.append(({last_coord[0]},{last_coord[1]},{last_coord[2]}))
                            last_coord = (chrom,begin,end)
                    else:
                        treat1.append(({last_coord[0]},{last_coord[1]},{last_coord[2]}))
                        last_coord = (chrom,begin,end)       
            print('Begin merging nearby area...')                         
            last_coord = ()
            for chrom ,begin , end in treat1:
                if last_coord == ():
                    last_coord = (chrom,begin,end) 
                else:
                    if chrom == last_coord[0]:
                        if begin == last_coord[2] + 1:
                            last_coord = (last_coord[0],last_coord[1],end)
                        else:
                            bedfile.write(f'{last_coord[0]}\t{last_coord[1]}\t{last_coord[2]}\n')
                            last_coord = (chrom,begin,end)
                    else:
                        bedfile.write(f'{last_coord[0]}\t{last_coord[1]}\t{last_coord[2]}\n')
                        last_coord = (chrom,begin,end)

print(f'Program completed! Total time: {int(time.time() - now)} seconds')
ramzeek
  • 2,226
  • 12
  • 23