I have adapted the sliding window generator function here (https://scipher.wordpress.com/2010/12/02/simple-sliding-window-iterator-in-python/) for my needs. It is my first experience with generator functions so I did a lot of background reading. Given my (still) limited experience, I'm soliciting advice for the following problem:
The code below does the following: I use the sliding-window function to iterate over a 5,500-character string (a DNA sequence of ~5,500 bp) in windows of roughly 250 characters with a step size of 1. For each chunk, I compare its GC content to the target GC value given on one line of a 750-line file. (GC content is the percentage of the string's elements that equal G or C.)
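To make the GC-content definition concrete, here is a minimal sketch. The helper name `gc_percent` is hypothetical (it is not part of the linked sliding-window code), and it assumes the sequence is a plain string of bases. It rounds down to a whole percentage so the result can be compared with an integer value from the file.

```python
def gc_percent(seq):
    """Return the GC content of seq as a whole-number percentage (rounded down)."""
    if not seq:
        raise ValueError("cannot compute GC content of an empty sequence")
    seq = seq.upper()
    gc_count = seq.count("G") + seq.count("C")
    # Integer arithmetic avoids floating-point rounding surprises.
    return 100 * gc_count // len(seq)


print(gc_percent("ATGCGC"))  # 4 of 6 bases are G or C -> 66
```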
However, for my downstream use I would really like to loop over these chunks in random order. From my Stack Overflow searching, I understand that it is not possible to shuffle a generator object, and that I cannot shuffle the windows inside the function because it produces the windows one at a time, resuming after each "yield" only when the next chunk is requested. (Please correct me if I've misunderstood.)
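A small sketch of that limitation, under stated assumptions: `windows` below is a hypothetical stand-in for the linked `slidingWindow` generator, not the original function. `random.shuffle` needs a sequence with a length, so it raises `TypeError` on a generator, while a list built from the same generator shuffles without complaint.

```python
import random


def windows(seq, size, step=1):
    """Hypothetical stand-in for the linked slidingWindow generator."""
    for start in range(0, len(seq) - size + 1, step):
        yield seq[start:start + size]


gen = windows("ACGTACGT", 4)
try:
    random.shuffle(gen)  # generators have no len(), so this fails
    shuffle_failed = False
except TypeError:
    shuffle_failed = True

# Materialising the windows into a list makes shuffling possible,
# at the cost of holding every window in memory at once.
chunks = list(windows("ACGTACGT", 4))
random.shuffle(chunks)
```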
Currently, my code looks something like this (using the generator function in the link above, of course):

    with open('/pathtofile/file.txt') as f:
        for line in f:
            fields = line.rstrip().split("\t")
            # For each target, grab target length (column 8) and GC content (column 7)
            targ_length = int(fields[8])  # slidingWindow needs an integer window size
            gc = int(fields[7])
            # Window size = amplicon length minus length of fwd and rev primers
            # Use a sliding window function to go along "my_seq" (5,500bp sequence). Check GC content for each window.
            chunks = slidingWindow(my_seq, targ_length, step=1)
            found = False
            for window in chunks:
                # When GC content = same as file, save this window as the pos ctrl fragment & add primers to it
                gc_count = window.count("G") + window.count("C")
                # Whole-number percentage, rounded down; integer maths avoids float rounding errors
                gc_frac = 100 * gc_count // len(window)
                # if (gc - 5) < gc_frac < (gc + 5):
                if gc_frac == gc:
                    found = True
                    # Store this piece
                    break
            if not found:
                # Store some info to look up later
                pass

Does anyone have ideas for the best approach? To me the most obvious one (also based on Stack Overflow searches) is to rewrite it without a generator function. I'm concerned about looping 750 times over a list containing roughly 5,251 elements. Should I be? Generators seemed like an elegant solution to what I want to do, until I decided I want to randomize the chunk order. It seems clear I need to sacrifice some efficiency to do this, but I'm wondering whether more experienced coders have some clever solutions. Thanks!
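For scale, 750 passes over about 5,251 windows is roughly 3.9 million window checks in the worst case. One possible direction, sketched below as an assumption rather than a tested solution, is to keep a generator but shuffle the window start positions instead of the windows themselves. The function name `random_windows` and its `rng` parameter are hypothetical; only a list of integers is held in memory, and each slice is still produced lazily.

```python
import random


def random_windows(sequence, win_size, step=1, rng=random):
    """Yield (start, window) pairs covering every window once, in random order.

    Hypothetical helper: shuffles the start indices, then slices lazily.
    """
    starts = list(range(0, len(sequence) - win_size + 1, step))
    rng.shuffle(starts)
    for start in starts:
        yield start, sequence[start:start + win_size]


# Usage sketch with a short made-up sequence
my_seq = "ATGCGCATTAGC"
for start, window in random_windows(my_seq, 4):
    gc_frac = 100 * (window.count("G") + window.count("C")) // len(window)
    if gc_frac == 50:
        break  # first randomly visited window with 50% GC
```

Because the loop can still `break` on the first match, windows after the match are never sliced, so the early-exit behaviour of the generator version is kept.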