I iterate over a file to build a set of unique "in" strings (barcodes). I then iterate over a pair of FASTQ files and extract the sequence string of each read. For each read I scan the "in" set and pick the first entry whose Levenshtein distance to the read's barcode is <= 2.
The problem is that the loop within a loop makes this very slow.
Is there a way to speed this up, or a better way of mapping the function over the whole list of "in" strings and returning the best match?
import pysam
import Levenshtein

# This part creates a set of unique barcode strings from the input file
barcodes = set()
with open("umi_tools_inlist_2000.txt", "r") as inlist:
    for line in inlist:
        barcodes.add(line.split("\t")[0])
# Next I iterate through two fastq files and extract the sequence of each read
with pysam.FastxFile("errors_fullbarcode_read_R1.fastq") as fh, \
     pysam.FastxFile("errors_fullbarcode_read_R2.fastq") as fh2:
    for record_fh, record_fh2 in zip(fh, fh2):
        barcode = record_fh.sequence[0:24]
        for b in barcodes:
            if Levenshtein.distance(barcode, b) <= 2:
                # note: don't rebind the loop variable b here
                corrected = b + record_fh.sequence[24:]
                break