
I am iterating over a file and creating a set of unique "in" strings. I then iterate over a pair of FASTQ files and extract the sequence string for each read. For each read, I iterate through the "in" set to see whether any string has a Levenshtein distance of <= 2 and pick the first "in" sequence that does.

The problem I have is that having a loop within a loop makes this very slow.

Is there a way of speeding this up, or a better way of mapping the distance function to the whole list of "in" strings and returning the best match?

import pysam
import Levenshtein

# This part creates a set of barcode strings from the first column of the infile
barcodes = []
with open("umi_tools_inlist_2000.txt", "r") as inlist:
    for line in inlist:
        barcodes.append(line.split("\t")[0])

barcodes = set(barcodes)

# Next I iterate through two fastq files and extract the sequence of each read
with pysam.FastxFile("errors_fullbarcode_read_R1.fastq") as fh, pysam.FastxFile("errors_fullbarcode_read_R2.fastq") as fh2:
    for record_fh, record_fh2 in zip(fh, fh2):
        barcode = record_fh.sequence[0:24]

        # Compare the read's first 24 bases against every known barcode
        for b in barcodes:
            if Levenshtein.distance(barcode, b) <= 2:
                b = b + record_fh.sequence[24:]
                break
Adam_C
  • This may be relevant: https://stackoverflow.com/questions/5696859/how-do-i-use-kd-trees-for-determining-string-similarity , if you actually want to improve the quality of the algorithm. Or, if you change Levenshtein to a simpler metric, you may use a kd-tree directly to remove one of the loops: https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.cKDTree.html (see the sketch after these comments). Or you can switch to a more performant language, such as C++ or Go. – Askold Ilvento Nov 08 '20 at 09:44
  • Thanks, will take a look at cKDTree. I'm also thinking C++ might be the best place to go next, though. – Adam_C Nov 08 '20 at 09:50
  • Are there many calls to Levenshtein with the same two codes? If so, you might gain from caching those results, e.g. with `functools.lru_cache` or by storing them in a homegrown dictionary, so that you avoid recalculating the distance (a caching sketch also follows below). – DisappointedByUnaccountableMod Nov 08 '20 at 09:50
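
A minimal sketch of the kd-tree idea from the first comment. It assumes the metric is swapped from Levenshtein to Hamming distance over the fixed-length 24-base barcodes (so insertions/deletions are not caught) and that barcodes only contain A, C, G, T or N. Each barcode is one-hot encoded; the squared Euclidean distance between two one-hot encodings is twice their Hamming distance, so a radius query on a cKDTree returns every barcode within a Hamming distance of 2. The names barcodes, barcode and record_fh are taken from the question's code; everything else is illustrative.

import numpy as np
from scipy.spatial import cKDTree

ALPHABET = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}

def one_hot(seq):
    # Encode a fixed-length sequence as a flat one-hot vector.
    vec = np.zeros((len(seq), len(ALPHABET)))
    for i, base in enumerate(seq):
        vec[i, ALPHABET[base]] = 1.0
    return vec.ravel()

# Build the tree once from the known barcodes (a set in the question's code)
barcode_list = sorted(barcodes)
tree = cKDTree(np.array([one_hot(b) for b in barcode_list]))

# Inside the read loop: Euclidean distance = sqrt(2 * Hamming distance), so
# Hamming <= 2 means Euclidean <= 2 and Hamming >= 3 means Euclidean >= 2.45;
# a radius of 2.2 keeps exactly the Hamming <= 2 hits.
hits = tree.query_ball_point(one_hot(barcode), r=2.2)
if hits:
    b = barcode_list[hits[0]] + record_fh.sequence[24:]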

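A minimal sketch of the caching idea from the last comment, using functools.lru_cache. It only helps if the same read barcode occurs more than once, so that identical (barcode, b) pairs recur; the wrapper name is illustrative.

from functools import lru_cache

import Levenshtein

@lru_cache(maxsize=None)
def cached_distance(a, b):
    # Repeated (a, b) pairs are answered from the cache instead of being recomputed.
    return Levenshtein.distance(a, b)

# Inside the read loop, call cached_distance(barcode, b) in place of
# Levenshtein.distance(barcode, b).
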
1 Answer


You can use a dictionary instead of a list, where the keys are the barcodes.

If you don't want the loop within a loop, you can use a list comprehension and test the performance:

import pysam
import Levenshtein

barcodes = dict()
with open("umi_tools_inlist_2000.txt", "r") as inlist:
    for line in inlist:
        barcodes[line.split("\t")[0]] = 0

# Next I iterate through two fastq files and extract the sequence of each read
with pysam.FastxFile("errors_fullbarcode_read_R1.fastq") as fh, pysam.FastxFile("errors_fullbarcode_read_R2.fastq") as fh2:
    for record_fh, record_fh2 in zip(fh, fh2):
        barcode = record_fh.sequence[0:24]

        # Collect every known barcode within a Levenshtein distance of 2
        b_found = [b for b in barcodes.keys() if Levenshtein.distance(barcode, b) <= 2]
        # As per your logic, only the first match is used
        if b_found:
            new_b = b_found[0] + record_fh.sequence[24:]
            barcodes[new_b] = 0
            del barcodes[b_found[0]]
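
One further tweak, not from the answer itself: the list comprehension always scans every barcode, while the original break stopped at the first hit. A generator expression with next() keeps that short-circuit behaviour (names as in the code above):

# Stops at the first barcode within distance 2; None if there is no match.
b_first = next((b for b in barcodes if Levenshtein.distance(barcode, b) <= 2), None)
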
Aaj Kaal