How to speed up (fasta) subsampling program for Python?

Question

I have devised edited a little script that subsamples x lines from an original file. The original file is fasta, which has two lines for every sequence, and the program extracts those x number of sequences (the two lines together). This is how it looks:

#!/usr/bin/env python3
import random
import sys
# How many random sequences do you want?
num = int(input("Enter number of random sequences to select:\n"))

# Import arguments
infile = open(sys.argv[1], "r")
outfile = open(sys.argv[2], "w")

# Define lists
fNames = []
fSeqs = []
# Extract fasta file into the two lists
for line in infile:
    if line.startswith(">"):
        fNames.append(line.rstrip())
    else:
        fSeqs.append(line.rstrip())

# Print total number of sequences in the original file
print("There are "+str(len(fNames))+" in the input file")

# Take random items out of the list for the total number of samples required
for j in range(num):
    a = random.randint(0, (len(fNames)-1))
    print(fNames.pop(a), file = outfile)
    print(fSeqs.pop(a), file = outfile)

infile.close()
outfile.close()
input("Done.")

The creation of the lists with the ID and The nucleotides (line 1 and 2 respectively) goes pretty quick, but the print out takes forever. The numbers being extracted can go up to 2M but it starts going slow from the 10000.

I was wondering if there is any way to make it faster. Is the .pop the issue? Would it be faster if I created the a random list of unique numbers first and then extracted those?

Finally, the terminal does not go back to "normal finished state" after printing Done. and I do not know why. With all my other scripts I can inmediately type when they are done.

Have you tried [random.sample](https://docs.python.org/3/library/random.html#random.sample) instead, with a combined list? — maxy, Jan 13 '18 at 14:15
It worked like a Charm! I did not know what a combined list was, so Instead of a list for the sequences I used a diccionary with the IDs as key, and I called them up according to the ID. — Néstor, Feb 07 '18 at 16:01

score 0 · Accepted Answer · edited Oct 12 '19 at 06:30

random.sample (which was suggested in a comment) and a dictionary makes the script much faster. Here is the final script:

#!/usr/bin/env python3
import random
import sys
# How many random sequences do you want?
num = int(input("Enter number of random sequences to select:\n"))

# Import arguments
infile = open(sys.argv[1], "r")
outfile = open(sys.argv[2], "w")

# Define list and dictionary
fNames = []
dicfasta = {}
# Extract fasta file into the two lists
for line in infile:
    if line.startswith(">"):
        fNames.append(line.rstrip())
        Id = line.rstrip()
    else:
        dicfasta[Id] = line.rstrip()

# Print total number of sequences in the original file
print("There are "+str(len(fNames))+" in the input file")

# Create subsamples
subsample = []
subsample = random.sample(fNames, num)

# Take random items out of the list for the total number of samples required
for j in subsample:
    print(j, file = outfile)
    print(dicfasta[j], file = outfile)

infile.close()
outfile.close()
input("Done.")

How to speed up (fasta) subsampling program for Python?

1 Answers1