How to locate the position of non-missing sequence in the fasta

Question

I have a fasta file including about 1,000 sequences

each sequence have head and tail with missing replaced with 'N' like this

>CHR1
NNNNNAAAGAGAGAGNNTTTAGAGAGGGACNNNNNN

I want to get the start and end position of the target sequence (if there are N in the middle of target sequence, is OK)

for my example with 36 nucleotide, the start and the end position of the target sequence is CHR1:6-30 (because there are 5 N in the head and 6 N in the tail)

which means I want to know the position of the first nucleotide which is not N from both ends of the full sequence

does anyone know how to make it for all of the sequences in the same fasta file?

thanks a lot

score 0 · Answer 1 · answered Oct 12 '22 at 02:44

One way using biopython:

from Bio import SeqIO


for rec in SeqIO.parse("file.fa", "fasta"):
    seq_length = len(rec.seq)

    leading_ns = seq_length - len(rec.seq.lstrip('N'))
    lagging_ns = seq_length - len(rec.seq.rstrip('N'))

    start = leading_ns + 1
    stop = seq_length - lagging_ns

    print(f'{rec.id}:{start}-{stop}')

How to locate the position of non-missing sequence in the fasta

1 Answers1