Delete lines shorter than a certain length and the one above it (remove short sequences in a FASTA file)

Question

I have a file containing the following text:

>seq1
GAAAT
>seq2
CATCTCGGGA
>seq3
GAC
>seq4
ATTCCGTGCC

If a line that doesn't start with ">" is shorter than 5 characters, I want to delete it and the one right above it.

Expected output:

>seq2
CATCTCGGGA
>seq4
ATTCCGTGCC

I have tried sed -r '/^.{,5}$/d', but it also deletes the lines with ">".

HatLess · Answer 1 · 2022-07-17T20:11:10.747

3

Using sed

$ sed '/^>/N;/\n[A-Z]\{6,\}$/!d' input_file
>seq2
CATCTCGGGA
>seq4
ATTCCGTGCC

edited Jul 17 '22 at 20:11

answered Jul 17 '22 at 20:02

HatLess

10,622
5
14
32

score 3 · Answer 2 · answered Jul 18 '22 at 04:35

With your shown samples, with awk you could try following awk code. Simple explanation would be, if a line starts from > then setting variable val with value of current line and next will skip all further statements from here. Then if a line doesn't start from > if length of current line is greater than 5 then printing val ORS and current line.

awk '/^>/{val=$0;next} length($0)>5{print val ORS $0}' Input_file

Wiktor Stribiżew · Accepted Answer · 2022-07-18T07:39:39.613

With a GNU sed, you can use

sed -E '/>/N;/\n[^>].{0,4}$/d'

Details:

/>/ - finds lines with > (if it must be at the start, add ^ before >)
N - reads the line and appends it to the pattern space with a leading newline
\n[^>].{0,4}$ - a newline, a char other than a > (as the first char should not be >) and then zero to four chars till end of the string
d removes the value in pattern space.

See the online demo:

#!/bin/bash
s='>seq1
GAAAT
>seq2
CATCTCGGGA
>seq3
GAC
>seq4
ATTCCGTGCC'
sed -E '/>/N;/\n[^>].{0,4}$/d' <<< "$s"

Output:

>seq2
CATCTCGGGA
>seq4
ATTCCGTGCC

score 2 · Answer 4 · answered Jul 18 '22 at 14:21

Do not reinvent the wheel. Use common bioinformatics tools for that, such as seqtk or seqkit. Among other things, these tools can handle multiline FASTA sequences. Examples:

seqtk seq -L 5 in.fasta > out.fasta

seqkit seq -m 5 in.fa > out.fasta

To install these tools, use conda, specifically miniconda, for example:

conda create --channel bioconda --name seqtk seqtk
conda activate seqtk
# ... use seqtk here ...
conda deactivate

REFERENCES:

Remove sequences <300 bases from FASTA file: https://www.biostars.org/p/329680/
seqtk: https://github.com/lh3/seqtk
seqkit: https://bioinf.shenwei.me/seqkit/
conda: https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html

score 1 · Answer 5 · answered Jul 18 '22 at 22:16

very inelegant solution just to get it to work =(

{m,g}awk ' !_<NR && 5 < length($(NF-=_==$NF))\
                 &&                           \
         ($!NF =">seq" $!_ ORS $+NF)^_'  FS='[ \t\n]+' OFS='\n' RS='\n*>seq'

>seq2
CATCTCGGGA
>seq4
ATTCCGTGCC

score 0 · Answer 6 · answered Jul 18 '22 at 16:41

I would recommend Biopython for this. It makes FASTA processing convenient.

'''
Biopython version: 1.79
'''
from Bio import SeqIO
for seq_record in SeqIO.parse("input.fasta", "fasta"):
    if len(seq_record.seq)>=5:
        print(">",seq_record.id)
        print(seq_record.seq)

Delete lines shorter than a certain length and the one above it (remove short sequences in a FASTA file)

6 Answers6