3

I have a file containing the following text:

>seq1
GAAAT
>seq2
CATCTCGGGA
>seq3
GAC
>seq4
ATTCCGTGCC

If a line that doesn't start with ">" is shorter than 5 characters, I want to delete it and the one right above it.

Expected output:

>seq2
CATCTCGGGA
>seq4
ATTCCGTGCC

I have tried sed -r '/^.{,5}$/d', but it also deletes the lines with ">".

Timur Shtatland
  • 12,024
  • 2
  • 30
  • 47
Honorato
  • 111
  • 6

6 Answers6

3

Using sed

$ sed '/^>/N;/\n[A-Z]\{6,\}$/!d' input_file
>seq2
CATCTCGGGA
>seq4
ATTCCGTGCC
HatLess
  • 10,622
  • 5
  • 14
  • 32
3

With your shown samples, with awk you could try following awk code. Simple explanation would be, if a line starts from > then setting variable val with value of current line and next will skip all further statements from here. Then if a line doesn't start from > if length of current line is greater than 5 then printing val ORS and current line.

awk '/^>/{val=$0;next} length($0)>5{print val ORS $0}' Input_file
RavinderSingh13
  • 130,504
  • 14
  • 57
  • 93
2

With a GNU sed, you can use

sed -E '/>/N;/\n[^>].{0,4}$/d'

Details:

  • />/ - finds lines with > (if it must be at the start, add ^ before >)
  • N - reads the line and appends it to the pattern space with a leading newline
  • \n[^>].{0,4}$ - a newline, a char other than a > (as the first char should not be >) and then zero to four chars till end of the string
  • d removes the value in pattern space.

See the online demo:

#!/bin/bash
s='>seq1
GAAAT
>seq2
CATCTCGGGA
>seq3
GAC
>seq4
ATTCCGTGCC'
sed -E '/>/N;/\n[^>].{0,4}$/d' <<< "$s"

Output:

>seq2
CATCTCGGGA
>seq4
ATTCCGTGCC
Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
2

Do not reinvent the wheel. Use common bioinformatics tools for that, such as seqtk or seqkit. Among other things, these tools can handle multiline FASTA sequences. Examples:

seqtk seq -L 5 in.fasta > out.fasta

seqkit seq -m 5 in.fa > out.fasta

To install these tools, use conda, specifically miniconda, for example:

conda create --channel bioconda --name seqtk seqtk
conda activate seqtk
# ... use seqtk here ...
conda deactivate

REFERENCES:

Remove sequences <300 bases from FASTA file: https://www.biostars.org/p/329680/
seqtk: https://github.com/lh3/seqtk
seqkit: https://bioinf.shenwei.me/seqkit/
conda: https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html

Timur Shtatland
  • 12,024
  • 2
  • 30
  • 47
1

very inelegant solution just to get it to work =(

{m,g}awk ' !_<NR && 5 < length($(NF-=_==$NF))\
                 &&                           \
         ($!NF =">seq" $!_ ORS $+NF)^_'  FS='[ \t\n]+' OFS='\n' RS='\n*>seq'
>seq2
CATCTCGGGA
>seq4
ATTCCGTGCC
RARE Kpop Manifesto
  • 2,453
  • 3
  • 11
0

I would recommend Biopython for this. It makes FASTA processing convenient.

'''
Biopython version: 1.79
'''
from Bio import SeqIO
for seq_record in SeqIO.parse("input.fasta", "fasta"):
    if len(seq_record.seq)>=5:
        print(">",seq_record.id)
        print(seq_record.seq)
Supertech
  • 746
  • 1
  • 9
  • 25