
I have a file containing thousands of accession numbers that looks like this:

>NC_033829.1 Kallithea virus isolate DrosEU46_Kharkiv_2014, complete genome
AGTCAGCAACGTCGATGTGGCGTACAATTTCTTGATTACATTTTTGTTCCTAACAAAATGTTGATATACT

>NC_020414.2 Escherichia phage UAB_Phi78, complete genome
TAGGCGTGTGTCAGGTCTCTCGGCCTCGGCCTCGCCGGGATGTCCCCATAGGGTGCCTGTGGGCGCTAGG

If I want to split this into multiple files with one accession number each, then I can use the following code:

awk -F '|' '/^>/ {F=sprintf("%s.fasta",$2); print > F;next;} {print >> F;}' < yourfile.fa

I have a file with thousands of accession numbers (i.e. `>NC_*` headers) and want to split it so that each file contains ~5000 accession numbers. Since I am new to awk/bash/python, I'm struggling to find a neat solution.

Any ideas or comments are appreciated.

LDT

4 Answers


It wasn't clear from your question that an "accession number" is unique per input block (don't assume the people reading your question know anything about your domain - it's all just lines of text to us). It would have been clearer if you had phrased your question to just say you want 5000 blank-line-separated blocks per output file rather than 5000 accession numbers.

Having seen the answer you posted, it's now clear that this is what you should be using:

awk -v RS= -v ORS='\n\n' '
    (NR%5000) == 1 { close(out); out="myseq"(++n_seq)".fa" }
    { print > out }
' my_sequences.fa
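As a quick sanity check of the paragraph-mode logic (a sketch: chunk size 2 instead of 5000, a throwaway directory, and made-up `NC_` headers):

```shell
#!/bin/sh
cd "$(mktemp -d)"

# Tiny 4-record FASTA, blank-line-separated as in the question
printf '%s\n' \
  '>NC_000001.1 seq one'   'AAAA' '' \
  '>NC_000002.1 seq two'   'CCCC' '' \
  '>NC_000003.1 seq three' 'GGGG' '' \
  '>NC_000004.1 seq four'  'TTTT' > my_sequences.fa

# Same program as above, with 2 records per output file
awk -v RS= -v ORS='\n\n' '
    (NR%2) == 1 { close(out); out="myseq"(++n_seq)".fa" }
    { print > out }
' my_sequences.fa

grep -c '^>' myseq1.fa myseq2.fa    # 2 headers per file
```

`RS=` puts awk in paragraph mode, so `NR` counts blank-line-separated records rather than lines, which is why the modulus test lands exactly on record boundaries.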
Ed Morton

Assumptions: sections are separated by empty lines.

Algorithm:

  • split file on sections
  • extract accession number from section
  • output section to a filename named with accession number.

Awk terms: a "record" will be our section, a part of the file separated by an empty line (i.e. two newline characters one after the other). A "field" is usually separated by spaces; by separating on the space or `>` character, the second field will be the accession number.

Just set the record separator to two newlines and the field separator to `>` or space, and then output each record to a file named after the second field:

awk -v RS='' -v FS='[> ]' '{f=($2 ".txt"); print >> f; close(f)}'

@edit changed > to >> and RS='\n\n' to RS=''

@edit and also added close
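For example, run against a two-record file built from the headers in the question (a sketch; the output file names follow from the second field of each record):

```shell
#!/bin/sh
cd "$(mktemp -d)"

printf '%s\n' \
  '>NC_033829.1 Kallithea virus, complete genome' 'AGTC' '' \
  '>NC_020414.2 Escherichia phage, complete genome' 'TAGG' > in.fa

# RS='' splits on empty lines; FS='[> ]' makes $2 the accession number
awk -v RS='' -v FS='[> ]' '{f=($2 ".txt"); print >> f; close(f)}' < in.fa

ls
```

This leaves `NC_033829.1.txt` and `NC_020414.2.txt` in the directory, each holding one complete record.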

KamilCuk
  • Still I think I have not understood how to maintain several accession numbers in the same file. Let's say I had four NCs; how can I split my files so as to have two NCs in one file and two NCs in the other? Which parameter do I change? I am sorry for my stupid question – LDT Jul 25 '21 at 20:54
  • @LDT are you saying that the answer you accepted doesn't actually do what you want? It will do what you asked for in your question, so you should ask a new question if what you asked for isn't what you actually wanted. – Ed Morton Jul 26 '21 at 13:13
  • Dear @EdMorton, yes, exactly. I worked to add a solution. However, thank you so much for your time. I have learnt a lot and upvoted your answer – LDT Jul 27 '21 at 08:35
  • @LDT you're welcome but I hadn't posted an answer. I have now. – Ed Morton Jul 27 '21 at 12:46

Best to use Biopython's Bio.SeqIO to handle the reading and writing of FASTA files. Then all you need is some way to group the records (SeqRecord objects) as desired. My preference is to have the grouping function yield iterators:

from itertools import chain, islice

from Bio import SeqIO


def grouper(n, iterable):
    it = iter(iterable)
    while True:
        chunk_it = islice(it, n)
        try:
            first = next(chunk_it)
        except StopIteration:
            return
        yield chain((first,), chunk_it)


for idx, group in enumerate(grouper(5000, SeqIO.parse('input.fa', 'fasta')), 1):
    SeqIO.write(group, f'out-{idx}.fa', 'fasta')
Steve
  • Why do you need the `grouper` function here? Can you use `itertools.islice()` directly? – Chris_Rands Aug 04 '21 at 18:53
  • @Chris_Rands Well [itertools.islice()](https://docs.python.org/3/library/itertools.html#itertools.islice) just returns an iterator over the selected elements. What's needed is some way to iterate over all elements, hence the function. – Steve Aug 05 '21 at 11:29

Thank you a lot for your answers; I have learnt a lot. What I really want to do is to split the multifasta file into files with the same number of accession numbers. Here is my answer, after a long battle and the help of a colleague:

awk 'BEGIN {n_seq=0;} /^>/ {if(n_seq%5000==0){file=sprintf("myseq%d.fa",n_seq);} print >> file; n_seq++; next;} { print >> file; }' < my_sequences.fa

Here you create new fasta files where each file has 5000 accession numbers, i.e. headers.

Thank you all!

LDT
  • You don't need the BEGIN section; you shouldn't have identical print statements with a `next` between them, as that's redundant code; it will fail with "too many open files" in some awks if your input file is large, and in gawk it'll slow down as you aren't closing the output files as you go; and redirecting input from the shell isn't necessary, as awk is perfectly capable of opening an input file and can then use FILENAME. – Ed Morton Jul 27 '21 at 12:56
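Putting those comments into practice, the one-liner above might be reworked along these lines (a sketch: chunk size 2 instead of 5000 and a tiny generated input so it can be run anywhere; the `myseqN.fa` naming matches the original, where N is the index of the first sequence in the chunk):

```shell
#!/bin/sh
cd "$(mktemp -d)"

printf '%s\n' '>NC_1 a' 'AA' '>NC_2 b' 'CC' '>NC_3 c' 'GG' > my_sequences.fa

# No BEGIN block, a single print, files closed as we go,
# and awk opens the input file itself
awk '
    /^>/ && (n_seq++ % 2 == 0) { close(file); file = sprintf("myseq%d.fa", n_seq-1) }
    { print > file }
' my_sequences.fa

grep -c '^>' myseq0.fa myseq2.fa
```

With three headers and a chunk size of 2, this produces `myseq0.fa` (two headers) and `myseq2.fa` (one header).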