I'm sure this is easy to do, but I have very limited bioinformatics experience.

I have many FASTA files (around 100,000) that contain alignments of different genes from the same 12 species. Each file looks something like this:

>dmel
ACTTTTGATACAATTAAC
>dsim
AATCCCAGACAAATTAAG
>dsec
AGTTTTGCAATGGTAAAT
>dere
TGGAATATTAGACGAATT 
...

Not all of the files are ordered in the same way, and I would like them all to be. They could be sorted alphabetically if that is easier; it doesn't matter how they are ordered, as long as all of the files are sorted the same way. Sorted alphabetically, the file above would look like this:

>dere
TGGAATATTAGACGAATT
>dmel
ACTTTTGATACAATTAAC
>dsec
AGTTTTGCAATGGTAAAT
>dsim
AATCCCAGACAAATTAAG
...

Any script that does this automatically would be much appreciated.

Edit: I have been using a shell script based on sed that works but is problematic. It works when the number of files is small, but in this particular case it creates duplicated files with different names. The script reads:

#!/bin/bash
echo
for i in {0..114172}; do
#sed -e '$ d' bloque.fasta.trim$i >b0.fasta.trim
#sed -e 's/ /ñ/g' <b0.fasta.trim >b1.fasta.trim
sed -e 's/ /ñ/g' <bloque.fasta.trim$i >b1.fasta.trim
tr "\n" " " <b1.fasta.trim >b2.fasta.trim
sed -e 's/ //g' < b2.fasta.trim >b3.fasta.trim
sed -e 's/>/\n>/g' < b3.fasta.trim >b4.fasta.trim
sed '1d' b4.fasta.trim >b5.fasta.trim
sort b5.fasta.trim >b6.fasta.trim 
sed -e 's/ñ/\n/g' < b6.fasta.trim >b7.fasta.trim$i
done

The unordered files are called bloque.fasta.trim$i. The script creates a set of files called b7.fasta.trim$i, and there should be one b7 file for each bloque file. The problem is that sometimes it duplicates a file but names the copy differently. I am sure there must be an easier approach that doesn't make duplication mistakes.

NKGon
  • Try BioStar or SEQanswers. If you want a solution from Stack Overflow then you need to state the programming language and show your coding attempt. – Chris_Rands Sep 01 '16 at 14:19
  • OK, I edited the post to show my sed script that works but with limitations – NKGon Sep 01 '16 at 15:58

1 Answer

Any script that does this automatically would be much appreciated.

I don't know if this is exactly what you want, but it's easy to sort FASTA files using Biopython.

First, install the module:

# If using debian/ubuntu
sudo apt-get install python-biopython

# On other operating systems, install pip and then run
pip install biopython

Now save this code in a file, e.g. fasta_sorter.py:

from Bio import SeqIO
import sys

infile = sys.argv[1]

records = SeqIO.parse(open(infile, 'r'), 'fasta')

records_dict = SeqIO.to_dict(records)

# Iterate over the record IDs in alphabetical order
for rec in sorted(records_dict):
    print(">%s\n%s" % (rec, records_dict[rec].seq))

After that, you can sort each of your files with:

python fasta_sorter.py /path/to/your.fasta > file.sorted.fasta

You can put it in your for loop.
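
As an alternative to calling the script once per file from a shell loop, the whole batch can be handled in a single Python process. The following is only a minimal sketch under some assumptions: it uses the standard library instead of Biopython, it sorts by the full header line rather than by the record ID, it writes each sequence on a single line, and the names read_fasta, sort_fasta_file, sort_all and the output directory are hypothetical. It loops over the files that actually match a pattern instead of over every number in a range, so indices with no file are simply skipped rather than producing a "No such file or directory" error.

```python
import glob
import os
import sys


def read_fasta(path):
    """Return a list of (header, sequence) tuples read from a FASTA file."""
    records = []
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new header closes the previous record, if there was one
                if header is not None:
                    records.append((header, "".join(chunks)))
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def sort_fasta_file(in_path, out_path):
    """Write the records of in_path to out_path, sorted by header."""
    with open(out_path, "w") as out:
        for header, seq in sorted(read_fasta(in_path)):
            out.write(">%s\n%s\n" % (header, seq))


def sort_all(pattern, out_dir):
    """Sort every file matching pattern into out_dir; return the output paths."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for in_path in sorted(glob.glob(pattern)):
        # Keep the input file name so each output maps to exactly one input
        out_path = os.path.join(out_dir, os.path.basename(in_path))
        sort_fasta_file(in_path, out_path)
        written.append(out_path)
    return written


if __name__ == "__main__":
    # Expects two arguments: a quoted file pattern and an output directory
    if len(sys.argv) == 3:
        for path in sort_all(sys.argv[1], sys.argv[2]):
            print(path)
```

Saved as, say, sort_all_fasta.py (a hypothetical name), it could be run as python sort_all_fasta.py 'bloque.fasta.trim*' sorted, with the pattern quoted so that Python expands it rather than the shell. Because every output keeps the name of its input and no shared temporary files are used, a failed step on one file cannot leak into the output of another.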

taniguti
  • It worked, thanks. It took quite a long time (around 2 hours), and it gave the following error message for each file: Traceback (most recent call last): File "fasta_sorter.py", line 6, in <module> records = SeqIO.parse(open(infile, 'r'), 'fasta') IOError: [Errno 2] No such file or directory: 'bloque.fasta.trim114172' – NKGon Sep 01 '16 at 20:54
  • @NKGon, seems like you do not have the files used as input to the sorter. e.g.: bloque.fasta.trim114172 – taniguti Sep 01 '16 at 21:48
  • I thought so, yet the files are there and the output files were created as well. I have no explanation, but somehow it works. – NKGon Sep 01 '16 at 22:08