
I have been stuck on a problem for three days... searched everywhere, posted on Biostar, still waiting for EMBL to respond to emails... would make a bounty if I had more rep.

After aligning sequences with EMBOSSwin needle() (pairwise global alignments) I get alignment files in pair format, with a .needle file extension. I want to use Biopython to read these alignments for later analysis.

I use AlignIO.read(open('alignment.needle'),'emboss') following the instructions in Biopython's AlignIO wiki but I keep getting an AssertionError.

My code:

>>> from Bio import AlignIO
>>> alignment = AlignIO.read(open("data/all/out/pair1_alignment.needle"), "emboss")

My error:

Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "C:\Python27\lib\Bio\AlignIO\__init__.py", line 423, in read
    first = next(iterator)
  File "C:\Python27\lib\Bio\AlignIO\__init__.py", line 370, in parse
    for a in i:
  File "C:\Python27\lib\Bio\AlignIO\EmbossIO.py", line 150, in __next__
    assert seq.replace("-", "") != ""
AssertionError

Example Alignment File:

Download the alignment file here

Picture of alignment file

Versions:

  • Windows 7
  • Python version 2.7.3
  • Biopython version 1.63
  • EMBOSS version 2.10.0-0.8

Clues:

I suspect this may be related to a warning message that the EMBOSS needle() function kept printing while I was making the alignments:

Warning: Sequence character string not found in ajSeqCvtKS
hello_there_andy
  • To state the obvious, Biopython's error means that your alignment file has at least one empty sequence (a string of zero length, or one made up only of gaps). Could you post your FASTA file? – David Cain Nov 25 '13 at 14:39
  • Thank you for reading @David, here is the file: https://www.dropbox.com/s/q3bw9zpiwb0nqz3/pair1_aligned.fasta – hello_there_andy Nov 25 '13 at 14:54

2 Answers


Duplicate post on BioStars, http://www.biostars.org/p/87226/#87399

This appears to be down to a subtle change in the EMBOSS output. You have an extremely old version, EMBOSS version 2.10.0 (February 2005), and your output file has lines like this:

gag             1288 --------------------------------------------------   1287

A newer version of EMBOSS (e.g. 6.3.0) gives lines like this:

gag             1287 --------------------------------------------------   1287

The Biopython parser is expecting the latter for alignment sections with no letters (e.g. when one sequence is much longer than the other), where the start and end coordinates agree. Please update your copy of EMBOSS, and then the parser should be happy. The current EMBOSS release is version 6.5.0.
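To check whether a given .needle file is affected before upgrading, something like the following could be used. This is only a minimal sketch and not part of Biopython: the function name find_mismatched_gap_lines and the regular expression are my own, the third sample line is made up for illustration, and it assumes each alignment row has the layout shown above (identifier, start coordinate, sequence, end coordinate).

```python
import re

# Matches an alignment row whose sequence part consists only of gap characters,
# e.g. "gag   1288 ----------   1287" (identifier, start, gaps, end).
GAP_ONLY_LINE = re.compile(
    r"^(?P<id>\S+)\s+(?P<start>\d+)\s+(?P<seq>-+)\s+(?P<end>\d+)\s*$"
)


def find_mismatched_gap_lines(lines):
    """Return (line_number, id, start, end) for gap-only rows where start != end."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = GAP_ONLY_LINE.match(line)
        if match and match.group("start") != match.group("end"):
            hits.append(
                (number, match.group("id"),
                 int(match.group("start")), int(match.group("end")))
            )
    return hits


sample = [
    # The first two rows are the ones quoted in this answer.
    "gag             1288 --------------------------------------------------   1287",
    "gag             1287 --------------------------------------------------   1287",
    # Hypothetical row with real letters; it is ignored because it is not gap-only.
    "pol                1 ATGGGTGCGAGAGCGTCAGTATTAAGCGGGGGAGAATTAGATCGATGGGA     50",
]

print(find_mismatched_gap_lines(sample))
```

To scan a real file, pass an open file handle instead of the sample list, e.g. find_mismatched_gap_lines(open("data/all/out/pair1_alignment.needle")). Any rows it reports are the old-style gap-only rows where the start and end coordinates disagree.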

Peter Cock
  • My goodness that was it... an old version of EMBOSS! Good grief.. I feel quite silly now, but thank you incredibly much for solving the problem. – hello_there_andy Nov 27 '13 at 01:35

The problem is that you're passing Biopython a file in the wrong format. An explanation follows.

Formatting

The format of the file you've linked to is srspair (see the header of pair1_aligned.fasta). It's worth noting that this is not the FASTA format - that's an entirely different format.

Delving into the source of Biopython's EmbossIO, we can see that the EmbossIterator (which is called by AlignIO.read when the format is 'emboss') is only meant to handle the formats pair and simple (see Alignment formats for an explanation of the various formats).

Solution

If you export EMBOSS's output in the pair format (then call AlignIO.read as you have before), that should solve your problem.
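As a rough sketch of what requesting the pair format might look like when driving needle from Python: the helper name build_needle_command and the input file names are hypothetical, the gap penalties are just placeholder values, and I'm assuming the executable is on your PATH under the name needle (it may be named differently in an EMBOSSwin install). The -aformat qualifier is what selects the alignment output format.

```python
def build_needle_command(a_seq, b_seq, out_file, gap_open=10.0, gap_extend=0.5):
    """Build an argument list that asks needle for 'pair' formatted output."""
    return [
        "needle",                     # assumed executable name
        "-asequence", a_seq,          # first input sequence file
        "-bsequence", b_seq,          # second input sequence file
        "-gapopen", str(gap_open),
        "-gapextend", str(gap_extend),
        "-aformat", "pair",           # alignment output format
        "-outfile", out_file,
    ]


# Hypothetical input file names, for illustration only.
command = build_needle_command("seq_a.fasta", "seq_b.fasta", "pair1_alignment.needle")
print(" ".join(command))

# To actually run it (requires EMBOSS to be installed):
#   import subprocess
#   subprocess.check_call(command)
```

Building the command as a list rather than a single string avoids quoting problems with Windows paths that contain spaces.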

David Cain
  • The error persists! But thank you so much for the insight. I did what you said by setting the output format to "pair" and even changed the output file extension to .needle. But the AssertionError is still raised. – hello_there_andy Nov 26 '13 at 00:40
  • Here is the new alignment file in "pair" format and .needle extension: http://bit.ly/18CPpAK – hello_there_andy Nov 26 '13 at 00:48
  • I replied on your duplicate question on Biostars, http://www.biostars.org/p/87226/#87399 - you are using an extremely old version of EMBOSS (from 2005), which produces slightly different output that the Biopython parser does not handle. Please update your EMBOSS and try again. – Peter Cock Nov 26 '13 at 12:08