Looking for algorithm to do long pair wise nucleotide alignments

Question

I am trying to scan for possible SNPs and indels by aligning scaffolds to subsequences from a reference genome. (the raw reads are not available). I am using R/bioconductor and the `pairwiseAlignment function from the Biostrings package. This was working fine for smaller scaffolds, but failed when I tried to align as 56kbp scaffold with the error message:

Error in QualityScaledXStringSet.pairwiseAlignment(pattern = pattern, : cannot allocate memory block of size 17179869183.7 Gb

I am not sure if this is a bug or not ? ; I was under the impression that the Needleman-Wunsch algorithm used by pairwiseAlignment is an O(n*m) which I thought would imply the computational demand to be on the order of 3.1E9 operations (56K * 56k ~= 3.1E9). It seems the Needleman-Wunsch similarity matrix should as well take up on the order of 3.1 gigs of memory as well. Not sure if I'm not remembering big-o notation correctly or that is actually the memory overhead that would be needed to build the alignment given the overhead of the R scripting environment.

Does anybody have suggestions for a better alignment algorithm to use for aligning longer sequences? An initial alignment was already done using BLAST to find the region of the reference genome to align. I am not entirely confident BLAST's reliability for correctly placing indels and I have not yet been able to find an api as good as that provided by biostrings for parsing the raw BLAST alignments.

By the way, here is a code snippet that replicates the problem:

library("Biostrings")
scaffold_set = read.DNAStringSet(scaffold_file_name) #scaffold_set is a DNAStringSet instance
scafseq = scaffold_set[[scaffold_name]] #scaf_seq is a "DNAString" instance

genome = read.DNAStringSet(genome_file_name)[[1]] #genome is a "DNAString" instance

#qstart, qend, substart, subend are all from intial BLAST alignment step
scaf_sub = subseq(scafseq, start=qstart, end=qend) #56170-letter "DNAString" instance
genomic_sub = subseq(genome, start=substart, end=subend) #56168-letter "DNAString" instance

curalign = pairwiseAlignment(pattern = scaf_sub, subject = genomic_sub)
#that last line gives the error: 
#Error in .Call2("XStringSet_align_pairwiseAlignment", pattern, subject,  : 
#cannot allocate memory block of size 17179869182.9 Gb

The error does not happen with shorter alignments (hundreds of bases). I have not yet found the length cutoff where the error starts happening

Sort of looks like an integer overflow; the R you're using has a maximum vector size of 2^31-1, which is 2147483647. Output of `sessionInfo()` would be useful. Also from your problem statement is it necessary to do an alignment across the full sequence, or does it make sense to look at (overlapping) windows of reasonable size? — Martin Morgan, Sep 08 '12 at 00:34
Could you please show the code you used to perform the alignment? (It won't be reproducible without the sequences but at least it will help) — David Robinson, Sep 08 '12 at 03:18
64-bit integer overflow (17179869183.7GiB ~ 2**64 bytes); this is almost certainly a bug in the library. — nneonneo, Sep 08 '12 at 03:27
Possibly related? http://www.mentby.com/Group/bioconductor-list/biostrings-bug.html — nneonneo, Sep 08 '12 at 03:32
> sessionInfo() R version 2.13.0 (2011-04-13) Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) locale: [1] C/en_US.UTF-8/C/C/C/C attached base packages: [1] stats graphics grDevices utils datasets methods [7] base other attached packages: [1] Biostrings_2.20.1 IRanges_1.10.4 bio3d_1.1-4 loaded via a namespace (and not attached): [1] Biobase_2.12.2 tools_2.13.0 — samhiggins2001, Sep 08 '12 at 03:36
@user1655915 thanks for the sessionInfo; I'm not sure whether the integer overflow bug has been fixed in a more recent version (I'd suggest updating to R-2.15.1 and then `source("http://bioconductor.org/biocLite.R") to update your Bioc packages) but even if it has it's likely only to fail with a more reasonable error message. — Martin Morgan, Sep 08 '12 at 13:00
Updated my Bioc packages and included a code snippet in the original question above. — samhiggins2001, Sep 12 '12 at 20:49

score -1 · Answer 1 · answered Oct 09 '12 at 05:43

So I use Clustal as an alignment tool. Not sure about the specific performance, but it has never given me issues when doing multiple sequence alignments of large quantity. Here is a script that runs a whole directory of .fasta files and aligns them. You can modify the flags on the system call to suit your input/output needs. Just look at the clustal documentation. This is in Perl, I don't use R too much for alignments. You need to edit the executable path in the script to match where clustal is on your computer.

#!/usr/bin/perl 


use warnings;

print "Please type the list file name of protein fasta files to align (end the directory    path with a / or this will fail!): ";
$directory = <STDIN>;
chomp $directory;

opendir (DIR,$directory) or die $!;

my @file = readdir DIR;
closedir DIR;

my $add="_align.fasta";

foreach $file (@file) {
 my $infile = "$directory$file";
 (my $fileprefix = $infile) =~ s/\.[^.]+$//;
 my $outfile="$fileprefix$add";
 system "/Users/Wes/Desktop/eggNOG_files/clustalw-2.1-macosx/clustalw2 -INFILE=$infile -OUTFILE=$outfile -OUTPUT=FASTA -tree";
}

Programs like Clustal work best when all the sequences to align are about the same length which won't be the case when aligning scaffolds to the reference — dkatzel, Aug 22 '13 at 01:30

Looking for algorithm to do long pair wise nucleotide alignments

1 Answers1