I am trying to scan for possible SNPs
and indels
by aligning scaffolds to subsequences from a reference genome. (the raw reads are not available). I am using R/bioconductor
and the `pairwiseAlignment function from the Biostrings package.
This was working fine for smaller scaffolds, but failed when I tried to align as 56kbp scaffold with the error message:
Error in QualityScaledXStringSet.pairwiseAlignment(pattern = pattern, : cannot allocate memory block of size 17179869183.7 Gb
I am not sure if this is a bug or not ? ; I was under the impression that the Needleman-Wunsch algorithm
used by pairwiseAlignment
is an O(n*m)
which I thought would imply the computational demand to be on the order of 3.1E9
operations (56K * 56k ~= 3.1E9)
. It seems the Needleman-Wunsch
similarity matrix should as well take up on the order of 3.1 gigs of memory as well. Not sure if I'm not remembering big-o notation correctly or that is actually the memory overhead that would be needed to build the alignment given the overhead of the R scripting
environment.
Does anybody have suggestions for a better alignment algorithm to use for aligning longer sequences? An initial alignment was already done using BLAST to find the region of the reference genome to align. I am not entirely confident BLAST
's reliability for correctly placing indels and I have not yet been able to find an api as good as that provided by biostrings for parsing the raw BLAST alignments.
By the way, here is a code snippet that replicates the problem:
library("Biostrings")
scaffold_set = read.DNAStringSet(scaffold_file_name) #scaffold_set is a DNAStringSet instance
scafseq = scaffold_set[[scaffold_name]] #scaf_seq is a "DNAString" instance
genome = read.DNAStringSet(genome_file_name)[[1]] #genome is a "DNAString" instance
#qstart, qend, substart, subend are all from intial BLAST alignment step
scaf_sub = subseq(scafseq, start=qstart, end=qend) #56170-letter "DNAString" instance
genomic_sub = subseq(genome, start=substart, end=subend) #56168-letter "DNAString" instance
curalign = pairwiseAlignment(pattern = scaf_sub, subject = genomic_sub)
#that last line gives the error:
#Error in .Call2("XStringSet_align_pairwiseAlignment", pattern, subject, :
#cannot allocate memory block of size 17179869182.9 Gb
The error does not happen with shorter alignments (hundreds of bases). I have not yet found the length cutoff where the error starts happening