I am trying to efficiently map exact peptides (short sequences of amino acids in the 26 character alphabet A-Z1) to proteins (longer sequences of the same alphabet). The most efficient way to do this I'm aware of is an Aho-Corasick
trie (where peptides are the keywords). Unfortunately I can't find a version of AC in R that will work with a non-nucleotide alphabet (Biostrings' PDict
and Starr's match_ac
are both hard-coded for DNA).
As a crutch I've been trying to parallelize a basic grep approach. But I'm having trouble figuring out a way to do so without incurring significant IO overhead. Here is a brief example:
peptides = c("FSSSGGGGGGGR","GAHLQGGAK","GGSGGSYGGGGSGGGYGGGSGSR","IISNASCTTNCLAPLAK")
if (!exists("proteins"))
{
biocLite("biomaRt", ask=F, suppressUpdates=T, suppressAutoUpdate=T)
library(biomaRt)
ensembl = useMart("ensembl",dataset="hsapiens_gene_ensembl")
proteins = getBM(attributes=c('peptide', 'refseq_peptide'), filters='refseq_peptide', values=c("NP_000217", "NP_001276675"), mart=ensembl)
row.names(proteins) = proteins$refseq_peptide
}
library(snowfall)
library(Biostrings)
library(plyr)
sfInit(parallel=T, cpus=detectCores()-1)
allPeptideInstances = NULL
i=1
increment=100
count=nrow(proteins)
while(T)
{
print(paste(i, min(count, i+increment), sep=":"))
text_source = proteins[i:min(count, i+increment),]
text = text_source$peptide
#peptideInstances = sapply(peptides, regexpr, text, fixed=T, useBytes=T)
peptideInstances = sfSapply(peptides, regexpr, text, fixed=T, useBytes=T)
dimnames(peptideInstances) = list(text_source$refseq_peptide, colnames(peptideInstances))
sparsePeptideInstances = alply(peptideInstances, 2, .fun = function(x) {x[x > 0]}, .dims = T)
allPeptideInstances = c(allPeptideInstances, sparsePeptideInstances, recursive=T)
if (i==count | nrow(text_source) < increment)
break
i = i+increment
}
sfStop()
There are a few issues here:
peptideInstances
here is a dense matrix, so returning it from each worker is very verbose. I have broken it up into blocks so that I'm not dealing with a 40,000 (proteins) x 60,000 (peptides) matrix.- Parallelizing over peptides, when it would make more sense to parallelize over the proteins because they're bigger. But I got frustrated with trying to do it by protein because:
- This code breaks if there is only one protein in text_source.
Alternatively, if anyone is aware of a better solution in R, I'm happy to use that. I've spent enough time on this I probably would have been better served implementing Aho-Corasick.
1 Some of those are ambiguity codes, but for simplicity, ignore that.