This is a question for anyone familiar with the 'stringdist' package.
I am trying to write a function that does the following:
Searches a very long list of characters such as this (only 16 of ~1 million shown):
> stripList
[1] "AAAAAAAAAAAAAAAAAAAAAAAAAAAADAABAAADCDDAD" "BAAAABBBDACDBABAAADDCBDADBCCBDCDDCDBCDDBA"
[3] "BDDABDCCAAABABBAACADCBDADBCCBDCDDCDBCDDBA" "AADBBACDDDBABDCABAADBCADCBDDDCCC"
[5] "BBCDBBDCCBABDBCABDBBDBDDDADCDDADDDCDDCDDD" "BDDCDACABDCCBACBADCDCBDADBCCBDCDDCDDCDDBA"
[7] "BCDBADCBBDDBBBBDCBDADBCCBDCDDCDBCDDDDAAAA" "DABDDCDACABDCCBACBADC"
[9] "CABABDDCCCCACDCCDCCDADCAAAAAAAAACADADDADA" "BAABCBBBDBCDCDDADDDDCDDADBCCBDCDD"
[11] "BBDDDACDCABDDDBBACDCBDADBCCDDCDDCDDCDDBDD" "BDDABDCCAAABABBBACADCBDADBCCBDCDDCDBCDDBA"
[13] "BDDBBBBDDBDABBACDBDCBDADBCCBDCDD" "BDDABDCCAAABABBBACADCBDADBCCBDCDDCDBCDDBA"
[15] "DABDDCDACABDCCBACBADC" "BBADBACDDBABAACABCABCDCBDADBCCBDCDDCDDDDD"
For instances of each sequence of a query sequences list that is structured like this.
Ex:
SeqName1 # queryNames
BBCDBBDCCBABDBCA # querySeqs
SeqName2 # queryNames
BBBDCCDCCCCDDDCAAACD # querySeqs
I want to see how many times (if any) the query sequence shows up in any of my 'stripList' and allow for 1 insertion, 1 deletion, 1 substitution, and 1 transposition, and get an output like this:
>dt
queryNames TimesFound
SeqName1 5
seqName2 145
To do so I am using the 'amatch' function of the 'stringdist' package in the following manner:
dt<-rapply(as.list(querySeqs), function(x) amatch(x, stripList, method = "osa", useBytes = TRUE, weight = c(d = 0.5, i = 0.5, s = 0.9, t = 0.9), maxDist=0.9))
dt<-data.frame(dt)
colnames(dt) <- "TimesFound"
dt<-cbind(queryNames,dt)
I have a few questions:
In the 'amatch' function, when using method = "osa", how is the "weight" argument interpreted? As an example, if I were to use:
method = "osa", weight = c(d = 0.5, i = 0.5, s = 0.9, t = 0.9), maxDist=0.9
am I saying that I want 90% matching of my "querySeqs"? Meaning, do those fractions pertain to "querySeqs" or my table (stripList)?
What function does "maxDist" have? (Is it also interpreted as a percent?)
Is there a way to maximize the runtime efficiency of my code above (by perhaps using data.table, etc)? I only ask because my actual datasets are ~2000 sequence queries being searched through ~1,000,000 sequence lists.
Is there a better way than 'amatch' to look for whole sequences (not just substrings of them like 'agrep' does)?
I apologize if these are elementary questions, the documentation on this is vague to me and frankly, Im still learning.
Thanks in advance.