String matching using 'stringdist' and 'amatch' in R

Question

This is a question for anyone familiar with the 'stringdist' package.

I am trying to write a function that does the following:

Searches a very long list of characters such as this (only 16 of ~1 million shown):

> stripList
[1] "AAAAAAAAAAAAAAAAAAAAAAAAAAAADAABAAADCDDAD" "BAAAABBBDACDBABAAADDCBDADBCCBDCDDCDBCDDBA"
[3] "BDDABDCCAAABABBAACADCBDADBCCBDCDDCDBCDDBA" "AADBBACDDDBABDCABAADBCADCBDDDCCC"         
[5] "BBCDBBDCCBABDBCABDBBDBDDDADCDDADDDCDDCDDD" "BDDCDACABDCCBACBADCDCBDADBCCBDCDDCDDCDDBA"
[7] "BCDBADCBBDDBBBBDCBDADBCCBDCDDCDBCDDDDAAAA" "DABDDCDACABDCCBACBADC"                    
[9] "CABABDDCCCCACDCCDCCDADCAAAAAAAAACADADDADA" "BAABCBBBDBCDCDDADDDDCDDADBCCBDCDD"        
[11] "BBDDDACDCABDDDBBACDCBDADBCCDDCDDCDDCDDBDD" "BDDABDCCAAABABBBACADCBDADBCCBDCDDCDBCDDBA"
[13] "BDDBBBBDDBDABBACDBDCBDADBCCBDCDD"          "BDDABDCCAAABABBBACADCBDADBCCBDCDDCDBCDDBA"
[15] "DABDDCDACABDCCBACBADC"                     "BBADBACDDBABAACABCABCDCBDADBCCBDCDDCDDDDD"

For instances of each sequence of a query sequences list that is structured like this.

Ex:

SeqName1              # queryNames
BBCDBBDCCBABDBCA      # querySeqs

SeqName2              # queryNames
BBBDCCDCCCCDDDCAAACD  # querySeqs

I want to see how many times (if any) the query sequence shows up in any of my 'stripList' and allow for 1 insertion, 1 deletion, 1 substitution, and 1 transposition, and get an output like this:

>dt
 queryNames    TimesFound
 SeqName1           5
 seqName2          145

To do so I am using the 'amatch' function of the 'stringdist' package in the following manner:

dt<-rapply(as.list(querySeqs), function(x) amatch(x, stripList, method = "osa", useBytes = TRUE, weight = c(d = 0.5, i = 0.5, s = 0.9, t = 0.9), maxDist=0.9))
dt<-data.frame(dt)
colnames(dt) <- "TimesFound"
dt<-cbind(queryNames,dt)

I have a few questions:

In the 'amatch' function, when using method = "osa", how is the "weight" argument interpreted? As an example, if I were to use:
```
method = "osa", weight = c(d = 0.5, i = 0.5, s = 0.9, t = 0.9), maxDist=0.9
```
am I saying that I want 90% matching of my "querySeqs"? Meaning, do those fractions pertain to "querySeqs" or my table (stripList)?
What function does "maxDist" have? (Is it also interpreted as a percent?)
Is there a way to maximize the runtime efficiency of my code above (by perhaps using data.table, etc)? I only ask because my actual datasets are ~2000 sequence queries being searched through ~1,000,000 sequence lists.
Is there a better way than 'amatch' to look for whole sequences (not just substrings of them like 'agrep' does)?

I apologize if these are elementary questions, the documentation on this is vague to me and frankly, Im still learning.

Thanks in advance.

score 6 · Answer 1 · edited Jan 03 '16 at 20:38

This question seems to be up here for a while, and I've only just found it. In short:

(1) The weights are penalties for each action. It allows you to to tell amatch, that e.g. a deletion or insertion is ok, but you thing a transposition should be punished more. Judging by your question, you can probably leave the weights as they are.

(2) Maxdist tells amatch that if two strings are more than maxDist apart, they will never be considered a match. The default is zero so only exact matches are allowed. It is not a percentage. Relevant values of maxDist depend on what distance function is used. I think that you could use method='osa', maxDist=1 (allowing for a single transposition, insertion, deletion or substitution, but no combinations) Or possibly maxDist=4 if you're willing to allow combinations of up to four edits. For the edit-like distances, the distance is bound by the number of characters in the largest string. Please see the R-journal paper for the ranges of all supported distances. http://journal.r-project.org/archive/2014-1/loo.pdf

(3) I am optimizing the code all the time. Version 0.9 will use multithreading. I see you are using rapply, you could probably avoid this by just using

amatch(querySeqs,stripList,method='osa',maxDist=4)

(4) At the moment, I think that amatch is the best implementation for R (but as I'm the author, I may be biased :)).

String matching using 'stringdist' and 'amatch' in R

1 Answers1