I have a set of sequences in a table (e.g. TCGATCGATCGA) and I want to find those that are 90% matches. I am looking at the RecordLinkage package and its levenshteinSim function. I know I can manually enter each sequence and compare it, but I have over 1,000 sequences, so how can I automatically compare each row to every other row?
2 Answers
The same function is in Mako212's link, although I want to give some explanation since I use this package sometimes and it can be quite useful. We will use the levenshteinSim() function from the RecordLinkage package.
Package:
install.packages("RecordLinkage")
library(RecordLinkage)
Find those 90% matches:
data <- c("tcgartyu", "tcgart", "tckael", "tcgatcgatc", "tcgatcgatcg")
# [1] "tcgartyu" "tcgart" "tckael" "tcgatcgatc" "tcgatcgatcg"
# Similarity of each element of data to the reference sequence
matches <- levenshteinSim('tcgatcgatcga', data)
# [1] 0.42 0.42 0.25 0.83 0.92
# TRUE where the similarity is above 90%
matches_90 <- matches > 0.9
# [1] FALSE FALSE FALSE FALSE TRUE
So with this function you can find the rows that match at 90% or more (strictly above 90% in my example). You can then use those matches however you want.
Please note that the str1 and str2 arguments of the levenshteinSim() function need to be character vectors.
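To compare every sequence with every other one, as the question asks, one option is to build all index pairs with combn() and score them in a single levenshteinSim() call, since both arguments are vectors. This is only a sketch: it reuses the example sequences from above, and the names seqs, idx, pairs and hits are made up for illustration. In practice seqs would be the sequence column of your table converted with as.character().

```r
library(RecordLinkage)

# Example sequences from above; replace with your own character vector.
seqs <- c("tcgartyu", "tcgart", "tckael", "tcgatcgatc", "tcgatcgatcg")

# All unordered pairs of positions: a 2 x choose(n, 2) matrix.
idx <- combn(length(seqs), 2)

# One row per pair of sequences.
pairs <- data.frame(
  seq1 = seqs[idx[1, ]],
  seq2 = seqs[idx[2, ]],
  stringsAsFactors = FALSE
)

# levenshteinSim() compares the two vectors element by element,
# so a single call scores every pair.
pairs$sim <- levenshteinSim(pairs$seq1, pairs$seq2)

# Keep the pairs that are at least 90% similar.
hits <- pairs[pairs$sim >= 0.9, ]
print(hits)
```

With 1,000 sequences this produces choose(1000, 2) = 499,500 pairs, which should still fit comfortably in memory as a data frame.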
For more information, see https://cran.r-project.org/package=RecordLinkage .

Is there any way to automatically compare each of the values in the list against each other without manually inputting them all? – tamcle Jun 26 '19 at 22:52
For instance, using the sequences you have ("tcgartyu", "tcgart", "tckael", "tcgatcgatc", "tcgatcgatcg"), instead of only comparing 'tcgatcgatcga' to the rest of the data like you showed, I would like to compare "tcgartyu" to all the data, then "tcgart" to "tckael", "tcgatcgatc", "tcgatcgatcg", then "tckael" to "tcgatcgatc", "tcgatcgatcg", and lastly "tcgatcgatc" to "tcgatcgatcg", so that all possible pairs are tested for similarity. – tamcle Jun 27 '19 at 22:14
I would recommend looking at the stringdist package, specifically its stringdist() function, which returns a numeric distance indicating how far one string is from another. You should be able to experiment with thresholds to suit your purposes.
https://cran.r-project.org/web/packages/stringdist/stringdist.pdf
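As a rough sketch of how that could look for all pairs at once: stringdistmatrix() from the same package returns the full distance matrix. Two things here are my own assumptions rather than part of the recommendation above: method = "lv" (plain Levenshtein) and dividing by the longer string's length to turn a distance into a 0 to 1 similarity. The variable names are illustrative only.

```r
library(stringdist)

# Example sequences; replace with your own character vector.
seqs <- c("tcgartyu", "tcgart", "tckael", "tcgatcgatc", "tcgatcgatcg")

# Pairwise Levenshtein distances as an n x n matrix.
d <- stringdistmatrix(seqs, seqs, method = "lv")

# Assumption: scale each distance by the longer of the two strings
# to get a similarity between 0 and 1.
len <- nchar(seqs)
sim <- 1 - d / outer(len, len, pmax)

# Look only above the diagonal so each pair is reported once.
close <- which(sim >= 0.9 & upper.tri(sim), arr.ind = TRUE)
close_pairs <- data.frame(
  seq1 = seqs[close[, "row"]],
  seq2 = seqs[close[, "col"]],
  sim  = sim[close],
  stringsAsFactors = FALSE
)
print(close_pairs)
```

The 0.9 threshold is the one from the question; it is the value to adjust when experimenting.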
Best, Mostafa
