0

I have a bunch of sequences in a table (e.g. TCGATCGATCGA) and I want to find those that are at least 90% matches. I am looking at the RecordLinkage package and its levenshteinSim function. I know I can manually enter each sequence and compare it, but I have over 1,000 sequences, so how would I get it to automatically compare every row against every other row?

tamcle
  • Possible duplicate of [Calculating string similarity as a percentage](https://stackoverflow.com/questions/46446485/calculating-string-similarity-as-a-percentage) – Mako212 Jun 26 '19 at 20:42

2 Answers

1

The same function appears in Mako212's link, but I want to add some explanation since I use this package from time to time and find it quite useful. We will use the levenshteinSim() function from the RecordLinkage package.

Package:

install.packages("RecordLinkage")
library(RecordLinkage)

Find those 90% matches:

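data <- c("tcgartyu", "tcgart", "tckael", "tcgatcgatc", "tcgatcgatcg")
# [1] "tcgartyu"    "tcgart"      "tckael"      "tcgatcgatc"  "tcgatcgatcg"

# Similarity of one reference sequence to every element of data
matches <- levenshteinSim('tcgatcgatcga', data)
# [1] 0.42 0.42 0.25 0.83 0.92   (rounded to two decimals)

# Logical vector: TRUE where the similarity is above 0.9
matches_90 <- matches > 0.9
# [1] FALSE FALSE FALSE FALSE  TRUE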

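So with this function you get a logical vector flagging the rows whose similarity to the reference sequence is above 90%. Note that > 0.9 is a strict comparison; use >= 0.9 if exact 90% matches should count too. You can then use the result however you need, for example data[matches_90] to keep only the matching sequences.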

Please note that the str1 and str2 arguments of the levenshteinSim() function need to be character vectors.
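Because both arguments are character vectors, one way to compare every sequence with every other one is to pass the same vector twice through outer(). The following is only a sketch under that assumption (that levenshteinSim() is applied element-wise to the two expanded vectors outer() hands it); the names sim, idx and pairs_90 are made up for the illustration and the data is the toy vector from this answer.

```r
library(RecordLinkage)

# Example sequences reused from the answer above
data <- c("tcgartyu", "tcgart", "tckael", "tcgatcgatc", "tcgatcgatcg")

# Similarity of every sequence against every other one.
# outer() expands both inputs to equal-length character vectors
# and reshapes the returned similarities into a square matrix.
sim <- outer(data, data, levenshteinSim)
dimnames(sim) <- list(data, data)

# Keep each unordered pair once: upper triangle only, so no
# self-comparisons and no mirrored duplicates
idx <- which(upper.tri(sim) & sim >= 0.9, arr.ind = TRUE)

# One row per pair that is at least 90% similar
pairs_90 <- data.frame(
  seq1 = data[idx[, "row"]],
  seq2 = data[idx[, "col"]],
  similarity = sim[idx]
)
pairs_90
```

With these five sequences only the pair tcgatcgatc / tcgatcgatcg clears the threshold (one edit over a length of 11, so about 0.91). For roughly 1,000 sequences the matrix has about a million cells, which should still fit comfortably in memory.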

For more information, see https://cran.r-project.org/package=RecordLinkage .

Gainz
  • Is there any way to automatically compare each of the values in the list against each other without manually inputting them all? – tamcle Jun 26 '19 at 22:52
  • Do you have an example of the output you would like to have? – Gainz Jun 27 '19 at 13:28
  • For instance, using the sequences you have ("tcgartyu", "tcgart", "tckael", "tcgatcgatc", "tcgatcgatcg"), instead of only comparing 'tcgatcgatcga' to the rest of the data like you showed, I would like to compare "tcgartyu" to all the data, then "tcgart" to "tckael", "tcgatcgatc", "tcgatcgatcg", then "tckael" to "tcgatcgatc", "tcgatcgatcg", and lastly "tcgatcgatc" to "tcgatcgatcg", so that every possible pair is tested for similarity. – tamcle Jun 27 '19 at 22:14
0

I would recommend looking at the stringdist package, specifically its stringdist() function, which returns a numeric distance telling you how far one string is from another (lower means more similar). You should be able to play around with thresholds on that distance to suit your purposes.

https://cran.r-project.org/web/packages/stringdist/stringdist.pdf
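As a rough illustration of the threshold idea, the sketch below uses stringdistmatrix() from the same package to get all pairwise Levenshtein distances at once. The toy sequences are borrowed from the other answer, and the conversion to a 0-1 similarity (dividing by the longer string's length) is an assumption chosen to resemble levenshteinSim(), not something this package requires.

```r
library(stringdist)

# Toy sequences borrowed from the other answer
data <- c("tcgartyu", "tcgart", "tckael", "tcgatcgatc", "tcgatcgatcg")

# Full matrix of pairwise Levenshtein distances (number of edits)
d <- stringdistmatrix(data, data, method = "lv")

# Assumed normalisation: divide by the length of the longer string
# in each pair, so 1 means identical and 0 means nothing in common
longest <- outer(nchar(data), nchar(data), pmax)
sim <- 1 - d / longest

# Index pairs (row < col) that are at least 90% similar
which(upper.tri(sim) & sim >= 0.9, arr.ind = TRUE)
```

You could equally threshold d directly, for example keeping pairs that differ by at most one or two edits, if an absolute number of mismatches matters more than a percentage.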

Best, Mostafa