3

The function below, from the "stringdist" package in R, returns the minimum number of edits needed to change one string into another. I want to find out how similar one string is to another, expressed as a percentage. Please help, and thanks.

library(stringdist)
stringdist("abc","abcd", method = "lv")
Jaap
Ashmin Kaul
  • Maybe [this answer](https://stackoverflow.com/a/23188848/5215145) will be useful. It really depends on your definition of similarity. – Andrey Kolyadin Sep 27 '17 at 11:16

3 Answers

8

You can use the function levenshteinSim from the RecordLinkage package, i.e.

#This gives the similarity
RecordLinkage::levenshteinSim('abc', 'abcd')
#[1] 0.75

#To get the normalized distance, just subtract the similarity from 1
1 - RecordLinkage::levenshteinSim('abc', 'abcd')
#[1] 0.25
Sotos
  • I didn't know about this package - really cool. According to the documentation, there is even a function `levenshteinDist` that directly calculates distance. – A. Stam Sep 27 '17 at 11:31
  • @A.Stam Yup. However that distance is not normalized – Sotos Sep 27 '17 at 11:32
  • This is exactly what I need. Can you give me the same result as a percentage? – Ashmin Kaul Sep 27 '17 at 11:35
  • Hi, a follow-up question: `percent(1:10)` from the scales package gives me values with a % sign, but as character data. I want to show numbers as percentages and still keep them numeric. Please help. – Ashmin Kaul Oct 05 '17 at 08:51
  • You can't put the symbol `%` after a value and keep it numeric. That is why we write percentages as decimal fractions (`0,...`). – Sotos Oct 05 '17 at 08:58
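
To illustrate the point made in the comments above, here is a minimal sketch in base R that keeps the percentage numeric and builds a character label only for display. The variable names (`sim`, `sim_pct`, `sim_label`) and the input values are assumptions made for the example, not part of the answer.

```r
# Assumed input: similarities on the 0-1 scale, e.g. the 0.75 found for 'abc' vs 'abcd'
sim <- c(0.75, 0.25)

# Numeric percentage: still a number, so it can be sorted, averaged, etc.
sim_pct <- 100 * sim

# Character label for display only; "%%" is a literal percent sign in sprintf()
sim_label <- sprintf("%.0f%%", sim_pct)

print(sim_pct)    # 75 25
print(sim_label)  # "75%" "25%"
```

Keeping the two side by side means calculations use the numeric column while tables and plots use the labelled one.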
3

Something like this might work:

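library(stringdist)  # provides stringdist()
d <- data.frame(original = c("abcd", "defg", "hij"), new = c("abce", "zxyv", "hijk"))
# Levenshtein distance for each pair of strings
d$dist <- stringdist(d$original, d$new, method = "lv")
# Similarity relative to the length of the original string
d$similarity <- 1 - d$dist / nchar(as.character(d$original))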

#### Returns:
####   original  new dist similarity
#### 1     abcd abce    1  0.7500000
#### 2     defg zxyv    4  0.0000000
#### 3      hij hijk    1  0.6666667
A. Stam
  • Hey, very close. For your first pair of strings I should get 0.75 instead of 0.25, which represents 75% similarity between the strings. Similarly, the second pair should give 0%, as the strings are completely dissimilar. Thanks for the help. – Ashmin Kaul Sep 27 '17 at 11:31
  • I've changed my answer to calculate similarity instead of distance. – A. Stam Sep 27 '17 at 11:33
  • I need the similarity figure as a percentage. Thank you so much for your help. – Ashmin Kaul Sep 27 '17 at 11:37
  • What do you mean by "in percentage"? You can just multiply the result by 100 if you want 75 instead of 0.75. Or is it something else you need? – A. Stam Sep 27 '17 at 11:45
  • Thanks a lot for your help. – Ashmin Kaul Sep 27 '17 at 12:07
  • Please note that when using `method = "lv"`, one should divide by the length of the longer string in each pair, i.e. `pmax(nchar(d$original), nchar(d$new))` (element-wise; plain `max` would return a single value for the whole column). Otherwise results will not range from 0 to 1 (0 to 100 as a percentage) but can become negative when the second string has more characters than the first. – ToWii Dec 04 '22 at 17:32
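
A rough sketch of the normalization suggested in the last comment, combined with the "multiply by 100" advice above. It assumes the stringdist package is installed; the fourth pair (`"ab"` vs `"abcdef"`) is a made-up example added to show the negative value, and the variable names are illustrative only.

```r
library(stringdist)

original <- c("abcd", "defg", "hij", "ab")
new      <- c("abce", "zxyv", "hijk", "abcdef")  # last pair is hypothetical

# Levenshtein distance for each pair
dist <- stringdist(original, new, method = "lv")

# Dividing by the original length alone can drop below zero
naive <- 1 - dist / nchar(original)
print(naive)           # 0.75 0.00 0.67 -1.00 (rounded)

# Dividing by the longer of the two lengths keeps the result in [0, 1]
similarity <- 1 - dist / pmax(nchar(original), nchar(new))
similarity_pct <- 100 * similarity
print(similarity_pct)  # 75.0 0.0 75.0 33.3 (rounded)
```

Note that the third pair changes from about 67% to 75% under this normalization, so the two definitions are not interchangeable; which one is right depends on what "similar" should mean for your data.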
2

Here is a function in base R. I added a check for vectors of equal length as inputs. You could change this logic if desired.

strSim <- function(v1, v2) {
  # Both inputs must have the same number of strings
  if (length(v1) != length(v2)) stop("vector lengths not equal")
  # 1 - edit distance / length of the longer string
  1 - (adist(v1, v2) / pmax(nchar(v1), nchar(v2)))
}

This returns

strSim("abc", "abcd")
     [,1]
[1,] 0.75
lmo
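
One thing to keep in mind with the base R approach: `adist()` returns the full matrix of distances between every element of `v1` and every element of `v2`, so for inputs longer than one string the result above is a matrix rather than one value per pair. If a pair-by-pair similarity is what you want, a possible variant (a sketch only; `str_sim_pairwise` is a hypothetical name, not part of the answer) could look like this:

```r
str_sim_pairwise <- function(v1, v2) {
  stopifnot(length(v1) == length(v2))
  # Compare element i of v1 only with element i of v2
  d <- mapply(function(a, b) adist(a, b)[1, 1], v1, v2, USE.NAMES = FALSE)
  # Normalize by the longer string of each pair
  1 - d / pmax(nchar(v1), nchar(v2))
}

res <- str_sim_pairwise(c("abc", "defg"), c("abcd", "zxyv"))
print(res)        # 0.75 0.00
print(100 * res)  # 75 0
```

Two empty strings would give `0 / 0` here, so that case needs its own handling if it can occur in your data.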