Calculating string similarity as a percentage

Question

The function below uses the `stringdist` package in R and returns the minimum number of edits needed to change one string into another. I would like to express how similar one string is to another as a percentage. Any help is appreciated, thanks.

library(stringdist)
stringdist("abc", "abcd", method = "lv")

Maybe [this answer](https://stackoverflow.com/a/23188848/5215145) will be useful. It really depends on your definition of similarity. – Andrey Kolyadin, Sep 27 '17 at 11:16

score 8 · Accepted Answer · answered Sep 27 '17 at 11:28

You can use the `levenshteinSim` function from the RecordLinkage package:

# This gives the similarity
RecordLinkage::levenshteinSim('abc', 'abcd')
#[1] 0.75

# To get the normalized distance, subtract the similarity from 1
1 - RecordLinkage::levenshteinSim('abc', 'abcd')
#[1] 0.25

answered Sep 27 '17 at 11:28

Sotos

I didn't know about this package - really cool. According to the documentation, there is even a function `levenshteinDist` that directly calculates distance. – A. Stam Sep 27 '17 at 11:31
@A.Stam Yup. However that distance is not normalized – Sotos Sep 27 '17 at 11:32
I need this only, can you give me the same result in a % – Ashmin Kaul Sep 27 '17 at 11:35
Hi, `percent(1:10)` from the scales package gives me values with a % sign, but as character data. I want to show numbers as percentages and still keep them numeric. Please help. – Ashmin Kaul Oct 05 '17 at 08:51
You can't have the symbol `%` after a value and set it as numeric. That is why we write the percentages as `0,...`. – Sotos Oct 05 '17 at 08:58
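
One way to reconcile the two requests above is to keep a numeric percentage for calculations and build a separate character label only for display. The sketch below uses base R only; the names `sim`, `sim_pct`, and `sim_label` are illustrative, and 0.75 is the similarity value from the answer above.

```r
# Similarity as a proportion (value taken from the answer above)
sim <- 0.75

# Numeric percentage: still usable in arithmetic and sorting
sim_pct <- 100 * sim

# Character label for display only; "%%" prints a literal percent sign
sim_label <- sprintf("%.0f%%", sim_pct)

is.numeric(sim_pct)      # TRUE
is.character(sim_label)  # TRUE
```

Keeping both columns side by side in a data frame is one option: the numeric one for filtering and sorting, the label for reports.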

score 3 · Answer 2 · answered Sep 27 '17 at 11:26

Something like this might work:

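library(stringdist)
d <- data.frame(original = c("abcd", "defg", "hij"), new = c("abce", "zxyv", "hijk"))
# Levenshtein distance between each pair
d$dist <- stringdist(d$original, d$new, method = "lv")
# Similarity relative to the length of the original string
d$similarity <- 1 - d$dist / nchar(as.character(d$original))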

#### Returns:
####   original  new dist similarity
#### 1     abcd abce    1  0.7500000
#### 2     defg zxyv    4  0.0000000
#### 3      hij hijk    1  0.6666667

edited Sep 27 '17 at 11:32

answered Sep 27 '17 at 11:26

A. Stam

Very close. For the first pair I should get 0.75 instead of 0.25, representing 75% similarity between the strings; similarly, the second pair should be 0%, as the strings are completely dissimilar. Thanks for the help. – Ashmin Kaul Sep 27 '17 at 11:31
I've changed my answer to calculate similarity instead of distance. – A. Stam Sep 27 '17 at 11:33
I need the similarity figure in percentage, Thank you so much for your help – Ashmin Kaul Sep 27 '17 at 11:37
What do you mean by "in percentage"? You can just multiply the result by 100 if you want 75 instead of 0.75. Or is it something else you need? – A. Stam Sep 27 '17 at 11:45
Thanks a lot for your help. – Ashmin Kaul Sep 27 '17 at 12:07
Please note that when using `method = "lv"`, one should divide by `max(nchar(d$original), nchar(d$new))`. Otherwise results will not range from 0 to 100 but can become negative when the number of characters of the second string is larger than the number of characters of the first string. – ToWii Dec 04 '22 at 17:32
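
The effect described in the last comment can be seen with a small sketch. It uses base R's `adist()` (generalized Levenshtein distance) in place of `stringdist()` so that it runs without extra packages, and `pmax()` rather than `max()` so that the longer length is taken per pair instead of over the whole column. The data and the names `original`, `new`, `lv`, `by_original`, and `by_longer` are made up for illustration.

```r
original <- c("abcd", "hij", "ab")
new      <- c("abce", "hijk", "abcdef")

# Edit distance for each pair (adist() on two single strings gives a 1x1 matrix)
lv <- mapply(function(a, b) adist(a, b)[1, 1], original, new, USE.NAMES = FALSE)

# Dividing by the first string's length can drop below 0
by_original <- 1 - lv / nchar(original)

# Dividing by the longer string of each pair stays within [0, 1]
by_longer <- 1 - lv / pmax(nchar(original), nchar(new))

data.frame(original, new, lv, by_original, by_longer)
```

For the third pair the distance (4) exceeds the length of the first string (2), so `by_original` is negative there while `by_longer` is not.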

score 2 · Answer 3 · answered Sep 27 '17 at 11:34

Here is a function in base R. I added a check for vectors of equal length as inputs. You could change this logic if desired.

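strSim <- function(v1, v2) {
  # Both inputs must have the same length
  if (length(v1) != length(v2)) stop("vector lengths not equal")
  # adist() returns a matrix of generalized Levenshtein distances
  1 - (adist(v1, v2) / pmax(nchar(v1), nchar(v2)))
}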

This returns:

strSim("abc", "abcd")
     [,1]
[1,] 0.75
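
Because `adist()` compares every element of `v1` with every element of `v2`, inputs longer than one string produce a matrix rather than a vector. One possible way to get a single value per pair is to take the diagonal. In this sketch the wrapper name `strSimPairs` is hypothetical, and `strSim` is repeated only so the block runs on its own.

```r
strSim <- function(v1, v2) {
  if (length(v1) != length(v2)) stop("vector lengths not equal")
  1 - (adist(v1, v2) / pmax(nchar(v1), nchar(v2)))
}

# Hypothetical wrapper: keep only element i of v1 against element i of v2
strSimPairs <- function(v1, v2) diag(strSim(v1, v2))

v1 <- c("abc", "hij")
v2 <- c("abcd", "hijk")

dim(strSim(v1, v2))        # 2 x 2 matrix of all combinations
100 * strSimPairs(v1, v2)  # one percentage per pair
```

The off-diagonal entries are divided by the lengths of a different pair, so they are best ignored when only pairwise similarity is wanted.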
