
If I have one vector of names, say:

a = c("tom", "tommy", "alex", "tom", "alexis", "Alex", "jenny", "Al", "michell")

I want to use levenshteinSim (or something similar) to get similarity scores within this vector, but without any element being scored against itself. For example, the first "tom" should be scored against the second "tom" (a different element with the same value), but no score should be returned for the first "tom" against itself.

I have done this before with two different vectors, a and b. However, if I pass the same vector as both arguments, each element (for example the first "tom") is scored against itself, which is what I want to avoid.

Is there a way to do this?

Sébastien Rochette
Rtab

1 Answer


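You can use combn to generate all unordered pairs of elements of a. Each pair is made of two different positions in the vector, so no element is ever compared with itself, while duplicated values at different positions (such as the two "tom" entries) are still compared with each other: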

a <- c("tom", "tommy", "alex", "tom", "alexis", "Alex", "jenny", "Al", "michell")

df <- data.frame(t(combn(a, 2)), stringsAsFactors = FALSE)
df$sim <- RecordLinkage::levenshteinSim(df$X1, df$X2)

head(df)
#    X1     X2 sim
# 1 tom  tommy 0.6
# 2 tom   alex 0.0
# 3 tom    tom 1.0
# 4 tom alexis 0.0
# 5 tom   Alex 0.0
# 6 tom  jenny 0.0
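
If you also need to know which positions were compared (so the two "tom" entries can be told apart), a variant is to run combn on the indices rather than on the values. The sketch below is only an illustration and makes two assumptions that are not in the answer above: it uses base R's utils::adist for the Levenshtein distance instead of RecordLinkage, and it normalises by the length of the longer string, which is my understanding of what levenshteinSim computes. The column names i, j, name_i and name_j are made up for the example.

```r
a <- c("tom", "tommy", "alex", "tom", "alexis", "Alex", "jenny", "Al", "michell")

# All unordered pairs of positions (i < j), so no element meets itself
idx <- t(combn(seq_along(a), 2))

pairs <- data.frame(i = idx[, 1], j = idx[, 2],
                    name_i = a[idx[, 1]], name_j = a[idx[, 2]],
                    stringsAsFactors = FALSE)

# Levenshtein distance for each pair, picked out of the full distance matrix
d <- utils::adist(a)[idx]

# Assumed normalisation: 1 - distance / length of the longer string
pairs$sim <- 1 - d / pmax(nchar(pairs$name_i), nchar(pairs$name_j))

# Identical names at different positions are still compared
pairs[pairs$name_i == pairs$name_j, ]
```

Keeping i and j in the result makes it easy to join the scores back to other columns of the original data.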
Scarabee
  • With larger sets I get an error. Is there a way, using combn, to set a key (based on another variable) so that only combinations with matching keys are returned? – Rtab Aug 22 '17 at 16:33
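
One possible way to approach the keyed version asked about in this comment is to split the positions by key and call combn inside each group, so only elements sharing a key are paired. This is a rough sketch rather than a tested solution for large data: the key used here (lower-cased first letter) is purely hypothetical, the distance comes from base R's utils::adist rather than RecordLinkage, and the comment does not say what the error was, so this may or may not address it. Since the number of pairs from combn grows as n(n-1)/2, restricting pairs to matching keys should at least reduce the amount of work.

```r
a <- c("tom", "tommy", "alex", "tom", "alexis", "Alex", "jenny", "Al", "michell")

# Hypothetical key: lower-cased first letter (replace with your own variable)
key <- tolower(substr(a, 1, 1))

# Positions grouped by key; groups with a single member produce no pairs
groups <- split(seq_along(a), key)
groups <- Filter(function(g) length(g) >= 2, groups)

# Unordered pairs of positions within each group, stacked into one matrix
idx <- do.call(rbind, lapply(groups, function(g) t(combn(g, 2))))

res <- data.frame(X1 = a[idx[, 1]], X2 = a[idx[, 2]],
                  key = key[idx[, 1]], stringsAsFactors = FALSE)

# Distance per pair only, avoiding the full n x n distance matrix
d <- mapply(function(x, y) utils::adist(x, y)[1, 1], res$X1, res$X2,
            USE.NAMES = FALSE)

# Assumed normalisation: 1 - distance / length of the longer string
res$sim <- 1 - d / pmax(nchar(res$X1), nchar(res$X2))
```

The Filter step matters because combn treats a single positive integer n as seq_len(n), which would silently create pairs that do not belong to the group.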