I'm trying to compute how similar two sequences of up to 6 variables are. Currently I'm using a collections.Counter to count the frequency of each variable and derive an edit distance from those counts.
By default, the cost of editing a variable (add/remove/change) is either 1 or 0. I'd like the cost to depend on which variable is involved and on a value I set for that variable.
That way I can say that certain variables are similar to other variables, and provide a value for how similar they are. I also want to say that certain variables are worth less or more distance than usual.
Here is my previous post as context: Modify Levenshtein-Distance to ignore order
Example:
# 'c' and 'k' are quite similar, so their distance from each other is 0.5 instead of 1
>>> groups = {('c', 'k'): 0.5}
# the letter 'e' is less significant, and 'x' is very significant
>>> exceptions = {'e': 0.3, 'x': 1.5}
>>> distance('woke', 'woc')
0.8
Explanation:
woke
k -> c = 1
woce
-e = 1
woc
Distance = 2
# With groups and exceptions:
woke
k -> c = 0.5
woce
-e = 0.3
woc
Distance = 0.8
How could I achieve this? Would this be possible to implement with this Counter algorithm?
Current code (thank you, David Eisenstat):
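import collections

def distance(s1, s2):
    # Count each variable in s1, then subtract the counts from s2.
    cnt = collections.Counter()
    for c in s1:
        cnt[c] += 1
    for c in s2:
        cnt[c] -= 1
    # This equals max(surplus, deficit): unmatched items pair up as
    # substitutions, and the rest are insertions or deletions.
    return sum(abs(diff) for diff in cnt.values()) // 2 + \
        (abs(sum(cnt.values())) + 1) // 2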
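For what it's worth, here is a rough sketch of one direction I can imagine, not a confirmed solution. It still uses Counter to cancel out the variables both sequences share, then tries every pairing of the leftovers and keeps the cheapest one, which should be affordable with at most 6 variables. The name weighted_distance, the weights parameter (my "exceptions"), and the rule that an unrelated substitution costs the larger of the two weights are all assumptions of mine; the example above only pins down the costs for grouped pairs and for single adds/removes.

```python
import collections
import itertools

def weighted_distance(s1, s2, groups=None, weights=None):
    """Order-insensitive edit distance with per-variable costs (a sketch)."""
    weights = weights or {}
    # Expand each group into unordered pairs so ('c', 'k') also matches ('k', 'c').
    pair_costs = {}
    for members, cost in (groups or {}).items():
        for a, b in itertools.combinations(members, 2):
            pair_costs[frozenset((a, b))] = cost

    def weight(v):
        # Cost of adding or removing a single variable; 1 unless overridden.
        return weights.get(v, 1.0)

    def pair_cost(a, b):
        if a is None:
            return weight(b)
        if b is None:
            return weight(a)
        # ASSUMPTION: an unrelated substitution costs the larger weight.
        sub = pair_costs.get(frozenset((a, b)), max(weight(a), weight(b)))
        # Never pay more than removing one variable and adding the other.
        return min(sub, weight(a) + weight(b))

    c1, c2 = collections.Counter(s1), collections.Counter(s2)
    left = list((c1 - c2).elements())   # variables only s1 has extra
    right = list((c2 - c1).elements())  # variables only s2 has extra
    if len(left) < len(right):
        left, right = right, left
    # Pad the shorter side with None so every leftover gets a partner.
    right += [None] * (len(left) - len(right))
    # Brute force over pairings: at most 6! = 720 permutations.
    return min(
        sum(pair_cost(a, b) for a, b in zip(left, perm))
        for perm in itertools.permutations(right)
    )

groups = {('c', 'k'): 0.5}
exceptions = {'e': 0.3, 'x': 1.5}
print(round(weighted_distance('woke', 'woc', groups, exceptions), 2))
```

With no groups or exceptions this gives 2 for 'woke' and 'woc', the same as the Counter version. For longer sequences the brute-force pairing would have to be replaced by a proper assignment algorithm.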