0

I'm trying to compute how similar a sequence of up to 6 variables are. Currently I'm using a Collections Counter to return the frequency of different variables as my edit-distance.

By default, the distance in editing a variable (add/sub/change) is either 1 or 0. I'd like to change the distance depending on the variable and what value I set for that variable.

So I can say certain variables are similar to other variables, and provide a value for how similar they are. I also want to say certain variables are worth less or more distance than usual.

Here is my previous post as context: Modify Levenshtein-Distance to ignore order

Example:

# 'c' and 'k' are quite similar, so their distance from eachother is 0.5 instead of 1
>>> groups = {['c','k'] : 0.5}

# the letter 'e' is less significant, and 'x' is very significant
>>> exceptions = {'e': 0.3, 'x': 1.5}

>>> distance('woke', 'woc')
0.8

Explanation:

woke
k -> c = 1
woce
-e = 1
woc
Distance = 2

# With exceptions:
woke
k -> c = 0.5
woce
-e = 0.3
woc
Distance = 0.8

How could I achieve this? Would this be possible to implement with this Counter algorithm?

Current code (thank you David Eisenstat)

def distance(s1, s2):
    cnt = collections.Counter()
    for c in s1:
        cnt[c] += 1
    for c in s2:
        cnt[c] -= 1
    return sum(abs(diff) for diff in cnt.values()) // 2 + \
        (abs(sum(cnt.values())) + 1) // 2
Community
  • 1
  • 1
Luis
  • 85
  • 6

2 Answers2

0

Generally, this is the assignment problem. You have a set of characters from one string, a set of characters fro another string, and a set of costs for assigning a character from one string to a character in another string. You can add some dummy characters to both strings to treat the add/delete operations.

There are many known algorithms for the assignment problem. One of them is the Hungarian algorithm, the linked Wikipedia article contains links to some of the implementations. Other method is reducing the assignment problem to maxflow-mincost problem, which you can find simpler to adjust for add/delete operations.

Petr
  • 9,812
  • 1
  • 28
  • 52
0

I ended up dividing the process into a few stages then iterating through the strings for each stage. I'm not sure if its as efficient as it could be but it works.

Summing up what I was trying to achieve (in relation to Edit-distance algorithms)

  • Distance from one letter to another is 1. change j -> k = 1
  • 0 being no difference at all. e.g. change j -> j = 0
  • Similar letters can be worth less than 1 (specified by me) e.g. c and k sound the same, therefore c, k = 0.5, change c -> k = 0.5
  • Certain letters could be worth more or less (specified by me) e.g. x is uncommon so I want it to have more weight, x = 1.4, change x -> k = 1.4

Created 2 dictionaries, 1 for similar letters, 1 for exceptions

  1. Populate Counter - Iterate through both strings
  2. Match similar items - Iterate string1, if in similar dict, iterate string2, if in similar dict
  3. Update Counter - remove similar items,
  4. Find Distance - add up absolute frequencies, account for difference in string length
  5. Include exceptions distance - Account for exception values based on frequency of letters
Luis
  • 85
  • 6