Weighted unordered string edit distance

Question

I need an efficient way of calculating the minimum edit distance between two unordered collections of symbols. As with the Levenshtein distance (which only works for sequences), I require insertions, deletions, and substitutions with different per-symbol costs. I'm also interested in recovering the edit script.

Since what I'm trying to accomplish is very similar to calculating string edit distance, I figured it might be called unordered string edit distance or maybe just set edit distance. However, Google doesn't turn up anything for those search terms, so I'd like to know whether the problem is known by another name.

To clarify, the problem would be solved by

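from itertools import permutations

def unordered_edit_distance(target, source):
    # Brute force: try every ordering of source, keep the cheapest edit.
    return min(edit_distance(target, source_perm)
               for source_perm in permutations(source))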

So for instance, unordered_edit_distance('abc', 'cba') would be 0, whereas edit_distance('abc', 'cba') is 2. Unfortunately, the number of permutations grows factorially with the length of source, so this is not practical even for moderately sized inputs.
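For reference, here is a minimal runnable sketch of the brute-force definition above. The cost functions `sub_cost`, `ins_cost` and `del_cost` are passed in explicitly, and the names `edit_distance` and `brute_force_unordered_edit_distance` are only illustrative. It assumes `sub_cost(x, x)` is 0 and is only usable on very small inputs.

```python
from itertools import permutations

def edit_distance(target, source, sub_cost, ins_cost, del_cost):
    """Weighted Levenshtein distance for turning source into target."""
    n, m = len(source), len(target)
    # d[i][j] = cheapest way to turn source[:i] into target[:j]
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + del_cost(source[i - 1])
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + ins_cost(target[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j] + del_cost(source[i - 1]),
                d[i][j - 1] + ins_cost(target[j - 1]),
                d[i - 1][j - 1] + sub_cost(source[i - 1], target[j - 1]),
            )
    return d[n][m]

def brute_force_unordered_edit_distance(target, source,
                                        sub_cost, ins_cost, del_cost):
    """Minimum over all orderings of source (factorial time)."""
    return min(
        edit_distance(target, perm, sub_cost, ins_cost, del_cost)
        for perm in permutations(source)
    )

# Unit costs, with a free "substitution" of a symbol by itself.
unit_sub = lambda x, y: 0 if x == y else 1
unit_ins = lambda x: 1
unit_del = lambda x: 1
```

It is mainly useful as a ground truth for checking faster approaches on small examples.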

EDIT: Made it clearer that operations are associated with different costs.

By "with different per-symbol costs", you mean that substituting a->b, a->c and b->c could all have different costs, correct? Or would two of those be guaranteed to have the same cost? — Bernhard Barker, Mar 12 '14 at 10:01

score 1 · Answer 1 · answered Mar 12 '14 at 09:17

Sort them (not necessary), then remove the items that are the same (and in equal numbers!) in both sets. If the remaining sets are equal in size, you need that number of substitutions; if one is larger, you also need some insertions or deletions. Either way, the number of operations equals the size of the larger set remaining after the first phase.

answered Mar 12 '14 at 09:17

CiaPan

This assumes the costs of insertions, deletions, and substitutions are the same. Take the strings 'abc' and 'xyz', which share no characters. One can be turned into the other with 3 substitution operations. However, if the cost of doing three deletes followed by three inserts is lower than doing three substitutions, the optimal procedure takes six operations. – Anders Johannsen Mar 12 '14 at 09:48
Right, I missed the note about costs, sorry about that. Anyway you don't need to change the identical items, so it's only the choice of 'replace or delete-and-insert' (unless the costs depend on what specific item is being deleted or inserted...) – CiaPan Mar 12 '14 at 11:18
In my case the costs actually depend on both of the symbols being replaced, or the symbol that is deleted or inserted. In technical terms, I have cost functions sub_cost(x,y), ins_cost(x), del_cost(x). – Anders Johannsen Mar 12 '14 at 11:54
Then you're in trouble. For transforming {A B} to {P Q} you can't tell in advance, without adding up the respective costs, whether it's cheaper to replace A with P and B with Q or to replace A with Q and B with P. I'm afraid you would have to check every possible pairing of items of both sets to find the optimum way to transform one set to the other. That means approx. `n!` assignments to check, where `n` is the size of the smaller set and `n!` is the number of permutations of its elements. – CiaPan Mar 12 '14 at 20:07
Additionally, if there is no separate guarantee that the cost function is given in its minimum possible values, you should search for the best way to achieve each possible change. For example for 'A to Q' replacement verify if it is cheapest to replace A-Q directly, or first delete A, then insert Q, or maybe do some chain of replacement, say A with C and C with P, then delete P, insert V and finally replace V with Q. Similarly instead of deleting X it might be cheaper to replace it with M and then delete M. And so on. – CiaPan Mar 12 '14 at 20:08
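The point about non-minimal cost functions can be sketched as a shortest-path computation. This is only an illustration under assumptions: costs are non-negative, the alphabet is small enough to enumerate, and the function name `close_costs` and the `EMPTY` marker are made up here. Each symbol is a node, an extra node stands for "no symbol", and Floyd-Warshall finds the cheapest chain of operations between any two of them.

```python
EMPTY = None  # hypothetical marker for "no symbol" (deleted / not yet inserted)

def close_costs(symbols, sub_cost, ins_cost, del_cost):
    """Return dist[x][y], the cheapest chain of edits turning x into y.

    dist[x][EMPTY] is the effective deletion cost of x and
    dist[EMPTY][y] is the effective insertion cost of y.
    """
    nodes = list(symbols) + [EMPTY]

    def direct(x, y):
        if x == y:
            return 0.0
        if x is EMPTY:
            return ins_cost(y)
        if y is EMPTY:
            return del_cost(x)
        return sub_cost(x, y)

    dist = {x: {y: direct(x, y) for y in nodes} for x in nodes}
    # Floyd-Warshall: allow k as an intermediate step on the way from x to y.
    for k in nodes:
        for x in nodes:
            for y in nodes:
                via_k = dist[x][k] + dist[k][y]
                if via_k < dist[x][y]:
                    dist[x][y] = via_k
    return dist
```

The closed costs could then be used in place of the raw cost functions when searching for the best pairing.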

score 1 · Answer 2 · answered Mar 12 '14 at 09:20

Your observation is more or less correct, but you are making a simple problem more complex.

Since source can be any permutation of the original source, you first need to check the difference at the character level.

Use two maps, each counting the occurrences of individual characters in your target and source strings:

For example: a: 2 c: 1 d: 100

Now compare the two maps: if a character is missing, you need to insert it, and if you have an extra character, you delete it. That's it.
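A minimal sketch of the two-map comparison, using `collections.Counter` as the map. The function name `insert_delete_plan` and the cost-function parameters are illustrative, and substitutions are deliberately ignored, as in the description above.

```python
from collections import Counter

def insert_delete_plan(target, source, ins_cost, del_cost):
    """Compare character counts; return (to_insert, to_delete, total_cost)."""
    need = Counter(target)
    have = Counter(source)
    to_insert = need - have   # characters missing from source
    to_delete = have - need   # extra characters in source
    cost = sum(ins_cost(c) * k for c, k in to_insert.items())
    cost += sum(del_cost(c) * k for c, k in to_delete.items())
    return to_insert, to_delete, cost
```

Counter subtraction drops zero and negative counts, so each result only holds the characters that actually differ.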

answered Mar 12 '14 at 09:20

Leo

The problem is that there is more than one way of editing a source string to get the target string. For instance, how do I figure out whether to do a substitution to rewrite a specific symbol rather than a deletion and an insertion? – Anders Johannsen Mar 12 '14 at 09:53
That's easy, because your choice is very limited. I think you are still thinking in terms of string edit distance, but your question is very different from it. Let's say your target is abc and your source is xya. You count the characters: a is the same, so ignore it; now you need x and y. How do you do it? You either insert them, or change b and c to x and y. So it's just a simple comparison of your insertion + deletion cost to your substitution cost. – Leo Mar 12 '14 at 10:00
I see what you're getting at. But in your example wouldn't I still have to list all of the different ways of rewriting xya to abc to compare the costs? And there could be many. Say it's cheaper to substitute a for b than to just insert b. Then the cheapest sequence of operations might be to substitute a for b, delete x and y, and insert a and c. – Anders Johannsen Mar 12 '14 at 11:48
OK, so you mean the costs of a -> b and a -> c may differ? First, you should edit your question to say explicitly that the substitution cost depends on the characters; most people would assume the substitution cost is always the same. And what about insertion and deletion? Are they the same for every character, or do they depend on it? – Leo Mar 12 '14 at 19:55

score 1 · Answer 3 · answered Mar 12 '14 at 11:56

Let's ignore substitutions for a moment.

Now it becomes a fairly trivial problem of determining the elements only in the first set (which would count as deletions) and those only in the second set (which would count as insertions). This can easily be done by either:

Sorting the sets and iterating through both at the same time, or
Inserting each element from the first set into a hash table, then removing each element from the second set from the hash table, with each element not found being an insertion and each element remaining in the hash table after we're done being a deletion
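The first option (sort both, then walk them together) might look like the sketch below. The function name is made up for illustration, and it assumes the symbols can be ordered with `<`.

```python
def split_insertions_deletions(first, second):
    """Return (deletions, insertions) needed to turn first into second,
    ignoring order and ignoring substitutions."""
    a, b = sorted(first), sorted(second)
    i = j = 0
    deletions, insertions = [], []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:        # present in both: nothing to do
            i += 1
            j += 1
        elif a[i] < b[j]:       # only in first: delete it
            deletions.append(a[i])
            i += 1
        else:                   # only in second: insert it
            insertions.append(b[j])
            j += 1
    deletions.extend(a[i:])     # leftovers of first
    insertions.extend(b[j:])    # leftovers of second
    return deletions, insertions
```

Duplicates are handled naturally, since each equal pair consumes one element from each side.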

Now, to include substitutions, all that remains is finding the optimal pairing of inserted elements to deleted elements. This is actually the stable marriage problem:

The stable marriage problem (SMP) is the problem of finding a stable matching between two sets of elements given a set of preferences for each element. A matching is a mapping from the elements of one set to the elements of the other set. A matching is stable whenever it is not the case that both:

Some given element A of the first matched set prefers some given element B of the second matched set over the element to which A is already matched, and

B also prefers A over the element to which B is already matched

Which can be solved with the Gale-Shapley algorithm:

The Gale–Shapley algorithm involves a number of "rounds" (or "iterations"). In the first round, first a) each unengaged man proposes to the woman he prefers most, and then b) each woman replies "maybe" to her suitor she most prefers and "no" to all other suitors. She is then provisionally "engaged" to the suitor she most prefers so far, and that suitor is likewise provisionally engaged to her. In each subsequent round, first a) each unengaged man proposes to the most-preferred woman to whom he has not yet proposed (regardless of whether the woman is already engaged), and then b) each woman replies "maybe" to her suitor she most prefers (whether her existing provisional partner or someone else) and rejects the rest (again, perhaps including her current provisional partner). The provisional nature of engagements preserves the right of an already-engaged woman to "trade up" (and, in the process, to "jilt" her until-then partner).
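For readers who want to see the quoted procedure as code, here is a minimal sketch. It assumes both sides have the same size and every participant ranks everyone on the other side; the function and parameter names are illustrative. It processes one proposer at a time instead of in rounds, which yields the same proposer-optimal stable matching.

```python
def gale_shapley(proposer_prefs, receiver_prefs):
    """proposer_prefs[p] lists receivers, most preferred first;
    receiver_prefs[r] lists proposers likewise.
    Returns a dict mapping each receiver to its final proposer."""
    # rank[r][p] = position of p in r's list (lower is better)
    rank = {r: {p: k for k, p in enumerate(prefs)}
            for r, prefs in receiver_prefs.items()}
    next_choice = {p: 0 for p in proposer_prefs}  # next receiver to try
    engaged_to = {}                               # receiver -> proposer
    free = list(proposer_prefs)
    while free:
        p = free.pop()
        r = proposer_prefs[p][next_choice[p]]
        next_choice[p] += 1
        current = engaged_to.get(r)
        if current is None:
            engaged_to[r] = p                     # r was unengaged
        elif rank[r][p] < rank[r][current]:
            engaged_to[r] = p                     # r trades up
            free.append(current)                  # jilted partner is free again
        else:
            free.append(p)                        # r says no
    return engaged_to
```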

We just need to get the cost correct. To pair an insertion and deletion, making it a substitution, we'll lose both the cost of the insertion and the deletion, and gain the cost of the substitution, so the net cost of the pairing would be substitutionCost - insertionCost - deletionCost.

Now the above algorithm guarantees that all insertions and deletions get paired. We don't necessarily want this, but there's an easy fix: just create a bunch of "stay-as-is" elements (on both the insertion and deletion side). Any insertion or deletion paired with a "stay-as-is" element would have a cost of 0 and would remain an insertion or deletion, and nothing would happen when two "stay-as-is" elements end up paired.
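Here is one way the pairing step could be sketched, with a deliberate swap that should be stated plainly: instead of Gale-Shapley, the code below solves a minimum-cost assignment (Hungarian algorithm), because a stable matching is not in general the one with the lowest total cost. The net pairing cost and the "stay-as-is" padding follow the description above; all function names are illustrative. It also assumes that leaving a symbol shared by both collections untouched is optimal.

```python
from collections import Counter

def min_cost_assignment(cost):
    """Hungarian algorithm, O(n^3), for a square cost matrix.
    Returns (total_cost, col_of_row)."""
    n = len(cost)
    inf = float("inf")
    u = [0.0] * (n + 1)          # row potentials
    v = [0.0] * (n + 1)          # column potentials
    p = [0] * (n + 1)            # p[j] = row matched to column j (1-based)
    way = [0] * (n + 1)          # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (n + 1)
        used = [False] * (n + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, n + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(n + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0:                # flip the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    col_of_row = [0] * n
    for j in range(1, n + 1):
        col_of_row[p[j] - 1] = j - 1
    return sum(cost[i][col_of_row[i]] for i in range(n)), col_of_row

def paired_unordered_edit_distance(target, source,
                                   sub_cost, ins_cost, del_cost):
    s, t = Counter(source), Counter(target)
    deletions = list((s - t).elements())    # only in source
    insertions = list((t - s).elements())   # only in target
    base = (sum(del_cost(x) for x in deletions)
            + sum(ins_cost(y) for y in insertions))
    n, m = len(deletions), len(insertions)
    size = n + m
    # Rows: deletions, then m "stay-as-is" dummies.
    # Columns: insertions, then n "stay-as-is" dummies. Dummy pairings cost 0.
    net = [[0.0] * size for _ in range(size)]
    for i, x in enumerate(deletions):
        for j, y in enumerate(insertions):
            net[i][j] = sub_cost(x, y) - del_cost(x) - ins_cost(y)
    saving, _ = min_cost_assignment(net)
    return base + saving
```

The second return value of `min_cost_assignment` gives the chosen pairing, which is what you would use to recover the edit script.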

Gassa · Answer 4 · 2014-03-12T09:44:59.753

Key observation: you are only concerned with how many 'a's, 'b's, ..., 'z's or other alphabet characters are in your strings, since you can reorder all the characters in each string.

So, the problem boils down to the following: having s['a'] characters 'a', s['b'] characters 'b', ..., s['z'] characters 'z', transform them into t['a'] characters 'a', t['b'] characters 'b', ..., t['z'] characters 'z'. If your alphabet is short, s[] and t[] can be arrays; generally, they are mappings from the alphabet to integers, like map <char, int> in C++, dict in Python, etc.

Now, for each character c, you know s[c] and t[c]. If s[c] > t[c], you must remove s[c] - t[c] characters c from the first unordered string (s). If s[c] < t[c], you must add t[c] - s[c] characters c to it.

Take X, the sum of s[c] - t[c] for all c such that s[c] > t[c], and you will get the number of characters you have to remove from s in total. Take Y, the sum of t[c] - s[c] for all c such that s[c] < t[c], and you will get the number of characters you have to add to s in total.

Now, let Z = min (X, Y). We can have Z substitutions, and what's left is X - Z deletions and Y - Z insertions. Thus the total number of operations is Z + (X - Z) + (Y - Z), or X + Y - min (X, Y).
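The counting argument can be written down in a few lines, assuming every operation costs 1 (so it does not cover the per-symbol costs from the question). The function name is illustrative; X, Y and Z are as defined above.

```python
from collections import Counter

def unit_cost_unordered_distance(s, t):
    """Number of unit-cost operations to turn multiset s into multiset t."""
    cs, ct = Counter(s), Counter(t)
    x = sum((cs - ct).values())   # X: surplus characters in s
    y = sum((ct - cs).values())   # Y: characters missing from s
    z = min(x, y)                 # Z: pairs handled by substitution
    return z + (x - z) + (y - z)  # same as x + y - min(x, y)
```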

*between two unordered collections of symbols*, hence it is not only letters, but technically you are right if you know what kind of symbols you use you can allocate a suitable sized matrix and access each entrance by `Matr[symbol]`. I would suggest to generalize your answer. — Alexandru Barbarosie, Mar 12 '14 at 09:35
It's mentioned in the first sentence. Clarified a bit nevertheless. Thank you. — Gassa, Mar 12 '14 at 09:45
