Find equivalent words in a list through arbitrary mapping

Question

Imagine you have a list of words:

['cat', 'ant', 'bro', 'gro']

Using some arbitrary mapping that we construct ourselves {'c'=>'a', 'a'=>'n', 't'=>'t' }, we can map 'cat' to 'ant', and similarly we can find some arbitrary mapping to convert 'bro' to 'gro'.

This is the idea of finding words that are equivalent. I wrote a function that compares two words, and checks if they are equivalent through a mapping I construct on the fly:

def compareWords(w1, w2):
   mapping = {}
   for i in xrange(0, len(w1)):
       if w1[i] in map:
           if mapping[w1[i]] == w2[i]:
               continue
           else:
               return False
        else:
           mapping[w1[i]] = w2[i]

    return True

Example input:

['cat', 'ant', 'bro', 'gro']

Example output:

[['cat','ant'], ['bro', 'gro']]

Using this function on each pair of words in the list to return an output list of lists of equivalent words runs in O(n^2) since every pair will need to be compared. I haven't implemented this part where I use this method above on the input list and generate the output list, since this method isn't the efficient comparison I am looking for. Is there a way to find equivalent words in this input list in O(n) time?

Further Explanation: If I have a list of words, and I want to find all the "equivalent" words, I would need to make the checks in pairs of words. If all letters of the words I am comparing are unique, then another word in the list is only equivalent to this first word if all the letters in the second word are also unique. so abc can be mapped to xyz if xyz is in the list. xyz can be mapped to pqr if xyz is in the list. Given this, abc, xyz, and pqr are all equivalent. This is the kind of grouping I am looking for.

are you supposed to generate the `map` or is it predefined? Why are you reinitializing it as an empty dict. Also maybe call it `mapping` as `map` clashes with the built in function. — Bahrom, Mar 23 '16 at 17:27
I do not follow what is the input and output of the algorithm. You are looking for some mapping that according to it you can map any word to any word? If it does not have to be minimal: `return {a,...,z}X{a,...,z}`. If it does have to be minimal, your algorithm is incorrect. — amit, Mar 23 '16 at 17:27
If I have a list of words, and I want to find all the "equivalent" words, I would need to make the checks in pairs of words. If all letters of the word @niklas I am comparing are unique, then another word in the list is only equivalent to this first word if all the letters in the second word are also unique. so abc can be mapped to xyz if xyz is in the list. xyz can be mapped to pqr if xyz is in the list. Given this, abc, xyz, and pqr are all equivalent. This is the kind of grouping I am looking for. — robinhood91, Mar 23 '16 at 19:35
so the answer below is correct? if not give us an example input and example output please — niklas, Mar 23 '16 at 19:45

georg · Accepted Answer · 2016-03-23T19:49:00.850

If I understood correctly, you're looking for a way to check if an arbitrary relation, given as a list of pairs x,y is a function, that is, x1==x2 implies y1==y2. This can be done easily with sets:

def is_function(rel):
    return len(set(rel)) == len(set(x for x, y in rel))


print is_function(['ab', 'cd', 'xd']) # yes
print is_function(['ab', 'cd', 'ad']) # no

Two words are "equivalent" in the sense of your question if their letter-to-letter relation is a function:

def equivalent(a, b):
    return is_function(zip(a, b))

print equivalent('foo', 'baa') # yes
print equivalent('foo', 'bar') # no

If you consider equivalences between different words as different relations, there's no way you can avoid comparing each one with each. Moreover, your "equivalence" isn't even commutative, A ~ B doesn't imply B ~ A (e.g abc ~ xyx, but xyx !~ abc).

From your comment, your relation turns out to be bijective (note: your code is not correct for this case). The simplest (not necessarily the most efficient) way to split the list into "equivalence classes" would be to calculate a "hash" for each word, replacing letters with numbers, where 0 stands for the first letter, 1 for the second etc:

def eq_hash(word):
    return tuple(word.index(c) for c in word)

print eq_hash('mom') # 0 1 0
print eq_hash('dad') # 0 1 0

Now, you can group together all words that have the same "hash". These will be equivalent in the context of your question:

group = {}

words = ['mom', 'dad', 'aaa', 'bob', 'ccc', 'xyz', 'abc']

for w in words:
    h = eq_hash(w)
    group[h] = group.get(h, []) + [w]

print group.values()
# [['xyz', 'abc'], ['mom', 'dad', 'bob'], ['aaa', 'ccc']]

@georg : in my relation- > abc ~ xyz (because abc are unique and xyz are unique) , not xyx because each letter can only be mapped to a unique letter in the second word. so if mom is equivalent to dad and dad is equivalent to bab, then mom is equivalent to bab (assuming bab is a word). So this equivalence relation is associative. — robinhood91, Mar 23 '16 at 19:31
@georg This solution indeed results in the correct answer. It's beautiful, thanks! — robinhood91, Mar 23 '16 at 20:15

score 0 · Answer 2 · answered Mar 23 '16 at 19:53

If I understand your request, you want to group words such that every relation could be unique, not necessarily that it is unique. Using your examples, mom ~ dad ~ bab, but bad couldn't exist within that set since no mapping that can map from mom to dad (m->d, o->a) or dad to bab (d->b, a->a) can map to bad (for mom, m->b AND d? for dad, d to b once and skip the next?).

Assuming that your grouping is really commutative, then as soon as you have grouped a word, you should never have to revisit it, except to check against the first word of each group.

So, you would start by adding your first word to your first group. Then, for each additional word, you would need to compare it with the first word in each existing group - if it matches, you add it to the group; if it doesn't match any groups, you add it to a new group of its own.

In the worst case, this is O(N**2), if every word in your set belongs in its own group. In the best case, if all words in your set end up in the first group, it would be O(N), since you would only compare the first word in the only group to each additional word. If you have a log(N) distribution of sets, this algorithm is effectively O(N log(N)). So, it depends on your input set, but it will result in far fewer comparisons than if you check every pair.

Find equivalent words in a list through arbitrary mapping

2 Answers2