An algorithm
We are going to find an "identifier" for every class of anagrams. An identifier should be something that:
- is unique to this class: no two classes have the same identifier;
- can be computed when we're given a single word of the class: given two different words of the same class, we should compute the same identifier.
Once we've done that, all we have to do is group together the words that have the same identifier. There are several different ways of grouping words that have the same identifier; the main two ways are:
- sorting the list of words, using the identifier as a comparison key;
- using a "map" data structure, for instance a hash table or a binary tree.
Can you think of a good identifier?
One identifier I can think of is the letters of the word, in alphabetical order. For instance:
comedian --> acdeimno
dog --> dgo
god --> dgo
hello --> ehllo
hole --> ehlo
demoniac --> acdeimno
Implementation in Python
words = 'comedian dog god hello hole demoniac'.split()
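d = {}  # maps each identifier to the list of words that share it
for word in words:
    # the sorted letters of the word serve as its identifier
    d.setdefault(''.join(sorted(word)), []).append(word)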
print(list(d.values()))
[['comedian', 'demoniac'], ['dog', 'god'], ['hello'], ['hole']]
The explanation
The most important thing here is that for each word, we computed ''.join(sorted(word)). That's the identifier I mentioned earlier. In fact, I didn't write the earlier example by hand; I printed it with Python using the following code:
for word in words:
    print(word, '-->', ''.join(sorted(word)))
comedian --> acdeimno
dog --> dgo
god --> dgo
hello --> ehllo
hole --> ehlo
demoniac --> acdeimno
So what is this? For each class of anagrams, we've made up a unique word to represent that class. "comedian" and "demoniac" both belong to the same class, represented by "acdeimno".
Once we've managed to do that, all that is left is to group the words that have the same representative. There are a few different ways to do that. In the Python code, I used a dict (a dictionary), which is effectively a hash table mapping each representative to the list of corresponding words.
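To make that mapping visible, here is a minimal sketch that returns the dictionary itself rather than only its values. The function name group_anagrams and the use of collections.defaultdict are my own choices for illustration; the setdefault version above does the same job.

```python
from collections import defaultdict

def group_anagrams(words):
    """Map each representative (sorted letters) to the words of its class."""
    groups = defaultdict(list)
    for word in words:
        representative = ''.join(sorted(word))
        groups[representative].append(word)
    return dict(groups)

classes = group_anagrams('comedian dog god hello hole demoniac'.split())
print(classes)
# {'acdeimno': ['comedian', 'demoniac'], 'dgo': ['dog', 'god'], 'ehllo': ['hello'], 'ehlo': ['hole']}
```

Each key is a representative, and each value is the list of words that sort to it.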
Another way, if you don't know about map data structures, is to sort the list, which takes O(N log N) operations, using the representative as the comparison key:
print(sorted(words, key=lambda word: ''.join(sorted(word))))
['comedian', 'demoniac', 'dog', 'god', 'hello', 'hole']
Now, all words that belong to the same class of anagrams are adjacent. All that's left for you to do is iterate through this sorted list and group adjacent elements that have the same key. This pass is only O(N), so the dominant cost of the algorithm is sorting the list.
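One possible way to write that final pass is with itertools.groupby from the standard library, which groups consecutive elements that share a key. This is only a sketch: the helper name anagram_key is hypothetical, introduced here so the same key function can be used for both the sort and the grouping.

```python
from itertools import groupby

def anagram_key(word):
    """Representative of the word's anagram class: its letters, sorted."""
    return ''.join(sorted(word))

words = 'comedian dog god hello hole demoniac'.split()

# Sort so that words with the same key end up next to each other.
sorted_words = sorted(words, key=anagram_key)

# groupby only merges *adjacent* items with equal keys, hence the sort above.
groups = [list(group) for _, group in groupby(sorted_words, key=anagram_key)]
print(groups)
# [['comedian', 'demoniac'], ['dog', 'god'], ['hello'], ['hole']]
```

Note that groupby relies on the list already being sorted by the same key; on an unsorted list it would split a class into several groups.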