5

I have been given an exercise about anagrams, and it looked so easy that I suspect I'm missing something. The solution I implemented is the one I will present shortly, and I wanted to ask whether you can think of any optimization, change of approach, or problem with it. I implemented the algorithm in Java.

Now, the exercise. As input I have a text, and as output I should return whether each line of this text is an anagram of every other line. That is, for the input:

A Cab Deed Huffiest Minnows Loll
A Cab Deed Huffiest Minnow Lolls
A Cab Deed Shuffles Million Wont
A Cab Deed Shuffles Million Town

The program should return True. For input:

A Cab Deed Huffiest Minnows Loll
A Cab Deed Huffiest Minnow Lolls hi
A Cab Deed Shuffles Million Wont
A Cab Deed Shuffles Million Town

the output will have to be False (because of the second line, of course).

Now, what I thought is pretty straightforward:

  • I create two HashMaps: ref and cur.
  • I parse the first line of the text, filling ref. I count only alphabetic characters.
  • For each other line, I parse the line into cur and check whether cur.equals(ref): if not, I return false.
  • If I get to the end of the text, every line is an anagram of every other line, so I return true.
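The OP didn't post their code, but the steps above can be sketched roughly like this (helper names are mine, not the OP's):

```java
import java.util.HashMap;
import java.util.Map;

public class AnagramLines {
    // Build a letter -> occurrence-count map, ignoring everything
    // that is not an alphabetic character, case-insensitively.
    static Map<Character, Integer> countLetters(String line) {
        Map<Character, Integer> counts = new HashMap<>();
        for (char c : line.toLowerCase().toCharArray()) {
            if (Character.isLetter(c)) {
                counts.merge(c, 1, Integer::sum);
            }
        }
        return counts;
    }

    // True if every line has the same letter counts as the first one.
    static boolean allAnagrams(String[] lines) {
        if (lines.length == 0) return true;
        Map<Character, Integer> ref = countLetters(lines[0]);
        for (int i = 1; i < lines.length; i++) {
            if (!countLetters(lines[i]).equals(ref)) {
                return false; // early exit on the first mismatch
            }
        }
        return true;
    }
}
```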

And... this would be it. I tried it with an input text of 88,000 lines, and it runs pretty fast.

Any comments? Suggestions? Optimizations?

Thank you very much for the help.

mdm

3 Answers

5

Another option is:

  1. Strip all characters you don't care about from the string (punctuation, whitespace)
  2. Make it lowercase
  3. Sort the string
  4. Compare to the reference string (with .equals)
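As a sketch (method name is mine): reducing each line to a canonical sorted form makes the comparison a plain string equality.

```java
import java.util.Arrays;

public class SortedKey {
    // Canonical form of a line: keep only letters, lowercase them,
    // then sort. Two lines are anagrams of each other iff their
    // canonical forms are equal (compared with .equals).
    static String canonical(String line) {
        char[] letters = line.toLowerCase()
                             .replaceAll("[^a-z]", "")
                             .toCharArray();
        Arrays.sort(letters);
        return new String(letters);
    }
}
```

For example, `canonical("Dormitory")` and `canonical("Dirty Room")` both yield `"dimoorrty"`.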

I suspect your way is faster though.

EDIT:

Since @nibot disagrees with my even suggesting this, and I'm not one to argue back and forth without proof, here are three solutions.

They're all implemented very similarly:

  1. Convert line to lowercase
  2. Ignore non-alphabetic characters
  3. ?
  4. Check whether the result of step 3 matches the result from the first line

The ? part is one of:

  • Make a HashMap of character counts
  • Sorting the characters
  • Making a 26-int array (the ultimate hash table solution, but only works for the Latin alphabet)
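The 26-int array variant, for instance, looks roughly like this (my sketch, not the answer's exact benchmark code):

```java
import java.util.Arrays;

public class LetterArray {
    // Count each Latin letter into a fixed 26-slot array.
    // As noted above, this only works for a-z / A-Z.
    static int[] counts(String line) {
        int[] counts = new int[26];
        for (char c : line.toLowerCase().toCharArray()) {
            if (c >= 'a' && c <= 'z') {
                counts[c - 'a']++;
            }
        }
        return counts;
    }

    // Two lines are anagrams iff their count arrays are equal.
    static boolean sameCounts(String a, String b) {
        return Arrays.equals(counts(a), counts(b));
    }
}
```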

I ran them all with this:

// function is a java.util.concurrent.Callable&lt;Integer&gt;: call() runs one
// full benchmarked pass and returns an int checked against expectedResult.
public static void time(String name, int repetitions, Callable&lt;Integer&gt; function,
        int expectedResult) throws Exception {
    long total = 0;
    for (int i = 0; i &lt; repetitions; i++) {
        System.gc();
        long start = System.currentTimeMillis();
        int result = function.call();
        long end = System.currentTimeMillis();
        if (result != expectedResult) {
            System.out.println("Oops, " + name + " is broken");
            return;
        }
        total += end - start;
    }
    System.out.println("Execution of " + name + " took "
            + (total / repetitions) + " ms on average");
}

My file is similar to the one the OP posted, but made significantly longer, with a non-anagram about 20 lines from the end to ensure that the algorithms all work.

I consistently get results like this:

Execution of testWithHashMap took 158 ms on average
Execution of testWithSorting took 76 ms on average
Execution of testWithArray took 56 ms on average

The HashMap one could be significantly improved if a map specialized for primitive keys and values (a char → int map, as found in third-party primitive-collection libraries) were available, avoiding the autoboxing of every character and count.

But these aren't in the standard library, so I'm ignoring them (just like most programmers using Java would).

The moral of the story is that big O isn't everything. You need to consider the overhead and the size of n. In this case, n is fairly small, and the overhead of a HashMap is significant. With longer lines, that would likely change, but unfortunately I don't feel like figuring out where the break-even point is.

And if you still don't believe me, consider that GCC uses insertion sort in some cases in its C++ standard library.

Brendan Long
  • Sorting can only be slower than the obvious O(n) algorithm. – nibot Oct 04 '11 at 00:05
  • 2
    @nibot - I don't see what the downvote is for. They wanted to know other options, and this is another option. This option is pretty much guaranteed to use less memory, and depending on the length of the string and your hashing function, it can be faster too. Big O isn't everything. – Brendan Long Oct 04 '11 at 00:08
  • I'd say this is the best solution because it's very easy and it requires by far the least amount of memory :) @nibot: You make your life a little too easy... We're not sorting lines, we're sorting each line. This corresponds to mapping strings to equivalence classes w.r.t. permutations -> anagrams. Also, since the alphabet is quite finite you can have the generally beloved linear time by using radix sort or simple binning. – Michael Nett Oct 04 '11 at 03:12
  • @Stephen C - I call `System.gc()` **right before** I start the timer. The only thing inside the "timer" (`start` to `end`) is `function.call()` (and probably some overhead for calling `System.currentTimeMillis()`). – Brendan Long Oct 04 '11 at 15:48
  • But the flip side is that you are removing the (possible) overheads / effects of GC from the algorithm entirely. That is unrealistic too. The correct thing to do is to take the System.gc() call out. Then run the test with a large number of repetitions, discard the first few (to deal with "JVM warmup" effects) and then take the average to even out the GC "lumps". In short, this is still an unreliable benchmark. – Stephen C Oct 05 '11 at 01:22
  • @Stephen C - All benchmarks contain trade-offs. Mine makes the tests repeatable at the expense of ignoring memory usage. Since we're dealing with data structures that take around a hundred bytes of memory, there's a good chance this program wouldn't trigger a garbage collect anyway. If you doubt my results, feel free to fix it and show me yours. – Brendan Long Oct 05 '11 at 02:34
3

Assuming that your HashMap is a mapping from (character) -> (number of occurrences in the string), you pretty much have it.

I assume that you're supposed to ignore whitespace and punctuation, and treat uppercase and lowercase letters as the same. If you're not using any languages other than English, then a HashMap is overkill: you can simply use an array of 26 counts representing A..Z. If you need to support Unicode then the problem is of course much more complicated, as not only do you need to deal with possibly thousands of different kinds of letters, but you also have to define 'letter' (fortunately there exists character property data that helps with this) and 'lowercase/uppercase' (note that some languages don't have case, some can map two lowercase letters into a single uppercase one or vice-versa...). To say nothing of normalization :)
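To illustrate the Unicode side of this point, a count map can be keyed by code point rather than by char, so letters outside the basic range are handled too. This is only a sketch of that generalization (normalization, as the answer notes, is deliberately left out):

```java
import java.util.HashMap;
import java.util.Map;

public class UnicodeCounts {
    // Letter counts keyed by Unicode code point. Character.isLetter(int)
    // uses the Unicode character property data mentioned above, and
    // Character.toLowerCase(int) does a simple (locale-free) case fold.
    static Map<Integer, Integer> counts(String line) {
        Map<Integer, Integer> counts = new HashMap<>();
        line.codePoints()
            .filter(Character::isLetter)
            .map(Character::toLowerCase)
            .forEach(cp -> counts.merge(cp, 1, Integer::sum));
        return counts;
    }
}
```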

Karl Knechtel
  • Well, I thought about the array of 26 counts, but as you mentioned, the HashMap implementation was more "extensible", in case characters beyond the English alphabet needed to be added. But I don't know if that really makes sense... I'll think about it! Thanks! – mdm Oct 04 '11 at 00:03
2

Building on @Karl Knechtel's answer (and addressing your concern about supporting multiple alphabets):

  • Create interfaces (say) AnagramKey and AnagramKeyFactory. Design the rest of the application to be agnostic of the type of key used.

  • Create one implementation of the AnagramKey interface that internally uses an int[] to represent the character counts.

  • Create a second implementation of the AnagramKey interface that uses a HashMap<Character, Integer> to represent the character counts.

  • Create the corresponding factory implementations.

  • Choose between the two ways of representing the keys using a command line parameter, the Locale, or something else.
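A minimal sketch of this design, showing the interfaces and the array-backed implementation (all names are taken from the bullet points above; the HashMap-backed variant would implement the same interfaces):

```java
import java.util.Arrays;

// The key's equals/hashCode carry the anagram comparison;
// the rest of the application never looks inside it.
interface AnagramKey {
}

interface AnagramKeyFactory {
    AnagramKey keyFor(String line);
}

// Latin-alphabet-only implementation backed by an int[26].
class ArrayAnagramKey implements AnagramKey {
    private final int[] counts;

    ArrayAnagramKey(int[] counts) {
        this.counts = counts;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof ArrayAnagramKey
            && Arrays.equals(counts, ((ArrayAnagramKey) o).counts);
    }

    @Override
    public int hashCode() {
        return Arrays.hashCode(counts);
    }
}

class ArrayAnagramKeyFactory implements AnagramKeyFactory {
    @Override
    public AnagramKey keyFor(String line) {
        int[] counts = new int[26];
        for (char c : line.toLowerCase().toCharArray()) {
            if (c >= 'a' && c <= 'z') {
                counts[c - 'a']++;
            }
        }
        return new ArrayAnagramKey(counts);
    }
}
```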

Notes:

  1. It is not clear that "anagrams" make sense in the context of non-alphabetic languages, or for utterances that mix multiple languages into a "sentence". Also, I don't know whether anagrams in (say) French ignore the accents on characters. At any rate, I would be tempted to rule all of these cases as "out of scope" ... unless you have an explicit requirement to support them.

  2. The break-even density at which an int[] uses less space than a HashMap<Character, Integer> is asymptotically around 1 character in 15 across the range of characters in your count array. (Each entry in a HashMap with those key/value types occupies in the region of 15 32-bit words.) And that doesn't take into account the overheads of the HashMap object itself and its hash array ...

  3. If you place limits on the length of the anagrams, you can save more space by using a short[] or even a byte[] for the character counts.

Stephen C