Given a large set of strings and an input string, you need to find all the anagrams of the input string efficiently. What data structure would you use, and how would you use it to find the anagrams?

These are the approaches I have thought of:

  1. Using maps

    a) Eliminate all words with more or fewer letters than the input.

    b) Put the input characters in a map, with a count for each letter.

    c) For each remaining string, check against the map that every letter is present with the same count.

  2. Using tries

    a) Put all strings that have the right number of characters into a trie.

    b) Traverse each branch and go deeper only if the letter is contained in the input.

    c) If a leaf is reached, the word is an anagram.

Can anyone find a better solution?

Are there any problems with the above approaches?

vgru
Kshitij Banerjee
  • Let's say you have a candidate for an anagram. You could try sorting both the input string and this string: they should be identical after sorting. Have you considered this approach? – user998692 Jan 23 '12 at 12:42
  • Sorting would cost additional time, while my approach above is linear without sorting. – Kshitij Banerjee Jan 23 '12 at 12:46
  • Say the average word length is in the range [3, 20] chars: you do a very limited number of comparisons when sorting a word. Also, once you have preprocessed the whole dictionary using a hashtable, each subsequent call to getAnagrams would be O(1), which is not true of the trie approach. – Savino Sguera Jan 23 '12 at 12:58
  • I don't see it. With your approach, each word you sort takes O(n log n) on average, and when n is large, log n is very large. In the trie approach, on the contrary, you only check whether each branch has the correct set of letters, in O(n). So the trie would be faster, wouldn't it? I do agree that it's O(1) once you've preprocessed the dictionary, but the question is dynamic: the input and the list of strings are given at runtime, so the structure has to be rebuilt for each problem set, and what matters is the overall efficiency. – Kshitij Banerjee Jan 23 '12 at 13:07
  • Step 2b is incomplete. Say the input is "slow": "wool" would also match. – MayankT Jul 14 '12 at 00:44
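
One way to close the gap raised in the last comment is to carry the remaining letter counts down the trie instead of only testing whether a letter occurs in the input. The following is a minimal sketch of that idea, not code from the question: the names `TrieAnagramSketch`, `Node`, `buildTrie`, `findAnagrams` and `walk` are all hypothetical, and it assumes the word list fits in memory.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of approach 2 with per-letter counts.
final class TrieAnagramSketch {
    // Minimal trie node; `word` is non-null where a stored word ends.
    static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        String word;
    }

    // Step 2a: only words of the right length go into the trie.
    static Node buildTrie(List<String> words, int length) {
        Node root = new Node();
        for (String w : words) {
            if (w.length() != length) {
                continue;
            }
            Node node = root;
            for (char c : w.toCharArray()) {
                node = node.children.computeIfAbsent(c, k -> new Node());
            }
            node.word = w;
        }
        return root;
    }

    static List<String> findAnagrams(List<String> words, String input) {
        // Count how many times each letter may still be used.
        Map<Character, Integer> remaining = new HashMap<>();
        for (char c : input.toCharArray()) {
            remaining.merge(c, 1, Integer::sum);
        }
        List<String> result = new ArrayList<>();
        walk(buildTrie(words, input.length()), remaining, result);
        return result;
    }

    // Steps 2b and 2c: descend only while the letter still has budget left.
    private static void walk(Node node, Map<Character, Integer> remaining, List<String> out) {
        if (node.word != null) {
            out.add(node.word);
        }
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            int left = remaining.getOrDefault(e.getKey(), 0);
            if (left == 0) {
                continue; // letter absent or already used up
            }
            remaining.put(e.getKey(), left - 1);
            walk(e.getValue(), remaining, out);
            remaining.put(e.getKey(), left); // restore before the next branch
        }
    }
}
```

With the input "slow", the branch for "wool" is abandoned at the second "o" because the single "o" has already been consumed. Note that the input itself is reported if it appears in the list.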

3 Answers

Build a frequency-map from each word and compare these maps.

Pseudo code:

class Word

  string word
  map<char, int> frequency

  Word(string w)
    word = w
    for char in word
      int count = frequency.get(char)
      if count == null
        count = 0
      count++
      frequency.put(char, count)

  boolean is_anagram_of(that)
    return this.frequency == that.frequency 
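
For readers who want something they can compile, here is one possible Java rendering of the pseudocode above. It is a sketch rather than the answerer's own code: the method names `isAnagramOf` and `text` are assumptions, and `Map.merge` stands in for the get/null-check/put sequence.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical Java version of the Word pseudocode.
final class Word {
    private final String word;
    private final Map<Character, Integer> frequency = new HashMap<>();

    Word(String w) {
        word = w;
        // Count each character; merge() starts at 1 and adds 1 on repeats.
        for (char c : w.toCharArray()) {
            frequency.merge(c, 1, Integer::sum);
        }
    }

    // Two words are anagrams when their frequency maps are equal.
    boolean isAnagramOf(Word that) {
        return frequency.equals(that.frequency);
    }

    String text() {
        return word;
    }
}
```

Map equality compares both keys and counts, so words of different lengths never compare equal and no separate length check is needed.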
Bart Kiers
  • That sounds good! But is it faster than the trie approach? Also, a trie would use less memory, since some letters will be shared. – Kshitij Banerjee Jan 23 '12 at 12:50
  • @KshitijBanerjee for a trie to work, you'd need to sort the characters and build the trie from those sorted chars (how else would you determine that `"mary"` and `"army"` are anagrams?) Sorting a word takes `O(n*log(n))` while building a hash-based map will take `O(n)`. – Bart Kiers Jan 23 '12 at 13:09
  • Why would I sort to make the trie? Consider mary as the input word and a list that contains army. I create a hashmap on mary, build a trie on all the words in the list, and traverse each branch. When the branch is army, I reach the leaf successfully and I have a match. No sorting needed, right? – Kshitij Banerjee Jan 23 '12 at 13:15
  • One benefit of your approach that I do see, though: in a trie I would only insert a word when its number of characters matches. So that is one O(n) pass to count the characters and another O(n) pass to see whether all the characters match, meaning O(2n), while your approach does it in a single O(n) pass. – Kshitij Banerjee Jan 23 '12 at 13:18
  • Ah, I see. You're walking the trie differently than I expected. – Bart Kiers Jan 23 '12 at 13:19
  • Most languages have native support for hash-based maps: that's why I'd choose that over a trie. Granted, a trie may use less space, but who cares about a bit of RAM? :) – Bart Kiers Jan 23 '12 at 13:22
  • @KshitijBanerjee Like Bart mentioned, wouldn't you need to sort the trie? Otherwise, how will you search in it? As you said, if it reaches the leaf node then it's an anagram, which means the list is entered into the trie after sorting. Please explain if I am getting it wrong. – h4ck3d Jul 19 '12 at 14:53

You could build a hashmap where the key is sorted(word) and the value is a list of all the words that, once sorted, give that key:

private Map<String, List<String>> anagrams = new HashMap<String, List<String>>();

void buildIndex(){
    for(String word : words){
        String sortedWord = sortWord(word);
        if(!anagrams.containsKey(sortedWord)){
            anagrams.put(sortedWord, new ArrayList<String>());
        }
        anagrams.get(sortedWord).add(word);
    }
}

Then you just do a lookup for the sorted word in the hashmap you just built, and you'll have the list of all the anagrams.
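
The snippet above leaves `words` and `sortWord` undefined, so here is a self-contained sketch of the same index together with the lookup step. The class name `AnagramIndex` and the method `getAnagrams` are assumptions made for illustration, and `sortWord` is implemented with a plain `Arrays.sort` on the characters, which is only one possible choice.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical wrapper around the sorted-key index.
final class AnagramIndex {
    private final Map<String, List<String>> anagrams = new HashMap<>();

    // Build the index once: every word is filed under its sorted letters.
    AnagramIndex(Iterable<String> words) {
        for (String word : words) {
            anagrams.computeIfAbsent(sortWord(word), k -> new ArrayList<>()).add(word);
        }
    }

    // Assumed implementation of sortWord: sort the characters of the word.
    static String sortWord(String word) {
        char[] chars = word.toCharArray();
        Arrays.sort(chars);
        return new String(chars);
    }

    // Lookup: sort the input and fetch the bucket stored under that key.
    List<String> getAnagrams(String input) {
        return anagrams.getOrDefault(sortWord(input), Collections.emptyList());
    }
}
```

A call such as `new AnagramIndex(Arrays.asList("army", "mary", "wool")).getAnagrams("yarm")` returns the bucket holding "army" and "mary". The sort cost is paid once per word while building the index and once per query.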

Savino Sguera
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
/*
 * Finds anagrams in a given array of strings.
 *
 * With a comparison sort on each word, the running time is O(n * k log k),
 * where n is the number of words and k is the length of a word.
 *
 * Replacing the sort with a counting sort reduces this to O(n * k).
 */
public class FindAnagramsOptimized {
    public static void main(String[] args) {
        String[] words = { "gOd", "doG", "doll", "llod", "lold", "life",
                "sandesh", "101", "011", "110" };
        System.out.println(getAnaGram(words));
    }

    // Space complexity: O(K), where K is the number of unique words after
    // sorting on characters.
    // Time complexity: O(n * k) for n words of length k.
    static Set<String> getAnaGram(String[] allWords) {
        // Internal data structure: how many words share a key, and the index
        // in allWords of the first word that produced it.
        class OriginalOccurence {
            int occurence;
            int index;
        }
        Map<String, OriginalOccurence> mapOfOccurence = new HashMap<>();

        // One pass over all words. The loop index is stored directly, so it
        // stays correct even when earlier words are skipped.
        for (int i = 0; i < allWords.length; i++) {
            String key = sortedWord(allWords[i]);

            // Words containing non-letters are ignored.
            if (key == null) {
                continue;
            }
            OriginalOccurence original = mapOfOccurence.get(key);
            if (original == null) {
                original = new OriginalOccurence();
                original.index = i;
                original.occurence = 1;
                mapOfOccurence.put(key, original);
            } else {
                original.occurence += 1;
            }
        }

        Set<String> finalAnagrams = new HashSet<>();

        // Loop works in O(K): keep the first word of every key that was
        // seen more than once.
        for (Map.Entry<String, OriginalOccurence> entry : mapOfOccurence.entrySet()) {
            if (entry.getValue().occurence > 1) {
                finalAnagrams.add(allWords[entry.getValue().index]);
            }
        }

        return finalAnagrams;
    }

    // Arrays.sort on the characters would take O(k log k); this counting
    // sort over the 26 letters takes O(k).
    // Returns null if the word contains anything other than A-Z or a-z.
    private static String sortedWord(String word) {
        int[] letterCounts = new int[26];
        // Case is ignored by mapping both ranges onto the same 26 slots.
        for (char character : word.toCharArray()) {
            if (character >= 'a' && character <= 'z') {
                letterCounts[character - 'a']++;
            } else if (character >= 'A' && character <= 'Z') {
                letterCounts[character - 'A']++;
            } else {
                return null;
            }
        }

        // Read the counts back in alphabetical order to build the sorted key.
        StringBuilder sortedWord = new StringBuilder();
        for (int i = 0; i < letterCounts.length; i++) {
            for (int j = 0; j < letterCounts[i]; j++) {
                sortedWord.append((char) ('a' + i));
            }
        }
        return sortedWord.toString();
    }
}
BITSSANDESH