I am reading the contents of a txt file into a HashSet. The file contains almost every word in the English language, and each word becomes a String in the HashSet.

In my app, characters are being added to a String. I want to check whether this String is, or can become equal to, any of the Strings in the HashSet. That is, say the HashSet contains only the String apple. I have a String appl, and now I want to filter out the HashSet until it becomes a set with only Strings that start with appl (in this case a set with only apple).

I can iterate over the entire HashSet and use the startsWith(String) method, as I build a new filtered HashSet. But my initial HashSet is very big, so my question is: Is there a more efficient way to do this (perhaps using a different type of Collection?)

Here is how I would do it right now:

private HashSet<String> filter(String partOfWord){
    HashSet<String> filteredSet = new HashSet<>();

    for (String word : dictionary) { // dictionary is the full HashSet
        if (word.startsWith(partOfWord)) {
            filteredSet.add(word);
        }
    }
    return filteredSet;
}
Chronicle
  • You might want to look at a Trie (https://en.wikipedia.org/wiki/Trie). There's nothing built-in in Java, but there are plenty of open source implementations. – Ori Lentz Sep 30 '15 at 12:58
  • There are predefined libraries which you can use to solve this problem using Trie datastructures. For reference, use the [link](http://stackoverflow.com/questions/3806788/trie-data-structures-java) – Santhosh Tangudu Sep 30 '15 at 13:01

1 Answer

A trie is the ultimate weapon of doom for this task, but you can get good efficiency out of a TreeSet:

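private TreeSet<String> dictionary; // the full word list, kept in sorted order

private TreeSet<String> filter(String partOfWord) {
    // Half-open range: partOfWord (inclusive) up to partOfWord + "zzz" (exclusive).
    // subSet() is declared to return SortedSet, hence the cast.
    return (TreeSet<String>) dictionary.subSet(partOfWord, partOfWord + "zzz");
}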

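Everything that starts with "appl" lies between "appl" (inclusive, in case it is a word itself) and "applzzz" (exclusive). The upper bound is lexicographically greater than every word that starts with "appl", because no English word has 3 consecutive "z"'s in it; this assumes the dictionary contains only lowercase a-z words, since characters that sort after "z" (accented letters, for example) would fall outside the range. subSet() returns a view, so locating the start of the range costs O(log n) and iterating over the matches costs O(m) (m = number returned), which is pretty good.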
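As a self-contained illustration of the range idea, here is a minimal sketch. It is not the code above: it swaps the "zzz" upper bound for `Character.MAX_VALUE` (an assumption that no word contains the character U+FFFF) and returns a `SortedSet` so no cast is needed. The class name `PrefixFilterDemo` and the sample words are made up for the example.

```java
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

class PrefixFilterDemo {
    // Returns a view of every word in the dictionary that starts with prefix.
    static SortedSet<String> filter(TreeSet<String> dictionary, String prefix) {
        // Assumption: no word contains '\uffff', so prefix + '\uffff' sorts
        // after every word that begins with prefix.
        return dictionary.subSet(prefix, prefix + Character.MAX_VALUE);
    }

    public static void main(String[] args) {
        TreeSet<String> dictionary =
                new TreeSet<>(List.of("apple", "apply", "apt", "banana"));
        System.out.println(filter(dictionary, "appl")); // prints [apple, apply]
    }
}
```

Because the result is a view, nothing is copied; the trade-off is that later changes to the backing dictionary show through.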

Note that if you are able to reuse the returned set as your new dictionary as your word grows, you will have much more efficient code overall.
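A rough sketch of that reuse, assuming characters are only ever appended (narrowing a view to a range outside its current bounds throws an IllegalArgumentException, so a deleted character would mean starting again from the full dictionary). The names `NarrowingDemo` and `narrow` are hypothetical, and the upper bound uses `Character.MAX_VALUE` rather than "zzz".

```java
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

class NarrowingDemo {
    // Narrows the current candidates to those starting with the longer prefix.
    static SortedSet<String> narrow(SortedSet<String> candidates, String prefix) {
        // Assumption: no word contains '\uffff'.
        return candidates.subSet(prefix, prefix + Character.MAX_VALUE);
    }

    public static void main(String[] args) {
        SortedSet<String> candidates =
                new TreeSet<>(List.of("ape", "apple", "apply", "apt", "banana"));
        StringBuilder typed = new StringBuilder();
        for (char c : "appl".toCharArray()) {
            typed.append(c);
            // Each step searches only the previous result, not the full dictionary.
            candidates = narrow(candidates, typed.toString());
            System.out.println(typed + " -> " + candidates);
        }
        // The last line printed is: appl -> [apple, apply]
    }
}
```

An empty result means the typed string can no longer become a word.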

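The cast to TreeSet<String> is needed because subSet() is a method of the SortedSet interface and is declared to return a SortedSet; TreeSet does not narrow that return type, so this is not covariance. The cast succeeds in practice because the OpenJDK TreeSet implementation returns a view (another efficiency benefit) that is itself another TreeSet, but that is an implementation detail rather than something the API guarantees. Declaring the method's return type as SortedSet<String> avoids the cast altogether.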

For improved efficiency, but uglier code, you could use a sorted String[] and Arrays.binarySearch(); once you have located your hit, you can quickly iterate along the array, collecting your hits.
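One way the array approach might look, as a sketch only. It assumes the array is sorted in natural String order and has no duplicates (true if it was built from a set). When the prefix is not itself a word, Arrays.binarySearch() returns -(insertion point) - 1, so the code converts that back to the insertion point. The class name `SortedArrayPrefixDemo` is made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SortedArrayPrefixDemo {
    // words must be sorted in natural order and contain no duplicates.
    static List<String> filter(String[] words, String prefix) {
        int pos = Arrays.binarySearch(words, prefix);
        // Found: pos is the index of the prefix itself.
        // Not found: pos == -(insertionPoint) - 1, so recover the insertion point.
        int start = pos >= 0 ? pos : -pos - 1;
        List<String> hits = new ArrayList<>();
        // Matching words are contiguous, so stop at the first non-match.
        for (int i = start; i < words.length && words[i].startsWith(prefix); i++) {
            hits.add(words[i]);
        }
        return hits;
    }

    public static void main(String[] args) {
        String[] words = {"ape", "apple", "apply", "apt", "banana"};
        System.out.println(filter(words, "appl")); // prints [apple, apply]
    }
}
```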

Note that both the TreeSet and the sorted array have O(log n) look-up time, whereas a HashSet (although unsuitable for this task) has O(1) expected look-up time.
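For comparison, the trie mentioned at the top could be sketched roughly as below. This is a bare-bones illustration, not a library API: the class `SimpleTrie` and its methods `add`, `isPrefix` and `isWord` are invented for the example. Each check costs time proportional to the length of the typed string, regardless of dictionary size.

```java
import java.util.HashMap;
import java.util.Map;

class SimpleTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord; // true if a word ends at this node
    }

    private final Node root = new Node();

    // Inserts a word, creating one node per character as needed.
    void add(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.isWord = true;
    }

    // Walks the trie along s; returns null if no word starts with s.
    private Node find(String s) {
        Node node = root;
        for (char c : s.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return null;
            }
        }
        return node;
    }

    // True if at least one stored word starts with s (including s itself).
    boolean isPrefix(String s) {
        return find(s) != null;
    }

    // True if s is exactly a stored word.
    boolean isWord(String s) {
        Node node = find(s);
        return node != null && node.isWord;
    }
}
```

With only "apple" added, isPrefix("appl") is true while isWord("appl") is false, which matches the "is, or can become, a word" check in the question.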

Bohemian
  • If `startsWith` were to be a TreeSet as well, I could call this code again on the filtered sets, yes? – Chronicle Sep 30 '15 at 13:11
  • @Chronicle `startsWith()` is a `String` method - `TreeSet` doesn't have relevance to it. – Bohemian Sep 30 '15 at 13:16
  • I meant in your code, you created a new `Set` called `startsWith`. – Chronicle Sep 30 '15 at 13:16
  • @Chronicle ahh!. See edit. But unless you need to do something similar, any set would do. Although as your word length grows, you could limit your dictionary to the last filtered set returned, which would be very efficient :) – Bohemian Sep 30 '15 at 13:19
  • That's the idea! Excellent answer. By the way, you forgot to replace the variable `stem` in your code with `partOfWord` after your edits. – Chronicle Sep 30 '15 at 13:26
  • @Chronicle OK - thx. I would write two versions - one like the code posted, and another where you pass in the dictionary last returned. Also, note an even *better* version now!!! – Bohemian Sep 30 '15 at 13:33
  • +1, by the way, subset() returned a SortedSet not a TreeSet. Maybe I am missing something, but for the array based solution, if hit is not in the array, will it still work? – dragon66 Sep 30 '15 at 13:48
  • @dragon66 cast added to code. Note also discussion of why cast is needed and safe. – Bohemian Sep 30 '15 at 13:58
  • @Bohemian: thanks. Given the implementation detail of the TreeSet subset(), it is a safe cast to TreeSet but say it's covariant, I am a little bit suspicious unless you don't need the cast. – dragon66 Sep 30 '15 at 20:48