
I have the following problem. I have to store a list of unique words in multiple languages in memory, and when I add a new word I have to check whether it already exists.

Of course this needs to be blazingly fast, primarily because of the huge number of words.

I was thinking about implementing a Suffix Tree, but I wondered whether there is an easier approach with some already implemented internal structures.

P.S. Number of words ≈ 10^7.

Itay Grudev
  • If you don't need to search for suffixes, a trie will be enough; there is no need for a suffix tree. Also, a HashSet can give you pretty good performance for less effort, so benchmark it and see if it's enough for you before implementing something more sophisticated. (But make sure the data structure is easy to swap out, so that if you discover in a few months that a hash set is not enough, you can replace it without rewriting the entire project.) – amit Aug 31 '14 at 10:35
  • To put things into perspective, checking whether a HashSet with 10^7 items contains a certain word takes about 250 ns on average (insertion takes about 300 ns, so more or less the same) on my i7 laptop. This means you can do roughly 4 million such operations per second on a single thread. – vgru Sep 11 '14 at 07:56
  • I ended up using a Dictionary. Like HashSet, it is backed by a hash table, so the performance was good (almost the same as a HashSet). – Itay Grudev Sep 11 '14 at 08:20

1 Answer


First, note that a suffix tree is probably overkill here: it allows fast search for any suffix of any word, which is more than you are looking for. A trie is a very similar data structure that also allows fast lookup of a whole word, but since it does not support fast suffix search, it is simpler to build, both in programming effort and in construction cost.
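
To make the trie idea concrete, here is a minimal sketch of one that supports only insertion and whole-word lookup. The class name `WordTrie` and its members are hypothetical, and keying children by UTF-16 `char` in a `Dictionary` is an assumption made for brevity, not a tuned design.

```csharp
using System.Collections.Generic;

// Minimal trie sketch: one node per character, with a flag marking word ends.
// WordTrie is an illustrative name, not an existing library type.
public sealed class WordTrie
{
    private sealed class Node
    {
        // Children keyed by UTF-16 code unit (assumption: fine for lookups by exact string).
        public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();
        public bool IsWord;
    }

    private readonly Node _root = new Node();

    // Returns true if the word was newly added, false if it was already present.
    public bool Add(string word)
    {
        Node node = _root;
        foreach (char c in word)
        {
            if (!node.Children.TryGetValue(c, out Node next))
            {
                next = new Node();
                node.Children.Add(c, next);
            }
            node = next;
        }
        if (node.IsWord) return false;
        node.IsWord = true;
        return true;
    }

    // Returns true only if the exact word was added before (prefixes do not count).
    public bool Contains(string word)
    {
        Node node = _root;
        foreach (char c in word)
        {
            if (!node.Children.TryGetValue(c, out node)) return false;
        }
        return node.IsWord;
    }
}
```

Each operation walks one node per character, so its cost depends on the word length rather than on the number of stored words. The per-node dictionaries cost memory, which may matter at 10^7 words.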

Another, simpler alternative is a plain hash table, which is available in C# as HashSet<T>. While a HashSet is in theory slower in the worst case, the average lookup takes constant time, which may well be enough for your application.
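
As a rough sketch of the hash table route: `HashSet<T>.Add` returns `false` when the item is already present, so the existence check and the insertion can be a single call. The helper class `WordSetSketch` and its method names are hypothetical, and `StringComparer.Ordinal` is an assumption (exact, culture-independent matching); pick a different comparer if the words should match case-insensitively.

```csharp
using System;
using System.Collections.Generic;

// Illustrative helper around HashSet<string>; the names here are hypothetical.
public static class WordSetSketch
{
    // Assumption: ordinal comparison, i.e. "Word" and "word" are different entries.
    public static HashSet<string> Create()
    {
        return new HashSet<string>(StringComparer.Ordinal);
    }

    // Adds every word and returns how many were actually new.
    // HashSet<T>.Add returns false for duplicates, so no separate Contains call is needed.
    public static int AddAll(HashSet<string> set, IEnumerable<string> words)
    {
        int added = 0;
        foreach (string word in words)
        {
            if (set.Add(word)) added++;
        }
        return added;
    }
}
```

If the approximate number of words is known up front, the set can also be pre-sized to avoid repeated resizing while loading; whether that helps measurably is something to benchmark.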

My suggestion is:

  1. First try a HashSet, which requires much less effort to implement. Benchmark it and check whether it is fast enough.
  2. Make sure your data structure is easy to swap out, so you can replace it with very little effort if you later decide to. This is usually done by introducing an interface that is responsible for adding and looking up words; if you need a change, just introduce a different implementation of that interface.
  3. If you do decide to use a suffix tree or a trie, use community resources. There is no need to reinvent the wheel: someone has already implemented most of these data structures, and they are available online.
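
Point 2 above could look roughly like the following. The names `IWordStore` and `HashSetWordStore` are hypothetical; the sketch only shows the shape of the abstraction, with the HashSet-backed version as the first implementation to benchmark.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical abstraction: the rest of the code depends only on this interface.
public interface IWordStore
{
    // Returns true if the word was newly added, false if it already existed.
    bool Add(string word);
    bool Contains(string word);
    int Count { get; }
}

// First implementation to try: a thin wrapper over HashSet<string>.
public sealed class HashSetWordStore : IWordStore
{
    // Assumption: ordinal (exact) string comparison.
    private readonly HashSet<string> _words = new HashSet<string>(StringComparer.Ordinal);

    public bool Add(string word) { return _words.Add(word); }
    public bool Contains(string word) { return _words.Contains(word); }
    public int Count { get { return _words.Count; } }
}
```

A trie-backed or other implementation can later implement the same interface, so callers that hold an `IWordStore` do not need to change.
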
amit