I have a std::vector<std::string> that has 43,000 dictionary words. I have about 315,000 maybe-words and for each one I need to determine if it's a valid word or not. This takes a few seconds and I need to complete the task as fast as possible.

Any ideas on the best way to complete this? Currently I iterate through on each attempt:

for (const std::string& word : words) {  // iterate by reference to avoid copying each string
    if (std::find(dictionary.begin(), dictionary.end(), word) == dictionary.end()) {
        // The word is not in the dictionary
        return false;
    }
}
return true;

Is there a better way to iterate multiple times? I have a few hypotheses, such as

  • Create a cache of invalid words, since the 315,000 list probably has 25% duplicates
  • Only compare with words of the same length

Is there a better way to do this? I'm interested in an algorithm or idea.

CJ Coding
  • 3
    Maybe you want to use a `std::set` or `std::unordered_set` instead of a vector. – drescherjm Nov 22 '21 at 15:29
  • 5
    Put the values in `std::unordered_set` then they will be unique plus have O(1) lookup on average. – Cory Kramer Nov 22 '21 at 15:29
  • 2
    Probably should use `set` or `unordered_set` instead of `vector`. – silverfox Nov 22 '21 at 15:29
  • 5
    Can you sort your `vector` (for binary search)? Or use another container ([`std::unordered_set`](https://en.cppreference.com/w/cpp/container/unordered_set), [trie](https://en.wikipedia.org/wiki/Trie), ...) – Jarod42 Nov 22 '21 at 15:29
  • 1
    Sort the vector and then do a binary search, or use `std::set`, or use `std::unordered_set`. – molbdnilo Nov 22 '21 at 15:30
  • 1
    Also, `for (const std::string& word : words)`. You don't want to copy 315000 strings. – molbdnilo Nov 22 '21 at 15:31
  • 2
    Can you use a **sorted** `std::vector`? A binary search in a sorted std::vector will probably be good performance. Even better is to sort **both** of them, and then iterate through both vectors and mark the maybe-words as good-or-bad in some sort of smart single-iteration through both sorted lists. – Eljay Nov 22 '21 at 15:31
  • 1
    Investigate this for your vector: [std::sort](https://en.cppreference.com/w/cpp/algorithm/sort), [std::lower_bound](https://en.cppreference.com/w/cpp/algorithm/lower_bound), [std::upper_bound](https://en.cppreference.com/w/cpp/algorithm/upper_bound) – PaulMcKenzie Nov 22 '21 at 15:37
  • I'd expect a trie as proposed by @Jarod42 being fastest, followed by hashing (`std::unordered_set`). You might want to profile both approaches to make a final decision... – Aconcagua Nov 22 '21 at 15:53
  • May I ask where those 315,000 maybe-words come from? Are you generating those from a single word perhaps? You may be able to do something more efficient, e.g. by avoiding actually storing those maybe-words. – cigien Nov 22 '21 at 18:13
  • 1
    Consider using a "perfect hash" if the dictionary is static. This will give the fastest lookup and you only need to verify the string. Use it in the `std::unordered_set` per suggestions and answer. https://stackoverflow.com/questions/27694153/perfect-hash-function-for-strings-known-in-advance – doug Nov 22 '21 at 19:13

1 Answer

Is there a better way to iterate multiple times?

Yes. Convert the vector to another data structure that supports faster lookups. The standard library comes with std::set (O(log n) lookup) and std::unordered_set (O(1) lookup on average), both of which are likely to be much faster than repeated linear search. Other data structures may be even more efficient.
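As a rough sketch of the hash-set approach, assuming the dictionary and the maybe-words are both held in a std::vector<std::string> as in the question (the function name all_words_valid is made up for illustration):

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper: returns true only if every candidate word is in the dictionary.
bool all_words_valid(const std::vector<std::string>& dictionary,
                     const std::vector<std::string>& words) {
    // Build the hash set once; each later lookup is O(1) on average.
    const std::unordered_set<std::string> lookup(dictionary.begin(), dictionary.end());

    for (const std::string& word : words) {  // by reference: no string copies
        if (lookup.find(word) == lookup.end()) {
            return false;  // the word is not in the dictionary
        }
    }
    return true;
}
```

If the dictionary is reused across many calls, it is probably worth building the set once and keeping it around rather than rebuilding it inside the function. Profile before committing to either container.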

If your goal is to create a range of words or non-words in the maybe set, then another efficient approach would be to sort both vectors, and use std::(ranges::)set_intersection or std::(ranges::)set_difference.
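A minimal sketch of the sorted-range approach, here collecting the non-words with std::set_difference (the function name invalid_words is hypothetical, and removing duplicate maybe-words first is an assumption about what is wanted):

```cpp
#include <algorithm>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical helper: returns the distinct candidate words that are not in the dictionary.
// Both vectors are taken by value because they are sorted in place.
std::vector<std::string> invalid_words(std::vector<std::string> dictionary,
                                       std::vector<std::string> words) {
    std::sort(dictionary.begin(), dictionary.end());
    std::sort(words.begin(), words.end());

    // Assumption: duplicates among the candidates are not interesting, so drop them.
    words.erase(std::unique(words.begin(), words.end()), words.end());

    // Single pass over both sorted ranges: keep what is in `words` but not in `dictionary`.
    std::vector<std::string> missing;
    std::set_difference(words.begin(), words.end(),
                        dictionary.begin(), dictionary.end(),
                        std::back_inserter(missing));
    return missing;
}
```

Swapping std::set_difference for std::set_intersection would give the valid words instead. The result comes back in sorted order, so this fits best when the original order of the maybe-words does not matter.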

eerorika
  • Good answer. As an aside, if you wanted to apply something to each matching or non-matching string from the second paragraph, you could create a custom insert_operator. – doug Nov 22 '21 at 22:26