How to quickly search a large vector many times?

Question

I have a std::vector<std::string> that has 43,000 dictionary words. I have about 315,000 maybe-words and for each one I need to determine if it's a valid word or not. This takes a few seconds and I need to complete the task as fast as possible.

Any ideas on the best way to complete this? Currently I iterate through on each attempt:

for (std::string word : words) {
    if (!(std::find(dictionary.begin(), dictionary.end(), word) != dictionary.end())) {
        // The word is not in the dictionary
        return false;
    }
}
return true;

Is there a better way to iterate multiple times? I have a few hypotheses, such as:

Create a cache of invalid words, since the 315,000 list probably has 25% duplicates
Only compare with words of the same length

Is there a better way to do this? I'm interested in an algorithm or idea.

Maybe you want to use a `std::set` or `std::unordered_set` instead of a vector. — drescherjm, Nov 22 '21 at 15:29
Put the values in `std::unordered_set` then they will be unique plus have O(1) lookup on average. — Cory Kramer, Nov 22 '21 at 15:29
Probably should use `set` or `unordered_set` instead of `vector`. — silverfox, Nov 22 '21 at 15:29
Can you sort your `vector` (for binary search)? Or use another container ([`std::unordered_set`](https://en.cppreference.com/w/cpp/container/unordered_set), [trie](https://en.wikipedia.org/wiki/Trie), ...) — Jarod42, Nov 22 '21 at 15:29
Sort the vector and then do a binary search, or use `std::set`, or use `std::unordered_set`. — molbdnilo, Nov 22 '21 at 15:30
Also, `for (const std::string& word : words)`. You don't want to copy 315000 strings. — molbdnilo, Nov 22 '21 at 15:31
Can you use a **sorted** `std::vector`? A binary search in a sorted std::vector will probably be good performance. Even better is to sort **both** of them, and then iterate through both vectors and mark the maybe-words as good-or-bad in some sort of smart single-iteration through both sorted lists. — Eljay, Nov 22 '21 at 15:31
Investigate this for your vector: [std::sort](https://en.cppreference.com/w/cpp/algorithm/sort), [std::lower_bound](https://en.cppreference.com/w/cpp/algorithm/lower_bound), [std::upper_bound](https://en.cppreference.com/w/cpp/algorithm/upper_bound) — PaulMcKenzie, Nov 22 '21 at 15:37
I'd expect a trie as proposed by @Jarod42 being fastest, followed by hashing (`std::unordered_set`). You might want to profile both approaches to make a final decision... — Aconcagua, Nov 22 '21 at 15:53
May I ask where those 315,000 maybe-words come from? Are you generating those from a single word perhaps? You may be able to do something more efficient, e.g. by avoiding actually storing those maybe-words. — cigien, Nov 22 '21 at 18:13
Consider using a "perfect hash" if the dictionary is static. This will give the fastest lookup and you only need to verify the string. Use it in the `std::unordered_set` per suggestions and answer. https://stackoverflow.com/questions/27694153/perfect-hash-function-for-strings-known-in-advance — doug, Nov 22 '21 at 19:13

eerorika · Accepted Answer · 2021-11-22T16:08:30.200

3

Is there a better way to iterate multiple times?

Yes. Convert the vector to another data structure that supports faster lookups. The standard library comes with std::set (O(log n) lookup) and std::unordered_set (O(1) lookup on average), both of which are likely to be faster than repeated linear search. Other data structures may be even more efficient.
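
As a rough illustration of the std::unordered_set route, here is a minimal sketch. The helper names make_lookup and all_valid are made up for this example and do not come from the question; the vectors are assumed to hold the dictionary and the maybe-words as in the question. The set is built once, then every maybe-word is checked with an average constant-time lookup.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper: build the lookup table once from the dictionary vector.
std::unordered_set<std::string> make_lookup(const std::vector<std::string>& dictionary) {
    return std::unordered_set<std::string>(dictionary.begin(), dictionary.end());
}

// Hypothetical helper: returns true only if every candidate is in the
// dictionary, mirroring the loop in the question.
bool all_valid(const std::unordered_set<std::string>& lookup,
               const std::vector<std::string>& words) {
    for (const std::string& word : words) {  // iterate by reference, no copies
        if (lookup.find(word) == lookup.end()) {
            return false;  // the word is not in the dictionary
        }
    }
    return true;
}
```

Building the set costs one pass over the 43,000 words. After that, each of the 315,000 checks is a single hash lookup instead of a scan of the whole vector. Whether this beats a sorted vector or a trie on your data is something to measure rather than assume.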

If your goal is to create a range of words or non-words in the maybe set, then another efficient approach would be to sort both vectors, and use std::(ranges::)set_intersection or std::(ranges::)set_difference.
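
A minimal sketch of the sort-based approach, assuming you want the list of maybe-words that are not in the dictionary. The function name invalid_words is invented for this example, and the deduplication step is an optional addition, not something the approach requires.

```cpp
#include <algorithm>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical helper: returns the candidates that are NOT in the dictionary,
// sorted and deduplicated. Both vectors are taken by value because they are
// sorted locally.
std::vector<std::string> invalid_words(std::vector<std::string> dictionary,
                                       std::vector<std::string> words) {
    // std::set_difference requires both input ranges to be sorted.
    std::sort(dictionary.begin(), dictionary.end());
    std::sort(words.begin(), words.end());

    // Optional: drop duplicate candidates so each one is reported once.
    words.erase(std::unique(words.begin(), words.end()), words.end());

    std::vector<std::string> missing;
    std::set_difference(words.begin(), words.end(),
                        dictionary.begin(), dictionary.end(),
                        std::back_inserter(missing));
    return missing;
}
```

Swapping std::set_difference for std::set_intersection would give the valid words instead. If the dictionary never changes, it only needs to be sorted once up front.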

edited Nov 22 '21 at 16:08

answered Nov 22 '21 at 16:02

eerorika


Good answer. As an aside, if you wanted to apply something to each matching or non-matching string in the approach from the second paragraph, you could create a custom insert_operator. – doug Nov 22 '21 at 22:26
