Fast string search?

Question

I have a vector of strings and have to check whether each element of the vector is present in a given list of 5000 words. Besides the mundane method of two nested loops, is there a faster way to do this in C++?

Is it an option to populate an associative container in the first place rather than a list? — Andy Prowl, Feb 05 '13 at 21:10
Is it a possibility to sort the list of 5000 words? If yes, then on the sorted list you can binary search for the strings in the vector. — Satyajit, Feb 05 '13 at 21:12
Do you want the string to match the *entirety* of one in your set, or is it sufficient that the one in the set *contain* the one you're looking for? — Jerry Coffin, Feb 05 '13 at 21:12
sure, i am just a beginner in c++ and familiar with vectors only. what do you suggest? — ofey, Feb 05 '13 at 21:13
@JerryCoffin: i am supposed to print all words in the vector which are found in the 5000 word dictionary. — ofey, Feb 05 '13 at 21:14

Philipp · Accepted Answer · 2013-02-05T21:18:53.090

11

You should put the list of strings into a std::set. It's a data structure optimized for searching: checking whether a given element is in the set takes O(log n) comparisons, which is much faster than iterating over all entries.

If you are already using C++11, you can also use std::unordered_set, which is usually even faster for lookup because it's implemented as a hash table.
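A minimal sketch of the std::set approach, assuming the dictionary and the input both arrive as std::vector<std::string>. The function name find_in_set and its parameter names are made up for illustration; the range-based for loop needs C++11, although std::set itself does not.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical helper: returns the elements of `words` that also appear in
// `dictionary`, in the order they occur in `words`.
std::vector<std::string> find_in_set(const std::vector<std::string>& words,
                                     const std::vector<std::string>& dictionary) {
    // Build the set once: O(m log m) for m dictionary words.
    const std::set<std::string> lookup(dictionary.begin(), dictionary.end());

    std::vector<std::string> found;
    for (const std::string& w : words) {
        // Each lookup costs O(log m) string comparisons.
        if (lookup.count(w) != 0) {
            found.push_back(w);
        }
    }
    return found;
}
```

Building the set costs more than a single scan of the list, so this pays off when many lookups are made against the same dictionary, as in the question.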

Should this be for school/university: Be prepared to explain how these data structures manage to be faster. When your instructor asks you to explain why you used them, "some guys on the internet told me" is unlikely to earn you a sticker in the class book.

edited Feb 05 '13 at 21:18

answered Feb 05 '13 at 21:11

Philipp


haha, no, would have mentioned it if this was for school. this was part of my code for a usaco problem. – ofey Feb 05 '13 at 21:20

score 3 · Answer 2 · answered Feb 05 '13 at 21:13


You could put the list of words in a std::unordered_set. Then, for each element in the vector, you just have to test whether it is in the unordered_set, which takes expected O(1) time per lookup. For a vector of n elements, the total expected complexity is O(n) (see the comment below for why it is only expected).
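As a rough illustration of this answer, here is a sketch that assumes a C++11 compiler and that both the input and the dictionary are std::vector<std::string>. The name find_in_hash_set is hypothetical.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper: returns the elements of `words` that also appear in
// `dictionary`, in the order they occur in `words`.
std::vector<std::string> find_in_hash_set(const std::vector<std::string>& words,
                                          const std::vector<std::string>& dictionary) {
    // Hash every dictionary word once; expected O(m) for m words.
    const std::unordered_set<std::string> lookup(dictionary.begin(), dictionary.end());

    std::vector<std::string> found;
    for (const std::string& w : words) {
        // find() hashes w and compares it against the entries in one bucket:
        // expected O(1) with respect to the dictionary size.
        if (lookup.find(w) != lookup.end()) {
            found.push_back(w);
        }
    }
    return found;
}
```

Each lookup still has to hash the string and compare it at least once, so the cost grows with the length of the word even though it does not grow with the number of dictionary entries.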


Baptiste Wicht


2

That's not quite the truth. The hash of each string has to be calculated, and the strings have to be compared at least once. Each of those is independent of the total number of strings (in the expected case), but it's worth mentioning. And while the worst case is extremely unlikely, it's good style to remain correct and say that the *expected* time is O(1). – Feb 05 '13 at 21:15
1

You're completely right. I changed my answer accordingly. Thank you. – Baptiste Wicht Feb 05 '13 at 21:16

score 2 · Answer 3 · answered Feb 05 '13 at 21:23


You could sort the vector; then you can solve this with one "loop" (provided your dictionary is sorted too). That is a single linear pass over both, O(n + m) for n vector elements and m dictionary words, not counting the cost of the sort.
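One way the single loop could look, as a sketch only. The function name find_by_merge is invented here, and it sorts copies of both inputs itself so that the example does not depend on the caller having sorted them.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: takes both containers by value, sorts the copies, and
// walks them in lockstep. Returns matching words in sorted order.
std::vector<std::string> find_by_merge(std::vector<std::string> words,
                                       std::vector<std::string> dictionary) {
    std::sort(words.begin(), words.end());
    std::sort(dictionary.begin(), dictionary.end());

    std::vector<std::string> found;
    std::size_t i = 0;  // position in words
    std::size_t j = 0;  // position in dictionary
    while (i < words.size() && j < dictionary.size()) {
        if (words[i] < dictionary[j]) {
            ++i;  // words[i] is smaller than every remaining dictionary word
        } else if (dictionary[j] < words[i]) {
            ++j;  // dictionary[j] cannot match any remaining word
        } else {
            found.push_back(words[i]);
            ++i;  // keep j so a repeated word in the vector matches again
        }
    }
    return found;
}
```

Unlike the set-based approaches, the matches come out in sorted order rather than in the original order of the vector.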


hege


score 2 · Answer 4 · answered Feb 05 '13 at 21:46

So you have a vector of strings, with each string having one or more words, and you have a vector that's a dictionary, and you're supposed to determine which words in the vector of strings are also in the dictionary? The vector of strings is an annoyance, since you need to look at each word. I'd start by creating a new vector, splitting each string into words, and pushing each word into the new vector. Then sort the new vector and run it through the std::unique algorithm to eliminate duplicates. Then sort the dictionary. Then run both ranges through std::set_intersection to write the result.
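The steps described above might be sketched as follows. This assumes words are separated by whitespace, which the question does not state, and the name words_in_dictionary is hypothetical.

```cpp
#include <algorithm>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: splits each input string into words, removes
// duplicates, and intersects the result with the dictionary.
std::vector<std::string> words_in_dictionary(const std::vector<std::string>& lines,
                                             std::vector<std::string> dictionary) {
    // Split every string on whitespace (an assumption) into one flat vector.
    std::vector<std::string> words;
    for (const std::string& line : lines) {
        std::istringstream in(line);
        std::string w;
        while (in >> w) {
            words.push_back(w);
        }
    }

    // std::unique only removes adjacent duplicates, so sort first.
    std::sort(words.begin(), words.end());
    words.erase(std::unique(words.begin(), words.end()), words.end());

    // std::set_intersection requires both ranges to be sorted.
    std::sort(dictionary.begin(), dictionary.end());

    std::vector<std::string> found;
    std::set_intersection(words.begin(), words.end(),
                          dictionary.begin(), dictionary.end(),
                          std::back_inserter(found));
    return found;
}
```

Each matching word appears once in the result, in sorted order.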
