Search a string for any occurrence of a word in a list of strings

Question

I want to know how, in C++, to search a string for the first instance of ANY of a list of strings. A kind of full-word version of std::string::find_first_of(): "Searches the string for the first character that matches any of the characters specified in its arguments".

I want something that will search the string for the first WORD that matches any of the words in a provided list/array. To be clear, I don't want to search an array for an instance of a string. I want to search a string, for an instance of something in an array.

My goal is to be able to take a sentence and remove all words that I have in a list. For example, if I give it the list {"the", "brown", "over"}; and the sentence, "the quick brown fox jumped over the lazy dog", I want it to output, " quick fox jumped lazy dog". And I want to be able to give it a list of even 100 words if I want; I need this to be expandable.

The only solution I can think of is to use std::find(stringArray[0]) in a while loop on my block of text, and save the indexes that that word is found at, then put all that in another for loop and do that on every single word in my array, saving the indexes of each word into one huge list. Optionally then numerically sorting that list, and finally then going through and removing each word that is at a position in that list.

I'm really hoping there's a function or an easier way to do it, because my solution seems difficult and VERY slow, especially since I need to use it many times, on many different strings, to go through all the sentences of a 50,000 character block of text. Anything better optimized would be preferred.

When I want to use or search a string, I look up the methods available in the `std::string` class. I found this [`find_first_of`](http://en.cppreference.com/w/cpp/string/basic_string/find_first_of) function which looks promising. — Thomas Matthews, Jan 10 '16 at 02:03
Please edit your question with your code attempt. We can then assist you with it. — Thomas Matthews, Jan 10 '16 at 02:04

Christophe · Accepted Answer · 2016-01-10T02:57:24.583

If you are looking for standard functions, there are some possibilities if you are willing to store your sentence as a container of strings:

string input="Hello, world ! I whish you all \na happy new year 2016 !";
vector<string> sentence; 

stringstream sst(input);    // split the string into its pieces 
string tmp; 
while (sst>>tmp) 
    sentence.push_back(tmp);

Of course, in the real world you would split not just on whitespace but also on punctuation. This is just a proof of concept.
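As a rough illustration of splitting on punctuation as well as whitespace, here is a minimal sketch. The function name split_words and the default delimiter set are assumptions made for this example, not part of the answer above; adjust the delimiters to your text.

```cpp
#include <string>
#include <vector>

// Split `text` into words, treating every character in `delims` as a separator.
// Hypothetical helper: the name and the default delimiter set are assumptions.
std::vector<std::string> split_words(const std::string& text,
                                     const std::string& delims = " \t\n.,;:!?")
{
    std::vector<std::string> words;
    // Skip any leading delimiters to find the start of the first word.
    std::size_t start = text.find_first_not_of(delims);
    while (start != std::string::npos) {
        // The word ends at the next delimiter (or at the end of the text).
        std::size_t end = text.find_first_of(delims, start);
        // If end is npos, substr clamps the count to the rest of the string.
        words.push_back(text.substr(start, end - start));
        // Skip the run of delimiters that follows the word.
        start = text.find_first_not_of(delims, end);
    }
    return words;
}
```

Calling split_words("Hello, world ! I wish") would give the four words Hello, world, I and wish, with the punctuation dropped.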

Once you have it in this form, it's easy to use the <algorithm> form of find_first_of():

vector<string> search{"We", "You", "I"}; 
auto it =  find_first_of(sentence.begin(), sentence.end(), 
                           search.begin(), search.end()); 

// display the remainder of the sentence, starting at the first match
copy(it, sentence.end(), ostream_iterator<string>(cout, "/"));
cout << endl;

Deleting words from the vector should then no longer be an issue. I leave it to you as an exercise.
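For readers who want a starting point for that exercise, one common way is the erase-remove idiom. This is only a sketch under the assumption that the words to delete are held in a second vector; the name remove_listed_words is made up for this example.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Remove every element of `sentence` that also appears in `unwanted`.
// Hypothetical helper name; a linear std::find is used for the membership test,
// which is fine for short lists but slow for very long ones.
void remove_listed_words(std::vector<std::string>& sentence,
                         const std::vector<std::string>& unwanted)
{
    auto is_unwanted = [&unwanted](const std::string& word) {
        return std::find(unwanted.begin(), unwanted.end(), word) != unwanted.end();
    };
    // remove_if moves the kept words to the front; erase drops the leftover tail.
    sentence.erase(std::remove_if(sentence.begin(), sentence.end(), is_unwanted),
                   sentence.end());
}
```

Note that erasing invalidates iterators obtained earlier, so any iterator into the vector has to be looked up again afterwards.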

Once you have your cleaned vector you can rebuild a string:

stringstream so;
copy(sentence.begin(), sentence.end(), ostream_iterator<string>(so, " "));  // every word is followed by a space
string result = so.str();

Here is an online demo.

This solution won't address all your performance issues, however. For that, you need to analyse further where your performance bottleneck comes from: are you making a lot of unnecessary copies of objects? Is your own algorithm triggering a lot of inefficient memory allocations? Or is it really the sheer volume of text?

Some ideas for further work:

build an alphabetical index to the words in the sentence (map> where the unsigned
Consider a trie data structure (trie, not tree!)
Use regular expressions in <regex>
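To give an idea of the regular-expression route, here is a minimal sketch that joins the words into one alternation and deletes every whole-word match. The function name remove_words_regex is an assumption for this example, and the code assumes the words contain no regex metacharacters (they are not escaped).

```cpp
#include <regex>
#include <string>
#include <vector>

// Delete every whole-word occurrence of any entry in `words` from `text`.
// Hypothetical helper; assumes the words contain no regex metacharacters.
std::string remove_words_regex(const std::string& text,
                               const std::vector<std::string>& words)
{
    if (words.empty())
        return text;  // nothing to remove

    // Build the pattern \b(?:w1|w2|...)\b (default ECMAScript grammar).
    std::string pattern = "\\b(?:";
    for (std::size_t i = 0; i < words.size(); ++i) {
        if (i != 0)
            pattern += '|';
        pattern += words[i];
    }
    pattern += ")\\b";

    // Replace each match with the empty string; surrounding spaces are kept.
    return std::regex_replace(text, std::regex(pattern), "");
}
```

The \b anchors keep "the" from matching inside "then" or "other". If the same word list is applied to many strings, it is probably worth building the std::regex once and reusing it, since constructing it is the expensive part.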

Thank you, I'll try these suggestions. Very helpful! — Kyle B, Jan 11 '16 at 14:25

score 1 · Answer 2 · answered Jan 10 '16 at 13:39

What counts as fast for some people is slow for others, so it is hard to say which kind of fast you mean. Also, 50,000 characters doesn't sound so large that one must do something extraordinary.

The only thing that should be avoided is manipulating the input string in place, which would result in O(n^2) running time; just return a new string instead. It's probably wise to reserve enough memory for the resulting string, because that saves a constant factor for some inputs.

Here is my proposal:

std::string remove_words(const std::string &sentence, const std::set<std::string> &words2remove, const std::string &delimiters){

    std::string result;
    result.reserve(sentence.size());//ensure there is enough space

    std::string lastDelimiter;//no delimiter so far...
    size_t cur_position=0;
    while(true){
      size_t next=sentence.find_first_of(delimiters, cur_position);
      std::string token=sentence.substr(cur_position, next-cur_position);

      result+=lastDelimiter;
      if(words2remove.find(token)==words2remove.end())
         result+=token;//not forbidden

      if(next==std::string::npos)
        break;

      //prepare for the next iteration:  
      lastDelimiter=sentence[next];
      cur_position=next+1;
    }

    return result;
}

This method uses a set rather than a list of forbidden words because look-up is faster. Any set of characters can be used as delimiters, e.g. " " or " ,.;".

It runs in O(n*log(k)) where n is the number of characters in the sentence and k the number of words in the forbidden set.

You may want to look into boost::tokenizer if you need a more flexible tokenizer and don't want to reinvent the wheel.

If the number of forbidden words is large, consider using std::unordered_set (C++11) or boost::unordered_set instead of std::set to reduce the expected running time of the algorithm to O(n).
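A minimal sketch of the hash-set variant follows. It is the same single-pass idea with std::unordered_set swapped in; the name remove_words_hashed is made up for this example, and the loop is slightly rearranged so that each delimiter is appended right after its token.

```cpp
#include <string>
#include <unordered_set>

// Copy `sentence` into a new string, skipping every token found in `words2remove`.
// Delimiters are copied through unchanged. Hypothetical name, sketch only.
std::string remove_words_hashed(const std::string& sentence,
                                const std::unordered_set<std::string>& words2remove,
                                const std::string& delimiters)
{
    std::string result;
    result.reserve(sentence.size());  // the result is never longer than the input

    std::size_t pos = 0;
    while (true) {
        std::size_t next = sentence.find_first_of(delimiters, pos);
        // If next is npos, substr clamps the count to the rest of the string.
        std::string token = sentence.substr(pos, next - pos);
        if (words2remove.count(token) == 0)
            result += token;          // expected O(1) look-up per token
        if (next == std::string::npos)
            break;
        result += sentence[next];     // keep the delimiter itself
        pos = next + 1;
    }
    return result;
}
```

With the question's sentence, the word set {"the", "brown", "over"} and " " as the delimiter, the removed words disappear while the spaces around them stay, so runs of spaces appear where words were deleted.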

Thank you, this is very detailed and helpful. I wish I could pick more than one best answer...! — Kyle B, Jan 11 '16 at 14:26
