String Tokenizer with multiple delimiters including delimiter without Boost

Question

I need to create a string parser in C++. I tried using

vector<string> Tokenize(const string& strInput, const string& strDelims)
{
 vector<string> vS;

 string strOne = strInput;
 string delimiters = strDelims;

 int startpos = 0;
 int pos = strOne.find_first_of(delimiters, startpos);

 while (string::npos != pos || string::npos != startpos)
 {
  if(strOne.substr(startpos, pos - startpos) != "")
   vS.push_back(strOne.substr(startpos, pos - startpos));

  // if delimiter is a new line (\n) then add new line
  if(strOne.substr(pos, 1) == "\n")
   vS.push_back("\\n");
  // else if the delimiter is not a space
  else if (strOne.substr(pos, 1) != " ")
   vS.push_back(strOne.substr(pos, 1));

  if( string::npos == strOne.find_first_not_of(delimiters, pos) )
   startpos = strOne.find_first_not_of(delimiters, pos);
  else
   startpos = pos + 1;

  pos = strOne.find_first_of(delimiters, startpos);

 }

 return vS;
}

This works for 2X+7cos(3Y)

(tokenizer("2X+7cos(3Y)","+-/^() \t");)

But gives a runtime error for 2X

I need a non-Boost solution.

I tried using C++ String Toolkit (StrTk) Tokenizer

std::vector<std::string> results;
strtk::split(delimiter, source,
             strtk::range_to_type_back_inserter(results),
             strtk::tokenize_options::include_all_delimiters);

 return results;

but it doesn't return the delimiter as a separate token.

For example, if I give the input 2X+3Y, the output vector contains

2X+

3Y

Presumably you need to protect `pos = str.find_first_of(delimiters, lastPos)` from the case where `lastPos` is `npos`. — ooga, Jul 01 '15 at 05:05
If you're going to show code using a non-Standard library (ostensibly [this](http://www.codeproject.com/Articles/23198/C-String-Toolkit-StrTk-Tokenizer)), you should name it in the question, provide a link, and consider adding a related tag to your question. — Tony Delroy, Jul 01 '15 at 06:05
I added the StrTk part to show that that solution didn't fix my problem either. I'll add the link now. — user2473015, Jul 01 '15 at 14:43

score 2 · Answer 1 · answered Jul 01 '15 at 05:07

What's probably happening is this is crashing when passed npos:

lastPos = str.find_first_not_of(delimiters, pos);

Just add breaks to your loop instead of relying on the while clause to break out of it.

if (pos == string::npos)
  break;
lastPos = str.find_first_not_of(delimiters, pos);

if (lastPos == string::npos)
  break;
pos = str.find_first_of(delimiters, lastPos);
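
For reference, here is a minimal sketch of what the whole function might look like with an explicit break. The name TokenizeWithBreaks is made up for illustration, and it assumes the behaviour described in the question: every delimiter except a space becomes its own token, and a newline is emitted as the two characters \n. It uses std::string::size_type instead of int so that comparisons against npos are exact. Treat it as one possible fix, not the only one.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not from the question): splits `input` on any
// character in `delims` and keeps each non-space delimiter as its own token.
std::vector<std::string> TokenizeWithBreaks(const std::string& input,
                                            const std::string& delims)
{
    std::vector<std::string> tokens;
    std::string::size_type start = 0;  // size_type, not int, so npos compares exactly

    while (start < input.size())
    {
        const std::string::size_type pos = input.find_first_of(delims, start);

        // Text between the previous delimiter and this one (or the end of
        // the input). substr clamps the count, so pos == npos is fine here.
        if (pos != start)
            tokens.push_back(input.substr(start, pos - start));

        // No delimiter left: stop before anything indexes the string with npos.
        if (pos == std::string::npos)
            break;

        if (input[pos] == '\n')
            tokens.push_back("\\n");                       // newline marker, as in the question
        else if (input[pos] != ' ')
            tokens.push_back(std::string(1, input[pos]));  // delimiter as its own token

        start = pos + 1;
    }
    return tokens;
}
```

Called as TokenizeWithBreaks("2X+7cos(3Y)", "+-/^() \t"), this should yield 2X, +, 7cos, (, 3Y and ), and a plain "2X" comes back as a single token instead of throwing.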

score 1 · Accepted Answer · answered Jul 01 '15 at 06:19

Loop exit condition is broken:

while (string::npos != pos || string::npos != startpos)

It allows entry with, say, pos = npos and startpos = 1.

So

strOne.substr(startpos, pos - startpos)
strOne.substr(1, npos - 1)

The count is not npos, but substr clamps the count to the end of the string, so this call survives as long as startpos is not past the end of the string.

If pos = npos and startpos = 0,

strOne.substr(startpos, pos - startpos)

lives, but

strOne.substr(pos, 1) == "\n"
strOne.substr(npos, 1) == "\n"

dies. So does

strOne.substr(pos, 1) != " "

Sadly I'm out of time and can't solve this right now, but QuestionC's got the right idea. Better filtering. Something along the lines of:

    if (string::npos != pos)
    {
        if (strOne.substr(pos, 1) == "\n") // can possibly simplify this with strOne[pos] == '\n'
            vS.push_back("\\n");
        // else if the delimiter is not a space
        else if (strOne[pos] != ' ')
            vS.push_back(strOne.substr(pos, 1));
    }
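
If you want to check which of the substr calls discussed above actually throw, a small helper like the one below can do it. SubstrThrows is a hypothetical name, not part of the question's code. As far as the standard library specification goes, substr throws std::out_of_range only when the start index is greater than size(); an oversized count is clamped to the end of the string.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: reports whether s.substr(pos, count) throws
// std::out_of_range (which it does only when pos > s.size()).
bool SubstrThrows(const std::string& s,
                  std::string::size_type pos,
                  std::string::size_type count)
{
    try
    {
        (void)s.substr(pos, count);  // result is discarded, only the throw matters
        return false;
    }
    catch (const std::out_of_range&)
    {
        return true;
    }
}
```

With s = "2X", SubstrThrows(s, 1, std::string::npos - 1) should be false (the result is just "X"), while SubstrThrows(s, std::string::npos, 1) should be true. The second case is the one the guard above protects against.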

Jordan Harris · Answer 3 · 2015-07-01T07:46:28.573

I created a little function that splits a string into substrings (which are stored in a vector), and it lets you set which characters you want to treat as whitespace. Normal whitespace will still be treated as whitespace, so you don't have to define that. Actually, all it does is turn the characters you defined as whitespace into actual whitespace (space char ' '). Then it runs the result through a stream (stringstream) to separate the substrings and store them in a vector. This may not be what you need for this particular problem, but maybe it can give you some ideas.

// split a string into its whitespace-separated substrings and store
// each substring in a vector<string>. Whitespace can be defined in argument
// w as a string (e.g. ".;,?-'")
vector<string> split(const string& s, const string& w)
{
    string temp{ s };
    // go through each char in temp (or s)
    for (char& ch : temp) {     
        // check if any characters in temp (s) are whitespace defined in w
        for (char white : w) {  
            if (ch == white)
                ch = ' ';       // if so, replace them with a space char (' ')
        }
    }

    vector<string> substrings;
    stringstream ss{ temp };

    for (string buffer; ss >> buffer;) {
        substrings.push_back(buffer);
    }
    return substrings;
}

Interesting, but very heavy on the brute force. Have you considered using a `set` in place of `string` in w? You can reduce the `for (char white : w)` loop to `if (w.find(ch) != w.end())`. Not awesome, but not N-squared. — user4581301, Jul 01 '15 at 05:30
Hmm... I haven't thought of that. To be honest, I'm pretty new to C++ and programming in general, so there's a lot I don't know. I'll have to give that a try though and test the performance of both ways. I do agree that the way I'm doing it now is on the heavy side. Hey, it works though. I'm always down to try a different, more efficient way. Thanks for the comment. — Jordan Harris, Jul 01 '15 at 05:43
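
A rough sketch of the set-based variant suggested in the comment might look like this. SplitWithSet is a made-up name, and std::unordered_set<char> is just one possible choice of container (a std::set<char> would work the same way here). Like the original split, it throws the delimiters away, so on its own it does not give the asker delimiters as separate tokens.

```cpp
#include <sstream>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical variant of split(): the extra "whitespace" characters in w
// are copied into a set once, so each character of s needs a single lookup.
std::vector<std::string> SplitWithSet(const std::string& s, const std::string& w)
{
    const std::unordered_set<char> extra(w.begin(), w.end());

    std::string temp{s};
    for (char& ch : temp)
    {
        if (extra.find(ch) != extra.end())
            ch = ' ';  // turn user-defined whitespace into a real space
    }

    // operator>> skips runs of whitespace, so empty substrings never appear.
    std::vector<std::string> substrings;
    std::istringstream ss{temp};
    for (std::string buffer; ss >> buffer;)
        substrings.push_back(buffer);
    return substrings;
}
```

For example, SplitWithSet("2X+7cos(3Y)", "+-/^()") should give 2X, 7cos and 3Y.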

score 0 · Answer 4 · answered Jul 01 '15 at 05:58

It would be great if you could share some info on your environment. Your program ran fine with an input value of 2X on my Fedora 20 using g++.

Faisal

This answer is more appropriate as a comment and not really an answer to the question – SteveFerg Jul 01 '15 at 06:19
I'm in Win 8.1 with MinGW C++ compiler – user2473015 Jul 01 '15 at 14:38
