strtok how to also include delimiters as tokens

Question

Right now I have code set up to divide up my string into tokens with delimiters of ,;= and space. I would also like to include the special characters as tokens.

char * cstr = new char [str.length()+1];
strcpy (cstr, str.c_str());

char * p = strtok (cstr," ,;=");

while (p!=0)
{
    whichType(p);
    p = strtok(NULL," ,;=");
}

Right now, if I print the tokens of a string such as `asd sdf qwe wer,sdf;wer`, I get

asd
sdf
qwe
wer
sdf
wer

I want it to look like

asd
sdf
qwe
wer
,
sdf
;
wer

Any help would be great. Thanks

I'd like to point out that Boost has [String Algorithms](http://www.boost.org/doc/libs/1_54_0/doc/html/string_algo.html), [Regex `regex_token_iterator`](http://www.boost.org/doc/libs/1_54_0/libs/regex/doc/html/boost_regex/ref/regex_token_iterator.html) and, last but not least, [Boost Spirit](http://www.boost.org/doc/libs/1_54_0/libs/spirit/doc/html/spirit/qi.html). I've got a _lot_ of answers in the latter category, so, if you want an idea, just [search, e.g. with `delimiter`](http://stackoverflow.com/search?q=user%3A85371+%5Bboost-spirit%5D+delimiter) — sehe, Sep 26 '13 at 21:51
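For reference, here is a minimal sketch of the token-iterator idea using the standard library's `std::sregex_token_iterator` (C++11) in place of the Boost one mentioned in the comment. The helper name `regex_tokens` and the pattern are assumptions chosen for this question's delimiters, not something taken from the linked documentation.

```cpp
#include <iterator>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: every match is either a single , ; or = on its own,
// or a run of characters that are neither those nor a space.
std::vector<std::string> regex_tokens(const std::string& text)
{
    static const std::regex token("[,;=]|[^ ,;=]+");
    // Submatch 0 (the default) yields each whole match in order.
    std::sregex_token_iterator first(text.begin(), text.end(), token);
    std::sregex_token_iterator last;
    return std::vector<std::string>(first, last);
}
```

Spaces never match either alternative, so they separate tokens without being reported, while the other delimiters come out as one-character tokens.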

sehe · Accepted Answer · 2013-10-16T08:27:21.937

You need more flexibility. (Besides, `strtok` is a bad, error-prone interface.)
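To illustrate why `strtok` cannot report the delimiters: it ends each token by overwriting the delimiter that follows it with `'\0'`, so that character is gone by the time the token is returned. The sketch below makes this visible; the helper name `buffer_after_strtok` and the `'|'` marker are assumptions for the demonstration only.

```cpp
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: runs strtok over a private copy of `text` and returns
// the buffer afterwards, with every '\0' that strtok wrote shown as '|'.
std::string buffer_after_strtok(const std::string& text, const char* delims)
{
    std::vector<char> buf(text.begin(), text.end());
    buf.push_back('\0');                       // strtok needs a C string

    // Walk all tokens; only the side effect on buf matters here.
    for (char* p = std::strtok(buf.data(), delims); p != nullptr;
         p = std::strtok(nullptr, delims))
    {
    }

    std::string shown(buf.begin(), buf.end() - 1); // drop the final terminator
    for (char& c : shown)
        if (c == '\0')
            c = '|';
    return shown;
}
```

Calling it on `"wer,sdf;wer"` with the delimiters `" ,;="` gives `"wer|sdf|wer"`: the `,` and `;` have been replaced in the buffer, so no later call can recover them.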

Here's a flexible algorithm that generates tokens, copying them to an output iterator. This means you can use it to fill a container of your choice, or print it directly to an output stream (which is what I'll use as a demo).

The behaviour is specified in option flags:

enum tokenize_options
{
    tokenize_skip_empty_tokens              = 1 << 0,
    tokenize_include_delimiters             = 1 << 1,
    tokenize_exclude_whitespace_delimiters  = 1 << 2,
    //
    tokenize_options_none    = 0,
    tokenize_default_options =   tokenize_skip_empty_tokens 
                               | tokenize_exclude_whitespace_delimiters
                               | tokenize_include_delimiters,
};

Note how I distilled an extra requirement that you hadn't named but that your sample implies: you want the delimiters output as tokens unless they're whitespace (' '). This is what the third option is for: tokenize_exclude_whitespace_delimiters.

Now here's the real meat:

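#include <algorithm> // std::find
#include <cctype>    // std::isspace
#include <iterator>  // std::begin, std::end
#include <string>

template <typename Input, typename Delimiters, typename Out>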
Out tokenize(
        Input const& input,
        Delimiters const& delim,
        Out out,
        tokenize_options options = tokenize_default_options
        )
{
    // decode option flags
    const bool includeDelim   = options & tokenize_include_delimiters;
    const bool excludeWsDelim = options & tokenize_exclude_whitespace_delimiters;
    const bool skipEmpty      = options & tokenize_skip_empty_tokens;

    using namespace std;
    string accum;

    for(auto it = begin(input), last = end(input); it != last; ++it)
    {
        if (find(begin(delim), end(delim), *it) == end(delim))
        {
            accum += *it;
        }
        else
        {
            // output the token
            if (!(skipEmpty && accum.empty()))
                *out++ = accum;   // empty tokens are dropped when skipEmpty is set

            // output the delimiter
            // (cast to unsigned char: std::isspace is undefined for negative char values)
            bool isWhitespace = std::isspace(static_cast<unsigned char>(*it)) || (*it == '\0');
            if (includeDelim && !(excludeWsDelim && isWhitespace))
            {
                *out++ = { *it }; // dump the delimiter as a separate token
            }

            accum.clear();
        }
    }

    if (!accum.empty())
        *out++ = accum;

    return out;
}

A full demo is Live on Ideone (default options) and Live on Coliru (no options)

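#include <iostream>
#include <iterator> // std::ostream_iterator
#include <sstream>

int main()
{
    // collect the tokens in a string stream, one per line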
    std::ostringstream oss;
    std::ostream_iterator<std::string> out(oss, "\n"); 

    tokenize("asd sdf qwe wer,sdf;wer", " ;,", out/*, tokenize_options_none*/);

    std::cout << oss.str();
    // that's all, folks
}

Prints:

asd
sdf
qwe
wer
,
sdf
;
wer

Fixed two minor issues. (Decided that `\0` is whitespace too. This was relevant because I used raw null-terminated char[] as inputs.) — sehe, Sep 26 '13 at 21:48
Unrelated to the question, but I'm having trouble compiling this. Tried VS2010 and g++. — Andrew Tsay, Sep 28 '13 at 16:04
@atsay714 Oh well, as you're probably aware it's because of lacking c++11 support (did you _enable_ it on gcc? Because gcc has had that for _years_). Here's a strictly c++03 compatible version should you get stuck: http://coliru.stacked-crooked.com/a/4c4a7dab39aa78d0 — sehe, Sep 29 '13 at 12:00
Is there anyway to output the tokens to a string variable instead of stdout? — Andrew Tsay, Oct 16 '13 at 07:54
@atsay714 Of course, it's a generic algorithm which takes output iterator. Just use your desired output iterator! E.g. see edited answer or e.g. **[tokenizing to a vector, and sorting them](http://coliru.stacked-crooked.com/a/93a840e5904b7115)**. Hope that helps :/ — sehe, Oct 16 '13 at 08:30
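As a rough illustration of the "use your desired output iterator" point, here is a condensed stand-in for the answer's algorithm that fills a `std::vector` through `std::back_inserter`. The names `tokenize_sketch` and `tokens_of` are assumptions, and the option flags are reduced to the behaviour the question asks for (non-space delimiters become tokens, empty tokens are skipped).

```cpp
#include <iterator>
#include <string>
#include <vector>

// Condensed stand-in for the answer's tokenize(): writes every token to `out`.
// Delimiters become tokens of their own unless they are spaces.
template <typename Out>
Out tokenize_sketch(const std::string& input, const std::string& delims, Out out)
{
    std::string accum;
    for (char c : input) {
        if (delims.find(c) == std::string::npos) {
            accum += c;                 // ordinary character: extend the token
            continue;
        }
        if (!accum.empty()) {           // a delimiter ends the pending token
            *out++ = accum;
            accum.clear();
        }
        if (c != ' ')
            *out++ = std::string(1, c); // non-space delimiter is a token too
    }
    if (!accum.empty())
        *out++ = accum;                 // trailing token
    return out;
}

// Hypothetical convenience wrapper: collect the tokens in a vector.
std::vector<std::string> tokens_of(const std::string& input)
{
    std::vector<std::string> tokens;
    tokenize_sketch(input, " ,;=", std::back_inserter(tokens));
    return tokens;
}
```

The same function works unchanged with a `std::ostream_iterator<std::string>` over a `std::ostringstream` if a single string is wanted instead of a container.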

score 5 · Answer 2 · edited Nov 17 '17 at 22:08

I'm afraid you cannot use strtok for that; you'll need a proper tokenizer.

If your tokens are simple, I suggest you code it manually, i.e., that you scan the string character by character. If they're not, I suggest that you take a look at several alternatives. Or, if it's really complicated, that you use a special tool like flex.
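A hand-written scanner along these lines might look like the sketch below. It scans character by character and also records what kind of token it found, which could stand in for a later `whichType`-style classification. The names `TokenKind`, `Token` and `scan` are assumptions, as is the fixed delimiter set `,;=` plus space taken from the question.

```cpp
#include <cstddef>
#include <string>
#include <vector>

enum class TokenKind { Word, Delimiter };

struct Token {
    TokenKind kind;
    std::string text;
};

// Hypothetical scanner: spaces separate tokens and are dropped,
// while , ; and = are emitted as one-character Delimiter tokens.
std::vector<Token> scan(const std::string& input)
{
    const std::string delims = ",;=";
    std::vector<Token> tokens;
    std::size_t i = 0;
    while (i < input.size()) {
        const char c = input[i];
        if (c == ' ') {
            ++i;                                        // skip whitespace
        } else if (delims.find(c) != std::string::npos) {
            tokens.push_back({TokenKind::Delimiter, std::string(1, c)});
            ++i;
        } else {
            const std::size_t start = i;                // consume a whole word
            while (i < input.size() && input[i] != ' ' &&
                   delims.find(input[i]) == std::string::npos)
                ++i;
            tokens.push_back({TokenKind::Word, input.substr(start, i - start)});
        }
    }
    return tokens;
}
```

Keeping the kind next to the text means the caller does not have to re-inspect each token to tell a word from a delimiter.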

answered Sep 26 '13 at 19:24 by nickie · edited Nov 17 '17 at 22:08 by rici

Kumar Dewasish · Answer 3 · 2017-07-17T10:14:26.363

// Try the following code
#include <iostream>
#include <string>
#include <vector>

int main()
{
    std::string line = "asd sdf qwe wer,sdf;wer";
    std::vector<std::string> wordVector;
    std::size_t prev = 0, pos;
    // find each delimiter in turn; the text between two delimiters is a token
    while ((pos = line.find_first_of(" ,;", prev)) != std::string::npos) {
        if (pos > prev)
            wordVector.push_back(line.substr(prev, pos - prev));
        prev = pos + 1;
        // keep the delimiter itself as a token unless it is a space
        if (line[pos] != ' ')
            wordVector.push_back(std::string(1, line[pos]));
    }
    // whatever follows the last delimiter is the final token
    if (prev < line.length())
        wordVector.push_back(line.substr(prev));
    for (std::vector<std::string>::iterator it = wordVector.begin(); it != wordVector.end(); ++it)
        std::cout << "\n" << *it;
    return 0;
}

**OUTPUT**: [root@kumar-vm ~]# ./a.out

asd
sdf
qwe
wer
,
sdf
;
wer[root@kumar-vm ~]#
