
I'm working on a quasi-SCPI command parser and I want to split a string based on colons, ignoring quoted strings. I want to get an empty string if there is no text between colons.

If I use this regular expression in EditPad Pro 7.2.2, it does exactly what I want: (([^:\"']|\"[^\"]\"|'[^']')+)?

As an example, using this data string: :foo:::bar:baz

I get 6 hits: [empty],foo,[empty],[empty],bar,baz

So far, so good. However, in my code, using std::tr1::regex, I'm getting 9 hits with the same data string. It seems like I'm getting an extra empty hit after each non-empty hit.

void RICommandState::InitRawCommandEnum(const std::string& full_command)
{
    // Split string by colons, but ignore text within quotes.
    static const std::tr1::regex split_by_colon("(([^:\"']|\"[^\"]*\"|'[^']*')+)?");

    raw_command_list.clear();
    raw_command_index = 0;

    DebugPrintf(ZONE_REMOTE, (TEXT("InitRawCommandEnum FULL '%S'"), full_command.c_str()));

    const std::tr1::sregex_token_iterator end;
    for (std::tr1::sregex_token_iterator it(full_command.begin(),
                                            full_command.end(),
                                            split_by_colon);
         it != end;
         it++)
    {
        raw_command_list.push_back(*it);
        const std::string temp(*it);
        DebugPrintf(ZONE_REMOTE, (TEXT("InitRawCommandEnum '%S'"), temp.c_str()));
    }

    DebugPrintf(ZONE_REMOTE, (TEXT("InitRawCommandEnum hits = %d"), raw_command_list.size()));
}

And here is my output:

InitRawCommandEnum FULL ':foo:::bar:baz'
InitRawCommandEnum ''
InitRawCommandEnum 'foo'
InitRawCommandEnum ''
InitRawCommandEnum ''
InitRawCommandEnum ''
InitRawCommandEnum 'bar'
InitRawCommandEnum ''
InitRawCommandEnum 'baz'
InitRawCommandEnum ''
InitRawCommandEnum hits = 9

The most important question is how can I get my regex search to yield one (and only one) hit for every token delimited by a colon? Is the problem with my search expression?

Or maybe I'm misinterpreting the results? Do the empty strings after the non-empty strings have a special meaning? If so, what? And if that's the case, then is the correct solution to simply ignore them?

As a side question, I'm deeply curious why my code is behaving differently than EditPad Pro. EditPad is a useful test environment for experimenting with regular expressions, and it would be nice to know what the gotchas are.

Thanks!

Ryan Clark
  • What happens if you remove the surrounding `(` `)?` ? I suspect this may be allowing the match to be optional, and since the character `:` doesn't match then technically the optional match did, so the empty capture is printed. Just a theory though... – JaredC Jan 12 '13 at 04:03
  • Without the outermost parentheses it matches only one character, but that gave me an idea. `(([^:\"']|\"[^\"]\"|'[^']')*)` seems to be working in EditPad. I'll give it a try when I get back into the office tomorrow. – Ryan Clark Jan 13 '13 at 23:48

1 Answer


It's still not clear to me what the empty strings mean, but I was able to work around them by ignoring them. I track the position of each hit within the search string and only accept a hit that starts past the end of the previously accepted hit and its trailing colon.
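
A possible explanation, offered as a sketch rather than a confirmed diagnosis: after a non-empty hit such as foo, the iterator resumes searching at the colon that ended it, and because the pattern is allowed to match nothing, it reports an empty hit at that colon. The snippet below collects the offset of every hit so this can be checked. It assumes C++11 `std::regex` iterates the same way as `std::tr1::regex`, and `list_hits` is a hypothetical helper, not part of the original code.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: returns (offset, text) for every hit the iterator
// reports, including the empty ones, so their positions can be inspected.
std::vector<std::pair<std::ptrdiff_t, std::string>> list_hits(const std::string& input)
{
    // Same expression as in the answer: a run of unquoted characters or
    // complete quoted strings, possibly empty.
    static const std::regex pattern("(?:[^:\"']|\"[^\"]*\"|'[^']*')*");

    std::vector<std::pair<std::ptrdiff_t, std::string>> hits;
    const std::sregex_iterator end;
    for (std::sregex_iterator it(input.begin(), input.end(), pattern);
         it != end;
         ++it)
    {
        hits.emplace_back(it->position(), it->str());
    }
    return hits;
}
```

Calling `list_hits(":foo:::bar:baz")` should return nine hits, with empty ones at offsets 0, 4, 5, 6, 10 and 14. Those at 4, 10 and 14 sit directly at the end of foo, bar and baz, so they look like by-products of resuming the search there rather than real tokens.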

Here's my code, without modification. Note that my regex search expression is slightly different, but that's not critical to the answer.

void RICommandState::InitRawCommandEnum(const std::string& full_command)
{
    // Split string by colons, but ignore text within quotes.
    static const std::tr1::regex split_by_colon("(?:[^:\"']|\"[^\"]*\"|'[^']*')*");

    raw_command_list.clear();
    raw_command_index = 0;

    std::tr1::sregex_iterator::difference_type minPosition = 0;
    const std::tr1::sregex_iterator end;
    for (std::tr1::sregex_iterator it(full_command.begin(),
                                      full_command.end(),
                                      split_by_colon);
         it != end;
         it++)
    {
        if (it->position() >= minPosition)
        {
            raw_command_list.push_back(it->str());
            minPosition = it->position() + it->length() + 1;
        }
    }
}
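
For experimenting outside the class, the same filtering can be written as a free function. This is a minimal sketch assuming a C++11 compiler with `std::regex` in place of `std::tr1::regex`; the name `split_by_colon_tokens` and the return-by-value interface are illustrative choices, not part of the original code.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical free-function version of the member function above.
// Splits on colons, treating quoted text as part of a token.
std::vector<std::string> split_by_colon_tokens(const std::string& full_command)
{
    static const std::regex token("(?:[^:\"']|\"[^\"]*\"|'[^']*')*");

    std::vector<std::string> tokens;
    // Earliest offset at which the next real token may start.
    std::smatch::difference_type min_position = 0;

    const std::sregex_iterator end;
    for (std::sregex_iterator it(full_command.begin(), full_command.end(), token);
         it != end;
         ++it)
    {
        // Skip empty hits reported at the colon that ends the previous token.
        if (it->position() >= min_position)
        {
            tokens.push_back(it->str());
            // Step past this token and the colon that follows it.
            min_position = it->position() + it->length() + 1;
        }
    }
    return tokens;
}
```

With this sketch, `split_by_colon_tokens(":foo:::bar:baz")` should give the six tokens from the question, and the input `a:"b:c":d` should keep the quoted colon inside the second token. Malformed input, such as an unterminated quote, is not handled specially.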
Ryan Clark