I'm trying to parse a string of the form <tag>=<value> using regular expressions, but have hit some issues adding support for quoted values. The idea is that any unquoted value should be trimmed of leading and trailing whitespace, so that [ Hello ] becomes [Hello]. (Please ignore the square brackets.)
However, when the value is quoted, I want anything up to and including the double quotes to be removed but no further, so [ " Hello World " ] would become [" Hello World "].
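To pin down the intended behaviour independently of any regex, here is a minimal sketch in plain C++. The helper name `cleanValue` and the choice of whitespace characters are assumptions made for illustration; they are not part of the code in this question.

```cpp
#include <string>

// Hypothetical helper (not from the original code): applies the trimming
// rules described above to a raw value string.
std::string cleanValue(const std::string& raw)
{
    const std::string ws = " \t\r\n";   // assumed definition of "whitespace"
    const std::string::size_type first = raw.find_first_not_of(ws);
    if (first == std::string::npos)
        return "";                      // nothing but whitespace
    const std::string::size_type last = raw.find_last_not_of(ws);
    const std::string trimmed = raw.substr(first, last - first + 1);

    // Quoted value: drop the two quotes, keep everything between them as-is.
    if (trimmed.size() >= 2 && trimmed.front() == '"' && trimmed.back() == '"')
        return trimmed.substr(1, trimmed.size() - 2);

    // Unquoted value: the trimmed text is the result.
    return trimmed;
}
```

Under these assumptions, `cleanValue("  Hello  ")` gives `Hello`, while `cleanValue(" \" Hello World \" ")` keeps the spaces inside the quotes and drops only the quotes and the whitespace outside them.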
So far, I've come up with the following code and pattern (note that some of the characters have been escaped or doubly escaped to avoid them being interpreted as trigraphs or other C escape sequences):
#include <iostream>
#include <string>
#include <boost/regex.hpp>

using std::cout;
using std::endl;
using std::string;

void getTagVal( const std::string& tagVal )
{
    boost::smatch what;
    // \? is escaped so that the "??(" sequence is not read as a trigraph.
    static const boost::regex pp("^\\s*([a-zA-Z0-9_-]+)\\s*=\\s*\"\?\?([%:\\a-zA-Z0-9 /\\._]+?)\"\?\?\\s*$");
    if ( boost::regex_match( tagVal, what, pp ) )
    {
        // what[1] is the tag, what[2] is the value.
        const string tag = static_cast<const string&>( what[1] );
        const string val = static_cast<const string&>( what[2] );
        cout << "Tag = [" << tag << "] Val = [" << val << "]" << endl;
    }
}

int main( int argc, char* argv[] )
{
    getTagVal("Qs1= \" Hello World \" ");
    getTagVal("Qs2=\" Hello World \" ");
    getTagVal("Qs3= \" Hello World \"");
    getTagVal("Qs4=\" Hello World \"");
    getTagVal("Qs5=\"Hello World \"");
    getTagVal("Qs6=\" Hello World\"");
    getTagVal("Qs7=\"Hello World\"");
    return 0;
}
Taking out the double escaping, this breaks down as:
^ - Start of line.
\s* - an optional amount of whitespace.
([a-zA-Z0-9_-]+) - One or more alphanumerics or a dash or underscore. This is captured as the tag.
\s* - an optional amount of whitespace.
= - an "equal" symbol.
\s* - an optional amount of whitespace.
"?? - an optional double quote (non-greedy).
([%:\a-zA-Z0-9 /\._]+?) - One or more alphanumerics or a space, underscore, percent, colon, period, forward or back slash. This is captured as the value (non-greedy).
"?? - an optional double quote (non-greedy).
\s* - an optional amount of whitespace.
$ - End of line.
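One detail of the breakdown is easy to overlook: a non-greedy optional such as "?? is first tried with zero occurrences, and only consumes the quote if the rest of the pattern cannot match otherwise. The sketch below isolates that behaviour. It uses std::regex instead of Boost and replaces the value character class with .+? (which can match a quote); both are assumptions made to keep the example self-contained, and `captureLazy` is a hypothetical name.

```cpp
#include <regex>
#include <string>

// Hypothetical helper for illustration: returns what the value group captures
// when that group is able to match any character, including a double quote.
std::string captureLazy(const std::string& input)
{
    // Same shape as the value part of the pattern: "?? (value)+? "??
    // A custom raw-string delimiter is used because the pattern contains )"
    static const std::regex re(R"re(^"??(.+?)"??$)re");
    std::smatch what;
    if (std::regex_match(input, what, re))
        return what[1].str();
    return "";
}
```

With this pattern, the leading "?? matches nothing and the opening quote is taken by the group, while the trailing "?? is forced to consume the closing quote so that $ can match. The lazy optional alone therefore does not guarantee that an opening quote stays out of the capture.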
For the example calls in main(), I would expect to get:
Tag = [Qs1] Val = [ Hello World ]
Tag = [Qs2] Val = [ Hello World ]
Tag = [Qs3] Val = [ Hello World ]
Tag = [Qs4] Val = [ Hello World ]
Tag = [Qs5] Val = [Hello World ]
Tag = [Qs6] Val = [ Hello World]
Tag = [Qs7] Val = [Hello World]
but what I actually get is:
Tag = [Qs1] Val = [" Hello World ]
Tag = [Qs2] Val = [" Hello World ]
Tag = [Qs3] Val = [" Hello World ]
Tag = [Qs4] Val = [" Hello World ]
Tag = [Qs5] Val = ["Hello World ]
Tag = [Qs6] Val = [" Hello World]
Tag = [Qs7] Val = ["Hello World]
So it's almost correct but for some reason the first quote is hanging around in the output value even though I specifically bracket the value section of the regex with the quote outside it.
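One thing that may be worth checking, offered as an assumption rather than a confirmed diagnosis: once the C++ escaping is removed, the value class starts with [%:\a-z. If the regex engine reads \a as the bell character (0x07), as Perl-style syntaxes generally do, then \a-z is a single range from 0x07 to 'z' rather than a backslash followed by a-z. The sketch below only checks the character arithmetic behind that idea; `inRange` is a hypothetical helper, not a Boost function.

```cpp
// Hypothetical check: does character c fall inside the range lo-hi,
// the way a regex character-class range such as [lo-hi] would test it?
bool inRange(char lo, char hi, char c)
{
    return lo <= c && c <= hi;
}

// '\a' is 0x07, '"' is 0x22 and 'z' is 0x7A in ASCII, so:
//   inRange('\a', 'z', '"')  is true  (the quote lies inside \a-z)
//   inRange('a',  'z', '"')  is false (the quote lies outside a-z)
```

If that is what happens here, the value group would be able to match a double quote, which would fit the output above where only the opening quote survives.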