How can I extract a search term using boost::regex in C++, especially when the term contains HTML entities?

For example:

p=test&amp;test&sort=price
The search term would be 'test&amp;test'

The Boost.Regex code is as follows:

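bool regexExtract(const string &strInput, string &strOutput, const string &regex)
{
  bool succ = false;  // never set to true below, so the function always returns false
  if ( ! strOutput.empty() )  // tests the output parameter; strInput was probably intended
  {
    boost::regex re(regex, boost::regex::perl);  // alternative flags: boost::regex::perl | boost::regex::icase
    boost::sregex_iterator res(strInput.begin(), strInput.end(), re);
    boost::sregex_iterator end;
    for (; res != end; ++res)
        cout << (*res)[0] << std::endl;  // sub-match 0 is the entire match, not capture group 1
  }
  return succ;
}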

With the regex string re = "p=(([^&;#]|(&.*?;))*).*"; it prints the following:

p=test&amp;test&sort=asc

while the same regex in Perl works as expected:

echo "p=test&asd;test&sort=asc" | perl -ne 'if ( $_ =~ /p=(([^\&;#]|(&.*?;))*).*/){print $1;}'
test&asd;test
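
One likely explanation for the difference: the Perl one-liner prints `$1` (the first capture group), while the C++ loop prints `(*res)[0]`, which is the entire match. Below is a minimal sketch of that indexing. It uses `std::regex` in place of `boost::regex` so that it builds without Boost (an assumption made for convenience; both number sub-matches the same way), and `subMatch` is a hypothetical helper, not part of the original code.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: returns sub-match `index` of the first match of
// `pattern` in `input`, or an empty string if there is no such sub-match.
std::string subMatch(const std::string& input,
                     const std::string& pattern,
                     std::size_t index) {
    const std::regex re(pattern);  // default ECMAScript grammar, Perl-like
    std::smatch m;
    if (!std::regex_search(input, m, re) || index >= m.size()) {
        return "";
    }
    // index 0 is the whole match; index 1 is the first capture group ($1 in Perl)
    return m[index].str();
}
```

With the pattern from the question, index 0 gives back the whole input line and index 1 gives only the search term. In the Boost version, the corresponding change would be to print `(*res)[1]` instead of `(*res)[0]`.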
  • Why do you have HTML entities in your URL-style query parameters? URL query parameters should be [percent-encoded](http://en.wikipedia.org/wiki/Percent-encoding) instead of using HTML entities. – Josh Kelley Sep 10 '13 at 12:20
  • Hi Josh, we receive user search queries that are HTML-encoded, and we want to apply a regex to grab the search phrase. – Shashi Sep 10 '13 at 12:33
  • BTW, I tried the regex in Perl and it worked: `echo "p=test&asd;test&sort=asc" | perl -ne 'if ( $_ =~ /p=(([^\&;#]|(&.*?;))*).*/){print $1;}'`. But the same regex didn't work in C++ with Boost? – Shashi Sep 10 '13 at 12:34
  • Got it working somewhat as expected with the regex `string re = "p=(([^&;#]+|(&.*?;))*)&?";` and the following code snippet: `boost::regex re(strOutput, boost::regex::perl);//,boost::regex::perl | boost::regex::icase); boost::sregex_iterator res(strInput.begin(),strInput.end(),re); boost::sregex_iterator end; boost::smatch what; for (; res != end; ++res) cout << (*res)[0] << std::endl;`. The result is **p=testtest&john&**, but I'm not sure why **"p="** and **"&"** are still picked up. – Shashi Sep 11 '13 at 06:33
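
The leftover **"p="** and **"&"** in the last comment would be consistent with printing sub-match 0, which covers everything the pattern matched, including the literal `p=` and the optional trailing `&`. The following is a sketch of one way to return only the term. It assumes `std::regex` instead of `boost::regex` (so it compiles without Boost), a hypothetical function name `extractSearchTerm`, and a stricter entity pattern (`&#?\w+;`) than the `&.*?;` used in the question, so that a plain `&` separator followed later by some `;` is not mistaken for an entity.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper (not from the original post): returns the value of the
// `p` parameter, or an empty string when the parameter is absent.
std::string extractSearchTerm(const std::string& query) {
    // (?:^|[?&])  : `p=` must start the string or follow `?` or `&`
    // group 1     : ordinary characters, or whole entities such as &amp; or &#38;
    static const std::regex re(R"((?:^|[?&])p=((?:[^&;#]|&#?\w+;)*))");
    std::smatch m;
    if (!std::regex_search(query, m, re)) {
        return "";
    }
    return m[1].str();  // capture group 1 only, without `p=` or the next `&`
}
```

For example, `extractSearchTerm("p=test&amp;test&sort=price")` yields `test&amp;test`. The group stops at `&sort` because `&sort=` is not followed by a `;` in entity position.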
