How to parse escape element '\' and unicode character '\u' using boost regex in C++

Question

I am parsing a text file using Boost.Regex in C++ and looking for '\' characters in it. The file also contains Unicode escapes that start with '\u'. Is there a way to tell a plain '\' apart from a '\u'? Here is the content of test.txt, the file I am parsing:

"ID": "\u01FE234DA - this is id ",
"speed": "96\/78",
"avg": "\u01FE234DA avg\83"

Here is my attempt:

#include <boost/regex.hpp>
#include <string>
#include <iostream>
#include <fstream>

using namespace std;
const int BUFSIZE = 500;

int main(int argc, char** argv) {

    if (argc < 2) {
        cout << "Pass the input file" << endl;
        return 1;  // exit() would need <cstdlib>; returning from main is enough
    }

   boost::regex re("\\\\+");
   string file(argv[1]);
   char buf[BUFSIZE];

   boost::regex uni("\\\\u+");


   ifstream in(file.c_str());
   while (in.getline(buf, BUFSIZE))  // loop on the read itself, not on eof()
   {
      if (boost::regex_search(buf, re))
      {
          cout << buf << endl;
          cout << "(\\) found" << endl;
          if (boost::regex_search(buf, uni)) {
              cout << buf << endl;
              cout << "unicode found" << endl;

          }

      }

   }
}

When I run the above code, it prints the following:

"ID": "\u01FE234DA - this is id ",
 (\) found
"ID": "\u01FE234DA - this is id ",
 unicode found
"speed": "96\/78",
 (\) found
"avg": "\u01FE234DA avg\83"
 (\) found
 "avg": "\u01FE234DA avg\83"
 unicode found

Instead, I want the following:

 "ID": "\u01FE234DA - this is id ",
 unicode found
"speed": "96\/78",
 (\) found
 "avg": "\u01FE234DA avg\83"
 (\) and unicode found

I think the code cannot distinguish '\' from '\u', but I am not sure what to change or where.

Your current code does **not** produce the output you show due to the commented-out statements. Also, what is wrong with running both checks? (It's a flawed method anyway. Probably better would be to not use regex and inspect one backslash at a time, going from first to last. "Let's use a regex - now you have two problems.") — Jongware, Apr 05 '16 at 21:59
If we keep this code as is, the "ID" field shows up twice, i.e. "ID" is reported both as unicode and as a (\) match — kkard, Apr 05 '16 at 22:07
But I bet it does not work on `\\\u123 testing` (well - give or take a few more backslashes). Is there a particular reason to do this with regexes? As I said, iterating over the backslashes ought to be simple, straightforward, and robust. — Jongware, Apr 05 '16 at 22:21
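
The character-by-character scan suggested in the comments above could be sketched roughly as follows. This is only an illustration: `EscapeCounts` and `scan_escapes` are hypothetical names that do not appear in the thread, and the sketch assumes that a backslash always escapes the single character after it, so that `\\u` counts as an escaped backslash followed by a plain `u`.

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type (not from the thread): how many escapes of each
// kind were seen in one line.
struct EscapeCounts {
    int unicode = 0;  // occurrences of "\u"
    int other = 0;    // any other backslash, including a trailing one
};

// Walk the line from first to last character, one backslash at a time.
EscapeCounts scan_escapes(const std::string& line) {
    EscapeCounts counts;
    for (std::size_t i = 0; i < line.size(); ++i) {
        if (line[i] != '\\') continue;
        if (i + 1 < line.size() && line[i + 1] == 'u') {
            ++counts.unicode;
        } else {
            ++counts.other;
        }
        // Skip the escaped character so that "\\u" is not misread as "\u".
        ++i;
    }
    return counts;
}
```

For the `\\\u123 testing` case mentioned above, this reports one plain backslash escape and one `\u`.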

Jerome Devost · Accepted Answer · 2016-04-06T13:03:20.333

Try using [^u] in your first regex to match any character that is not u.

boost::regex re("\\\\[^u]");  // matches \ not followed by u
boost::regex uni("\\\\u");  // matches \u
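
One caveat with [^u] is that it needs some character after the backslash, so a backslash at the very end of a line is missed. A minimal sketch of the same two-pattern idea with a negative lookahead instead follows. The helper names `has_plain_backslash` and `has_unicode_escape` are hypothetical, and `std::regex` is used only so the snippet builds without Boost; as far as I know, `boost::regex` accepts the same patterns in its default Perl syntax.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if the line holds a backslash that is NOT
// followed by 'u'. The lookahead (?!u) consumes nothing, so a backslash at
// the end of the line also counts.
bool has_plain_backslash(const std::string& line) {
    static const std::regex re("\\\\(?!u)");
    return std::regex_search(line, re);
}

// Hypothetical helper: true if the line holds a backslash followed by 'u'.
bool has_unicode_escape(const std::string& line) {
    static const std::regex uni("\\\\u");
    return std::regex_search(line, uni);
}
```

A line such as the "avg" one from the question satisfies both checks, so the caller can print one combined message instead of two separate ones.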

It's probably best to use a single regex, though:

boost::regex re("\\\\(u)?"); // matches \ with or without u

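Then check whether the capture group m[1] matched the 'u':

boost::cmatch m;  // match results for a const char* input such as buf
if (boost::regex_search(buf, m, re) && m[1].matched) {
    // unicode: the first backslash in the line is followed by 'u'
}
else {
    // not unicode, or no backslash in the line at all
}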
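
Note that regex_search reports only the first match, so a line holding both kinds of escape needs every match visited. The sketch below does that with a regex iterator. `classify` is a hypothetical helper, the returned messages are copied from the question's desired output, and `std::regex` stands in for `boost::regex` so that it compiles without Boost (`boost::sregex_iterator` should work the same way here). It does not treat a doubled backslash specially, so the `\\u` edge case raised in the comments is still reported as unicode.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not from the original answer): classify one line by
// looking at every backslash match, not only the first one.
std::string classify(const std::string& line) {
    // A backslash, optionally followed by a captured 'u'.
    static const std::regex re("\\\\(u)?");

    bool plain = false;    // saw a backslash not followed by 'u'
    bool unicode = false;  // saw "\u"

    const std::sregex_iterator end;
    for (std::sregex_iterator it(line.begin(), line.end(), re); it != end; ++it) {
        if ((*it)[1].matched) {
            unicode = true;
        } else {
            plain = true;
        }
    }

    if (plain && unicode) return "(\\) and unicode found";
    if (unicode) return "unicode found";
    if (plain) return "(\\) found";
    return "";  // no backslash in the line
}
```

Calling `classify(buf)` once per line and printing any non-empty result would give one message per line, as the question asks.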

It's better to use regexes for pattern matching. They seem more complex, but once you get used to them they are easier to maintain and less bug-prone than iterating over strings one character at a time.
