The logging facility for our C++ project is about to be refactored to use repeated left-shift operators (in the manner of Qt's qDebug() syntax) instead of printf-style variadic functions.

Suppose the logging object is called logger. Let's say we want to show the ip and port of the server we connected to. In the current implementation, the usage is:

logger.logf("connected to %s:%d", ip, port);

After the refactor, the above call would become:

logger() << "connected to " << ip << ":" << port;

Manually replacing all these calls would be extremely tedious and error-prone, so naturally, I want to use a regex. As a first pass, I could replace the .logf(...) call, yielding

logger() "connected to %s:%d", ip, port;

However, reformatting this string to the left-shift syntax is where I have trouble. I managed to create the separate regexes for capturing printf placeholders and comma-delimited arguments. However, I don't know how to properly correlate the two.

In order to avoid repetition of the fairly unwieldy regexes, I will use the placeholder (printf) to refer to the printf placeholder regex (returning the named group token), and (args) to refer to the comma-delimited arguments regex (returning the named group arg). Below, I will give the outputs of various attempts applied to the relevant part of the above line, i.e.:

"connected to %s:%d", ip, port
  • /(printf)(args)/g produces no match.

  • /(printf)*(args)/g produces two matches, containing ip and port in the named group arg (but nothing in token).

  • /(printf)(args)*/g achieves the opposite result: it produces two matches, containing %s and %d in the named group token, but nothing in arg.

  • /(printf)*(args)*/g returns 3 matches: the first two contain %s and %d in token, the third contains port in arg. However, regex101 reports "20 matches - 207 steps" and seems to match before every character.

  • I figured that perhaps I need to specify that the first capturing group is always between double quotes. However, neither /"(printf)"(args)/g nor /"(printf)(args)/g produces any matches.

  • /(printf)"(args)/g produces one (incorrect) match, containing %d in group token and ip in arg, and substitution consumes everything between those two strings (so entering # as the substitution string results in "connected to %s:#, port). Obviously, this is not the desired outcome, but it's the only version where I could at least get both named groups in a single match.
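For illustration, the pairing these attempts are aiming for is easier to express as two separate steps than as one pattern: split the format string at the placeholders, split the argument list at the commas, then interleave the two. The Python sketch below assumes simplified stand-ins for (printf) and (args): `%[a-z]` placeholders and plain identifier arguments. The names `PLACEHOLDER`, `CALL`, and `to_stream` are hypothetical, and the sketch does not handle escaped quotes or `*` directives.

```python
import re

# Simplified stand-ins for the (printf) and (args) regexes (assumptions).
PLACEHOLDER = re.compile(r"%[a-z]")
CALL = re.compile(r'"(?P<fmt>[^"]*)"(?P<args>(?:\s*,\s*\w+)*)')


def to_stream(fragment: str) -> str:
    """Turn '"fmt %s", a' into 'logger() << "fmt " << a'."""
    m = CALL.fullmatch(fragment.strip())
    if m is None:
        raise ValueError(f"unsupported fragment: {fragment!r}")

    # Step 1: the arguments, in order.
    args = [a.strip() for a in m.group("args").split(",") if a.strip()]
    # Step 2: the literal text between placeholders (always one more piece
    # than there are placeholders).
    literals = PLACEHOLDER.split(m.group("fmt"))
    if len(literals) - 1 != len(args):
        raise ValueError("placeholder/argument count mismatch")

    # Step 3: interleave literal, arg, literal, arg, ..., literal.
    parts = []
    for literal, arg in zip(literals, args + [None]):
        if literal:  # skip empty literals so no << "" is emitted
            parts.append(f'"{literal}"')
        if arg is not None:
            parts.append(arg)
    return "logger() << " + " << ".join(parts)


print(to_stream('"connected to %s:%d", ip, port'))
```

The correlation happens in ordinary code (the `zip`), which is why no single /g pattern needs to carry both token and arg in the same match.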

Any help is greatly appreciated.

  • I don't believe that a simple regex can handle all possibilities here. If I faced such a task, I'd spend some time and knock off a Perl script to sift through the code and transmogrify it, appropriately. – Sam Varshavchik Jul 08 '16 at 00:14
  • It is simply not possible to do this with a regex, at least as defined in computer science. – user253751 Jul 08 '16 at 00:20
  • Consider that the following is a valid construct as far as `printf` style is concerned: `logger.logf("connected to %.*s:%-4d", 16, ip, port);`. – dxiv Jul 08 '16 at 00:28
  • @engineer14 \[*replying to a just deleted comment, yet the point is still valid*] It's not just `extra formatting`. For example `%.*s` is a common way to `printf` strings that are not nul-terminated (or, to be pedantic, *char arrays*). Ignoring the `precision` specifier changes not just the formatting, but actually the very semantics in those cases. – dxiv Jul 08 '16 at 01:19
  • Doing this entirely with regexes and getting it all correct is extremely difficult. Even quoted strings with no interpolations are challenging. `logger.logf("a" "b" "\"");` It's probably easier to write a little char-by-char translator (e.g. in C++) than to get the regexes right. – Gene Jul 08 '16 at 01:23
  • @dxiv I actually tried (read: struggled) to get the `*` placeholders sensibly implemented in the regex until I realized that I couldn't recall a single instance where they were used. A quick grep through the codebase confirmed this: there appear to be no `*` formatting directives used anywhere. Incidentally, this means that the number of printf placeholders must match the number of arguments exactly, which should make the solution (whatever it might be) slightly easier. –  Jul 08 '16 at 07:21
  • @carouselambra any feedback on my answer? – Thomas Ayoub Jul 12 '16 at 14:16
  • sorry, I meant to post the Python script I eventually used, but real life intervened, will do it soon –  Aug 04 '16 at 07:24
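
The two checks raised in the comments (whether any `*` width or precision is used, and whether the number of placeholders matches the number of arguments) can be sketched in Python before attempting any rewrite. This is not the script mentioned above; `SPEC`, `count_needed_args`, and `uses_star` are hypothetical names, and the conversion-spec pattern is a simplified assumption that covers common flags, width, precision, and length modifiers only.

```python
import re

# Either a literal "%%" or a simplified conversion spec:
# % [flags] [width] [.precision] [length] type
SPEC = re.compile(
    r"%(?:%|[-+ #0]*(?P<width>\*|\d+)?(?:\.(?P<prec>\*|\d+))?"
    r"(?:hh|h|ll|l|z)?[a-zA-Z])"
)


def count_needed_args(fmt: str) -> int:
    """Number of arguments a printf-style format string consumes."""
    needed = 0
    for m in SPEC.finditer(fmt):
        if m.group(0) == "%%":
            continue  # literal percent sign, consumes nothing
        # One argument for the value, plus one per '*' width/precision.
        needed += 1 + (m.group("width") == "*") + (m.group("prec") == "*")
    return needed


def uses_star(fmt: str) -> bool:
    """True if any conversion takes its width or precision from an argument."""
    return any(
        m.group(0) != "%%" and "*" in m.group(0) for m in SPEC.finditer(fmt)
    )


print(count_needed_args("connected to %.*s:%-4d"), uses_star("connected to %.*s:%-4d"))
```

Any call where `uses_star` is true, or where `count_needed_args` disagrees with the actual argument count, is better flagged for manual conversion than rewritten automatically.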

1 Answer

Disclaimer: This is a workaround. It is far from perfect and may introduce errors. Be careful when you commit the changes and, if you can, have a colleague proofread the diff to reduce the chance of breakage.


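You can try a multi-step replacement, one pass per argument count, working from the maximum number of arguments used in your codebase down to the minimum (here I'll go from 3 to 0). The order matters: the patterns for fewer placeholders would otherwise also match calls that have more.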

Let's consider logger.logf("connected to %s:%d some %s random text", ip, port, test);

You can match this with the regex logger\.logf\("(.*?)(%[a-z])(.*?)(%[a-z])(.*?)(%[a-z])(.*?)",(.*?)(?:, (.*?))?(?:, (.*?))?\); which gives you the following groups:

1.  [75-88] `connected to `
2.  [88-90] `%s`
3.  [90-91] `:`
4.  [91-93] `%d`
5.  [93-99] ` some `
6.  [99-101] `%s`
7.  [101-113] ` random text`
8.  [115-118] ` ip`
9.  [120-124] `port`
10. [126-130] `test`

Replacing with logger() << "\1" << \8 << "\3" << \9 << "\5" << \10 << "\7"; gives you

logger() << "connected to " << ip << ":" << port << " some " << test << " random text";
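
As a rough check, the same three-placeholder pass can be run in Python. This is a sketch with two assumptions of mine: `\s*` is added after the first comma so that group 8 is captured without its leading space, and `\g<N>` is used in the replacement template to keep group 10 unambiguous.

```python
import re

# Three-placeholder pattern from above, with "\s*" added after the first comma.
pattern = re.compile(
    r'logger\.logf\("(.*?)(%[a-z])(.*?)(%[a-z])(.*?)(%[a-z])(.*?)",'
    r"\s*(.*?)(?:, (.*?))?(?:, (.*?))?\);"
)

# Literal pieces are groups 1, 3, 5, 7; arguments are groups 8, 9, 10.
replacement = (
    r'logger() << "\g<1>" << \g<8> << "\g<3>" << \g<9>'
    r' << "\g<5>" << \g<10> << "\g<7>";'
)

source = 'logger.logf("connected to %s:%d some %s random text", ip, port, test);'
result = pattern.sub(replacement, source)
print(result)
```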


Now the step with 2 arguments. The example string is logger.logf("connected to %s:%d some random text", ip, port); and the corresponding regex is logger\.logf\("(.*?)(%[a-z])(.*?)(%[a-z])(.*?)",(.*?)(?:, (.*?))?\);

The matching is the following:

1.  [13-26] `connected to `
2.  [26-28] `%s`
3.  [28-29] `:`
4.  [29-31] `%d`
5.  [31-48] ` some random text`
6.  [50-53] ` ip`
7.  [55-59] `port`

And the replace string: logger() << "\1" << \6 << "\3" << \7 << "\5"; outputs:

logger() << "connected to " << ip << ":" << port << " some random text";


Finally, the step with 1 argument.

Input: logger.logf("Some %s text", port);

Regex: logger\.logf\("(.*?)(%[a-z])(.*?)",(.*?)\);

Replacement: logger() << "\1" << \4 << "\3"; which gives:

logger() << "Some " << port << " text";
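
Rather than writing each pattern by hand, the passes can be generated. The Python sketch below is one possible way to do that, not a tested migration tool: `build_pass` and `convert` are hypothetical names, each pass requires exactly n comma-separated arguments (a simplification of the optional groups above), and calls with more placeholders than `max_placeholders` are not handled correctly.

```python
import re


def build_pass(n: int):
    """Return (pattern, replacement) for logf calls with n placeholders."""
    fmt = "(.*?)" + "(%[a-z])(.*?)" * n   # literal groups: 1, 3, ..., 2n+1
    args = r",\s*(.*?)" * n               # argument groups: 2n+2 .. 3n+1
    pattern = re.compile(r'logger\.logf\("' + fmt + '"' + args + r"\);")

    pieces = [r'logger() << "\g<1>"']
    for i in range(n):
        arg_group = 2 * n + 2 + i
        literal_group = 2 * i + 3
        pieces.append(rf' << \g<{arg_group}> << "\g<{literal_group}>"')
    return pattern, "".join(pieces) + ";"


def convert(source: str, max_placeholders: int = 3) -> str:
    # Highest count first: a pattern for fewer placeholders would also
    # match (and mangle) a call that has more of them.
    for n in range(max_placeholders, -1, -1):
        pattern, replacement = build_pass(n)
        source = pattern.sub(replacement, source)
    return source


code = "\n".join([
    'logger.logf("connected to %s:%d some %s random text", ip, port, test);',
    'logger.logf("connected to %s:%d some random text", ip, port);',
    'logger.logf("Some %s text", port);',
    'logger.logf("started");',
])
print(convert(code))
```

Once a call has been rewritten it no longer contains `logger.logf(`, so later passes leave it alone.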


What about empty groups?

Let's say input is not logger.logf("Some %s text", port); but logger.logf("Some %s", port);. The output will then be:

logger() << "Some " << port << "";

You'll have to remove << "" to get something clean.
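
That cleanup can be a final pass of its own. A minimal Python sketch, assuming an empty literal is only dropped when it is directly followed by another `<<` or by the terminating `;` (the name `strip_empty_literals` is hypothetical):

```python
import re

# An empty string literal operand, followed by the next << or the final ;
EMPTY_LITERAL = re.compile(r'\s*<<\s*""(?=\s*(?:<<|;))')


def strip_empty_literals(line: str) -> str:
    """Remove << "" operands left behind by empty capture groups."""
    return EMPTY_LITERAL.sub("", line)


print(strip_empty_literals('logger() << "Some " << port << "";'))
```

The same pass also covers a format string that starts with a placeholder, where the empty literal ends up at the front of the chain.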

Thomas Ayoub