On a project I work on, we recently ran into a situation where we need to check whether two strings contain equivalent printf-style format specifiers (for validating translations).

/* A simple example: */
str = "%.200sSOMETEXT%.5fSOMEMORETEXT%d%ul%.*s%%";

/* Should be able to be validated to be the equivalent of: */
str = "%.200sBLAHBLAH%.5ftest%d%ul%.*s%%MORETEXT";

/* and... */
str = "%.200s%.5f%d%ul%.*s%%";

/* but not... */
str = "%.5f%.200s%d%ul%%%.*s";

So my question is:

Is there a way to validate that two strings have equivalent string formatting?

Perhaps the answer is a very good regular expression, an existing tool, or some example code from another project. I can't imagine we're the first project to run into this problem.

tshepang
ideasman42
  • By intuition I'd try to reverse a `printf()` by using a `scanf()`, although there are certainly transformations possible using `printf()` which aren't 1:1. – alk Oct 17 '13 at 07:53
  • Using a regex seems to be the right tool for the job, especially if there are many different format strings to test for. – user694733 Oct 17 '13 at 08:07

2 Answers

Interesting problem.

I would try to implement a function that strips the non-formatting characters from a formatting string, thus leaving only the format specifiers. That should then, hopefully, be canonical enough to be compared.

Perhaps you'd need to further strip things like field widths, and (if you support it) argument indexes since those will differ for different translations.
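
As a rough illustration of what that extra stripping could look like for a single specifier, here is a minimal Python sketch. The names `_PARTS` and `canonical` are hypothetical, and the pattern assumes POSIX-style `N$` argument indexes. Note that a `*` width or precision is kept, since it consumes an argument of its own.

```python
import re

# Hypothetical pattern: optional "N$" argument index, flags, width,
# precision, length modifier, conversion character.
_PARTS = re.compile(
    r"%(?:(?P<index>[0-9]+)\$)?[-+#0 ]*(?P<width>\*|[0-9]+)?"
    r"(?:\.(?P<prec>\*|[0-9]+))?(?P<length>hh|ll|[hljztL])?"
    r"(?P<conv>[diuoxXfFeEgGaAcspn])$"
)

def canonical(spec):
    """Reduce one specifier to (argument index or None, canonical text)."""
    m = _PARTS.match(spec)
    if m is None:
        raise ValueError("not a printf specifier: %r" % (spec,))
    index = int(m.group("index")) if m.group("index") else None
    # Numeric widths/precisions are dropped; "*" is kept because it
    # takes an extra int argument from the argument list.
    width = "*" if m.group("width") == "*" else ""
    prec = ".*" if m.group("prec") == "*" else ""
    text = "%" + width + prec + (m.group("length") or "") + m.group("conv")
    return index, text
```

Whether dropping numeric widths and precisions is acceptable depends on the project; for `%s` a different precision can truncate a translated string differently.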

It shouldn't be very hard to come up with the stripping function, since format specifiers are pretty simple. Drop characters until you find a %, then check the following character: if it's another %, drop both; otherwise copy characters until you reach one of the "final" conversion characters (d, f, s, u and so on).
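
The scan described above could be sketched roughly like this in Python (the project's checks may well live in another language). The names `CONVERSIONS` and `strip_to_specifiers` are made up for this example, and the set of conversion characters is an assumption based on C99.

```python
# Conversion characters that end a printf specifier (C99 set; an
# assumption about which conversions the project actually uses).
CONVERSIONS = "diuoxXfFeEgGaAcspn"

def strip_to_specifiers(fmt):
    """Return only the format specifiers of fmt, concatenated in order."""
    out = []
    i = 0
    while i < len(fmt):
        if fmt[i] != "%":
            i += 1      # plain text: drop it
            continue
        if fmt[i + 1:i + 2] == "%":
            i += 2      # "%%" is a literal percent sign: drop both
            continue
        # Copy everything up to and including the conversion character.
        j = i + 1
        while j < len(fmt) and fmt[j] not in CONVERSIONS:
            j += 1
        if j == len(fmt):
            break       # dangling "%" with no conversion: ignore it
        out.append(fmt[i:j + 1])
        i = j + 1
    return "".join(out)
```

With the strings from the question, the first three all reduce to `%.200s%.5f%d%u%.*s`, while the fourth reduces to `%.5f%.200s%d%u%.*s`, so a plain string comparison separates them. This naive scan does not validate what sits between the `%` and the conversion character.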

unwind

Just as a follow-up/clarification: our use case is validating translations (po files), since printf mismatches between the original string and the translated one can lead to nasty crashes…

Currently I’m using this regex (Python code, as we handle this in Python), which is a basic representation of printf syntax:

>>> import re
>>> # (?<!%)(?:%%)* skips escaped "%%" pairs; the group captures one specifier.
>>> # "hh"/"ll" are tried before single-letter length modifiers so "%lld" matches fully.
>>> _format = re.compile(r"(?<!%)(?:%%)*(%[-+#0]?(?:\*|[0-9]+)?(?:\.(?:\*|[0-9]+))?(?:hh|ll|[hljztL])?[diuoxXfFeEgGaAcspn])").findall
>>> _format("%.200sSOMETEXT%.5fSOMEMORETEXT%d%ul%.*s%%")
['%.200s', '%.5f', '%d', '%u', '%.*s']
>>> _format("%.200sBLAHBLAH%.5ftest%d%ul%.*s%%MORETEXT")
['%.200s', '%.5f', '%d', '%u', '%.*s']
>>> _format("%.200s%.5f%d%ul%.*s%%")
['%.200s', '%.5f', '%d', '%u', '%.*s']

So simply comparing the returned lists tells us whether those strings are printf-compatible or not.

This probably does not address all possible corner cases, but it works pretty well…

mont29