I'm currently building a bit of HTTP handling into a C program (compiled against glibc on Linux) that will sit behind an nginx instance, so I figured I should be safe deferring argument tokenization to sscanf in this scenario.
I was very pleased to find that extracting the query out of the URI was straightforward:
const char *path = "/events?a=1&b=2&c=3";
char query[64] = {0};
// Width is 63, not 64: %s also writes a terminating NUL into the 64-byte buffer.
sscanf(path, "%*[^?]?%63s HTTP", query); // query = "a=1&b=2&c=3"
but I was surprised how quickly things became i͏̠͚̣̗̲n͓̭̞̹t͈e҉̝̟̘̺r͈e̫st̩̟̠i͏͈͇n͏̠͍g̞͝ :(
int pos = -1;
char arg[32] = {0}, value[32] = {0};
// Widths are 31 to leave room for the NUL; arrays are passed directly, not by address.
int c = sscanf(query, "%31[^=]=%31[^&]&%n", arg, value, &pos);
For an input of a=1&b=2, I get arg="a", value="1", c=2, pos=4. Perfect: I can now rerun sscanf on query + pos to get the next argument. Why am I here?
Well, while a=1& behaves identically to the above, a=1 produces arg="a", value="1", c=2, and pos=-1. What do I make of this?
Scrambling for the documentation, I read that
n Nothing is expected; instead, the number of characters consumed
thus far from the input is stored through the next pointer,
which must be a pointer to int. This is not a conversion and
does not increase the count returned by the function. The
assignment can be suppressed with the * assignment-suppression
character, but the effect on the return value is undefined.
Therefore %*n conversions should not be used.
where more than 50% of the paragraph refers to bookkeeping minutiae. The behavior I am seeing is not discussed.
Wandering around Google search results I quickly reached for Wikipedia's entry for Scanf_format_string (which was the top hit), but, uh...
Oookay... I feel like I'm in the tumbleweeds here using a feature nobody really looks at. That doesn't do much for my remaining confidence.
Taking a look at what appears to be where %n is implemented in vfscanf-internal.c, I find that 60% of the code (lines) involves discussion regarding standards inconsistencies, 39.6% is implementation minutiae, and 0.4% is actual code (which consists in its entirety of "done++;").
It *appears* that glibc's behavior is to leave the internal value done (which I access using %n) untouched - or rather, undefined - unless some operation alters it. It also appears that using %n in this way was unforeseen and that I'm completely in "here be dragons" territory? :(
I don't think I'm going to be using scanf...
For the sake of completeness, here's something that wraps up what I'm seeing.
#include <stdio.h>

void test(const char *str) {
    int pos = -1;
    char arg[32] = {0}, value[32] = {0};
    // Widths are 31 to leave room for the terminating NUL in each 32-byte buffer.
    int c = sscanf(str, "%31[^=]=%31[^&]&%n", arg, value, &pos);
    printf("\"%s\": c=%d arg=\"%s\" value=\"%s\" pos=%d\n", str, c, arg, value, pos);
}

int main(void) {
    test("a=1&b=2"); // "a=1&b=2": c=2 arg="a" value="1" pos=4
    test("a=1&");    // "a=1&": c=2 arg="a" value="1" pos=4
    test("a=1");     // "a=1": c=2 arg="a" value="1" pos=-1
    return 0;
}