7

I'm currently building a bit of HTTP handling into a C program (compiled using glibc on Linux), which will sit behind an nginx instance, and figured I should be safe deferring argument tokenization to sscanf in this scenario.

I was very pleased to find that extracting the query out of the URI was straightforward:

char *path = "/events?a=1&b=2&c=3";

char query[64] = {0};

sscanf(path, "%*[^?]?%64s HTTP", query); // query = "a=1&b=2&c=3"

but I was surprised how quickly things became i͏̠͚̣̗̲n͓̭̞̹t͈e҉̝̟̘̺r͈e̫st̩̟̠i͏͈͇n͏̠͍g̞͝   :(

int pos = -1;
char arg[32] = {0}, value[32] = {0};

int c = sscanf(query, "%32[^=]=%32[^&]&%n", &arg, &value, &pos);

For an input of a=1&b=2, I get arg="a", value="1", c=2, pos=4. Perfect: I can now rerun sscanf on path + pos to get the next argument. Why am I here?

Well, while a=1& behaves identically to the above, a=1 produces arg="a", value="1", c=2, and pos=-1. What do I make of this?

Scrambling for the documentation, I read that

       n      Nothing  is expected; instead, the number of characters consumed
              thus far from the input is  stored  through  the  next  pointer,
              which  must  be  a pointer to int.  This is not a conversion and
              does not increase the count returned by the function.   The  as‐
              signment  can  be  suppressed  with the * assignment-suppression
              character, but the effect on  the  return  value  is  undefined.
              Therefore %*n conversions should not be used.

where more than 50% of the paragraph refers to bookkeeping minutiae. The behavior I am seeing is not discussed.

Wandering around Google search results I quickly reached for Wikipedia's entry for Scanf_format_string (which was the top hit), but, uh...

format specification is empty as of June 2020 Oookay... I feel like I'm in the tumbleweeds here using a feature nobody really looks at. That doesn't inspire my remaining confidence.

Taking a look at what appears to be where %n is implemented in vfscanf-internal.c, I find that 60% of the code (lines) involves discussion regarding standards inconsistencies, 39.6% is implementation minutiae, and 0.4% is actual code (which consists in its entirety of "done++;").

It *appears* that glibc's behavior is to leave the internal value done (which I access using %n) untouched - or rather, undefined - unless some operation alters it. It also appears that using %n in this way was unforeseen and that I'm completely in "here be dragons" territory? :(

I don't think I'm going to be using scanf...

For the sake of completeness, here's something that wraps up what I'm seeing.

#include <stdio.h>

void test(const char *str) {
  int pos = -1;
  char arg[32] = {0}, value[32] = {0};
  int c = sscanf(str, "%32[^=]=%32[^&]&%n", (char *)&arg, (char *)&value, &pos);
  printf("\"%s\": c=%d arg=\"%s\" value=\"%s\" pos=%d\n", str, c, arg, value, pos);
}

int main() {
  test("a=1&b=2"); // "a=1&b=2": c=2 arg="a" value="1" pos=4
  test("a=1&");    // "a=1&": c=2 arg="a" value="1" pos=4
  test("a=1");     // "a=1": c=2 arg="a" value="1" pos=-1
}
i336_
  • 1,813
  • 1
  • 20
  • 41
  • 1
    Off by 1 `sscanf(path, "%*[^?]?%64s HTTP", query);` --> `sscanf(path, "%*[^?]?%63s HTTP", query);` Others likewise – chux - Reinstate Monica Jun 29 '20 at 02:44
  • 1
    `(char *)&arg, (char *)&value` better as `arg, value`, – chux - Reinstate Monica Jun 29 '20 at 02:47
  • 1
    With `"a=1"` the first two fields are scanned and assigned, then `&` does not match the input, and so the format string is not parsed farther, which is consistent with the observed `c=2`, `pos=-1`. – dxiv Jun 29 '20 at 02:48
  • The "actual code" you're looking for in the vfscanf source is lines [655](https://sourceware.org/git?p=glibc.git;a=blob;f=stdio-common/vfscanf-internal.c;h=95b46dcbeb55b1724b396f02a940f3047259b926;hb=HEAD#l655) to 664, e.g `* ARG(int *) = read_in`. The `ARG(int *)` here is a macro wrapped around `va_arg` which just retrieves the argument you passed, namely `&pos`. So it does just what it should; dereferences your pointer and assigns the value of the counter `read_in` there. – Nate Eldredge Jun 29 '20 at 03:27
  • 1
    But the `%n` code is all irrelevant to the case at hand, because the point is that the previous failure of the `&` directive ensures that **this code never runs**. – Nate Eldredge Jun 29 '20 at 03:28
  • 1
    Also by the way, the discussion of "standard inconsistencies" at line 666 should now be obsolete; C99 and later fixed the example, making it clear that `sscanf("123", "%d%n%n%d", &d1, &n1, &n2, &d2)` should indeed return 1, which is what glibc continued doing all along. – Nate Eldredge Jun 29 '20 at 03:33
  • 1
    For the future, a better general reference than Wikipedia is [cppreference.com](https://en.cppreference.com/w/c/io/fscanf), which despite its name also covers C, and has a reputation for *usually* being correct while *somewhat* more readable than the language standard itself. Google definitely is a little too fond of Wikipedia. Though the thing about scanf returning on failure was already in your man page, I suspect, though probably not where you thought to look. – Nate Eldredge Jun 29 '20 at 03:43
  • @chux-ReinstateMonica: Thanks for the off-by-ones. I *thought* so! (...need to make double-checking a knee-jerk reaction...) – i336_ Jun 29 '20 at 04:16
  • @NateEldredge: Ah; my eyes glazed right over that and thought it was part of the housekeeping (saw bitwise ANDs and moved on). Woops. – i336_ Jun 29 '20 at 04:17
  • Thanks very much for the info and insights. – i336_ Jun 29 '20 at 04:18

1 Answers1

5

I think the C standard guarantees that the value of pos in your example remains unchanged.

C17 7.21.6.2 says, describing fscanf:

(4) The fscanf function executes each directive of the format in turn. When all directives have been executed, or if a directive fails (as detailed below), the function returns. Failures are described as input failures (due to the occurrence of an encoding error or the unavailability of input characters), or matching failures (due to inappropriate input).

[...]

(6) A directive that is an ordinary multibyte character is executed by reading the next characters of the stream. If any of those characters differ from the ones composing the directive,the directive fails and the differing and subsequent characters remain unread. Similarly, if end-of-file, an encoding error, or a read error prevents a character from being read, the directive fails.

("Multibyte character" here includes ordinary single-byte characters such as your &.)

So in your "a=1" example, the directives %32[^=], =, and %32[^&] all succeed, and now the end of the string has been reached. It's explained in 7.21.6.7 that for sscanf, "reaching the end of the string is equivalent to encountering end-of-file for the fscanf function." Hence no character can be read, so the & directive fails, and sscanf returns without doing anything further. The %n directive never executed, and so nothing happened that would have the right to modify the value of pos. Therefore it must have the same value it had before, namely -1.

I don't think this case was unforeseen; just that it's already covered by existing rules, and so nobody bothered to call it out explicitly.

Nate Eldredge
  • 48,811
  • 6
  • 54
  • 82
  • Thanks very much for this clarification, it's exactly the kind of disambiguation I was looking for but didn't know where to find it. – i336_ Jun 29 '20 at 04:00
  • I was curious if this specification was recent (per your referencing C17); I found that (what I believe is) C89 appears to express identical intent (for my purposes): "*A directive that is an ordinary multibyte character is executed by reading the next characters of the stream. If one of the characters differs from one comprising the directive, the directive fails, and the differing and subsequent characters remain unread.*" (page 145 of `ansi-iso-9899-1990-1.pdf`) – i336_ Jun 29 '20 at 04:13
  • 1
    @i336_: Yeah, I don't have any reason to think it's changed at all. I just picked C17 because it *is* the most current and because I had the file handy. – Nate Eldredge Jun 29 '20 at 14:17