
A recently discovered explanation for GTA Online's lengthy load times(1) showed that many implementations of sscanf() call strlen() on their input string to set up a context object for an internal routine shared with the other scanning functions (scanf(), fscanf()...). This becomes a performance bottleneck when the input string is very long. In the case at hand, parsing a 10 MB JSON file loaded as a string, with repeated calls to sscanf() using an offset and a %n conversion, proved to be a dominant cause of the load time.
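
For illustration, here is a minimal sketch of that pattern (the function and variable names are hypothetical, not taken from the actual game code): each call to sscanf() may internally call strlen() on the remaining text, so parsing N numbers out of a long buffer degenerates to O(N * length):

#include <stdio.h>

static size_t parse_all(const char *json, long long *out, size_t max) {
    size_t offset = 0, count = 0;
    long long value;
    int consumed;
    while (count < max &&
           sscanf(json + offset, "%lld%n", &value, &consumed) == 1) {
        out[count++] = value;
        offset += (size_t)consumed;  /* %n records how many bytes were matched */
    }
    return count;
}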

My question is: should sscanf() even read the input string beyond the bytes necessary for the conversions to complete? For example, does the following code invoke undefined behavior?

int test(void) {
    char buf[1] = { '1' };   /* deliberately not null terminated */
    int v;
    sscanf(buf, "%1d", &v);
    return v;
}

The function should return 1 and does not need to read more than one byte from buf, but is sscanf() allowed to read from buf beyond the first byte?
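
If the answer turns out to be yes, a defensive caller can always provide the terminator itself. Below is a minimal sketch (test_safe and its error value are made up for illustration) that copies the bytes into a buffer guaranteed to be null terminated before handing them to sscanf():

#include <stdio.h>
#include <string.h>

int test_safe(void) {
    char buf[1] = { '1' };            /* original, not null terminated */
    char tmp[sizeof buf + 1];
    memcpy(tmp, buf, sizeof buf);
    tmp[sizeof buf] = '\0';           /* guarantee a well formed string */
    int v = 0;
    if (sscanf(tmp, "%1d", &v) != 1)
        return -1;                    /* arbitrary error value for the sketch */
    return v;
}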


(1) references provided by JdeBP:
https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times-by-70/
https://news.ycombinator.com/item?id=26297612
https://github.com/biojppm/rapidyaml/issues/40

chqrlie
  • FWIW, the reason for calling `strlen` first instead of "just" waiting for `\0` during the normal parse is that if you piggyback `sscanf` on top of the normal machinery for the rest of the `*scanf` family, the obvious implementation uses a pseudo `FILE` object built on the scanned string. And a `FILE` object typically includes a count, and a way to return `EOF`, which the rest of the code is expecting. (In other words, it would be a more involved rework to look for `\0` instead of `EOF`, or to lazily turn `\0` into `EOF`.) – Steve Summit Mar 01 '21 at 22:25
  • @SteveSummit: I am well aware of this fact and it is not *involved rework* to special case `'\0'` in the common code. I have done just that in my own implementations. `'\0'` cannot be part of a number, so no change is needed in the number parsers. The other parsers are simple to adapt. Calling `strlen()` on an arbitrarily long string is just unacceptable. – chqrlie Mar 01 '21 at 22:30
  • @chqrlie Deepest apologies for wasting your time; I explicitly prefaced my remark with "FWIW" anticipating that it wouldn't be interesting to all. – Steve Summit Mar 01 '21 at 22:32
  • "Does sscanf require a null terminated string as input?" and "should `sscanf()` even read the input string beyond the bytes necessary for the conversions to complete?" are questions that may have different answers. IMO it's perfectly fine that `sscanf` requires a null-terminated string, but seeking to the end of an arbitrarily long string before consuming any bytes is still a QoI problem. – trent Mar 08 '21 at 18:27

2 Answers


Here are the relevant parts from the C Standard:

7.21.6.7 The sscanf function

Synopsis

#include <stdio.h>
int sscanf(const char * restrict s, const char * restrict format, ...);

Description
The sscanf function is equivalent to fscanf, except that input is obtained from a string (specified by the argument s) rather than from a stream. Reaching the end of the string is equivalent to encountering end-of-file for the fscanf function. If copying takes place between objects that overlap, the behavior is undefined.

Returns
The sscanf function returns the value of the macro EOF if an input failure occurs before the first conversion (if any) has completed. Otherwise, the sscanf function returns the number of input items assigned, which can be fewer than provided for, or even zero, in the event of an early matching failure.

The input is specifically referred to as a string, so it should be null terminated.

Although none of the characters in the string beyond the initial prefix that matches the conversion specifier (plus at most one extra byte needed to determine the end of the matching sequence) are used for the conversion, these characters must still be followed by a null terminator so that the input is a well formed string, and it is conforming for the implementation to call strlen() on it to determine the input length.

To avoid linear time complexity on long input strings, sscanf() should limit the scan for the end of the string to a small size with strnlen() or equivalent and pass an appropriate refill function that scans further chunks on demand. Passing a huge length and letting the internal routine special-case the null byte is an even better approach.
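
For what it's worth, a caller can approximate this chunked approach from user space. The following is only a sketch, assuming glibc's non-standard fopencookie() (the struct and function names are made up): the read callback uses strnlen(), so the string is measured at most one stdio buffer at a time instead of up front:

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

struct strcursor { const char *p; };     /* current position in the string */

/* Read callback: copy at most `size` bytes, stopping at the null byte.
   strnlen() bounds the scan to one stdio buffer-full per call. */
static ssize_t cursor_read(void *cookie, char *buf, size_t size) {
    struct strcursor *c = cookie;
    size_t n = strnlen(c->p, size);
    memcpy(buf, c->p, n);
    c->p += n;
    return (ssize_t)n;                   /* 0 at the terminator acts as EOF */
}

/* Scan a leading integer out of `s` without an O(strlen(s)) setup cost. */
static int scan_leading_int(const char *s, int *out) {
    struct strcursor c = { s };
    cookie_io_functions_t io = { .read = cursor_read };
    FILE *f = fopencookie(&c, "r", io);
    if (!f)
        return EOF;
    int rc = fscanf(f, "%d", out);
    fclose(f);
    return rc;
}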

In the meantime, programmers should avoid passing long input strings to sscanf() and use more specialized functions for their parsing tasks, such as strtol(), which also requires a well formed C string but is implemented in a more conservative way. This would also avoid potential undefined behavior on number conversions for out-of-range string representations.
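
As a sketch of that approach (parse_longs is a made-up helper name), here is how a long list of integers can be parsed with strtol() in a single left-to-right pass, using the end pointer instead of %n and getting well-defined overflow reporting through errno:

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

static size_t parse_longs(const char *s, long *out, size_t max) {
    size_t count = 0;
    while (count < max) {
        char *end;
        errno = 0;
        long v = strtol(s, &end, 10);
        if (end == s)              /* no digits found: stop */
            break;
        if (errno == ERANGE)       /* out-of-range input is reported, not UB */
            break;                 /* handle overflow as appropriate */
        out[count++] = v;
        s = end;                   /* resume right after the parsed number */
    }
    return count;
}

int main(void) {
    long v[4];
    size_t n = parse_longs(" 12 345 -6 789123", v, 4);
    for (size_t i = 0; i < n; i++)
        printf("%ld\n", v[i]);
    return 0;
}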

chqrlie

When the Standard was written, many library functions were handled identically by almost all existing implementations, but some implementations may have had good reasons for handling a few cases differently. If the number of implementations that would have reason to differ from the commonplace behavior was substantial, then the Committee would either require that all implementations behave in the common fashion (as happens when e.g. computing UINT_MAX+1u), or explicitly state that they were not required to do so (as when e.g. computing INT_MAX+1). In cases where there was a clear common behavior, but it might not be practical on all implementations, however, the Committee generally simply refrained from saying anything, on the presumption that most compilers would have no reason to deviate from the common behavior, and the authors of those that would have reason to deviate would be better placed than the Committee to judge the pros and cons of following the common behavior versus deviating from it.

The sscanf behavior at issue fits the latter pattern. The Committee didn't want to mandate that implementations which would have trouble if the data source didn't have a trailing zero byte must be changed to deal with such data sources, but nor did they want to require that programmers copy data from sources that don't have trailing zero bytes to places that do before using sscanf upon it, even when their implementation wouldn't care about anything beyond the portion of the source that would be meaningfully examined. Since makers of implementations that require a trailing zero will likely block any change to the Standard that would require them to tolerate its absence, and programmers whose implementations impose no such needless requirement will block any change to the Standard that would require them to add extra data-copying steps to their code, the situation will remain deadlocked unless people can agree to categorize implementations that impose the trailing-byte requirement as "conforming but deficient" and require that they indicate such deficiency via a predefined macro or other such means.

supercat