
We recently discovered a line of code that was doing the equivalent of

bool should_escape_control_char(char ch) {
    return (ch < 0x20);  // control chars are 0x00 through 0x1F
}

This works if plain char is unsigned; but if plain char is signed, then this filter accidentally catches negative chars as well. (The ultimate effect was that a naïve JSON encoder was encoding "é" as "\u00c3\u00a9" because to the encoder, it looked like a pair of negative chars which were then individually encoded.)
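
For concreteness, here is a minimal sketch of how the bug plays out on a platform where plain char is signed (the function names mirror the snippet above; the string literal is assumed to be UTF-8 encoded, and the "fixed" variant is just one possible repair):

#include <cstdio>

// The original filter: with signed char, the byte 0xC3 becomes -61, which is < 0x20.
bool should_escape_control_char(char ch) {
    return (ch < 0x20);
}

// One possible fix: reinterpret the byte as unsigned before comparing.
bool should_escape_control_char_fixed(char ch) {
    return (unsigned char)ch < 0x20;
}

int main() {
    const char utf8_e_acute[] = "\xc3\xa9";  // UTF-8 encoding of "é"
    for (const char* p = utf8_e_acute; *p != '\0'; ++p) {
        std::printf("byte 0x%02x: buggy=%d fixed=%d\n",
                    (unsigned)(unsigned char)*p,
                    (int)should_escape_control_char(*p),
                    (int)should_escape_control_char_fixed(*p));
    }
}

On such a platform the buggy version reports both bytes as control characters, which is exactly what produced the "\u00c3\u00a9" double escaping.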

IMO, the original sin here is that we're comparing a plain char expression against an integer, in such a way that the result depends on the signedness of char. I wish the compiler had told us:

fantasy-warning: this comparison's result may depend on the signedness of plain char
    return (ch < 0x20);  // control chars are 0x00 through 0x1F
            ^~~~~~~~~
fantasy-note: cast the operand to silence this diagnostic
    return (ch < 0x20);  // control chars are 0x00 through 0x1F
            ~~
            (signed char)(ch)

I was surprised to discover that Clang gives no option to warn in this situation; and I don't see any warning option in GCC either.

  • Am I just not looking in the right place?
  • What tools / linters / static-analyzers exist that do warn in this situation?
Quuxplusone
  • If there were a compiler warning for this kind of situation, I believe it would emit more warnings than you could possibly imagine for completely legitimate code. This is a straight-up coding bug. – Eljay Feb 19 '21 at 17:03

2 Answers


Your code is not portable even if you changed it to

bool should_escape_control_char(unsigned char ch)

as you're still making assumptions about the character encoding on your platform. Use

int std::iscntrl( int ch );

instead, or the C equivalent depending on the language you're using.

Reference https://en.cppreference.com/w/cpp/string/byte/iscntrl

(The C version is reachable from that site).
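
As a sketch of how the filter from the question might look with that advice applied (the cast to unsigned char matters, because passing a negative char other than EOF to std::iscntrl is undefined behaviour):

#include <cctype>

bool should_escape_control_char(char ch) {
    // std::iscntrl expects a value representable as unsigned char (or EOF),
    // so convert the byte first; this also sidesteps the signedness issue.
    return std::iscntrl((unsigned char)ch) != 0;
}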

Bathsheba

The static analysers that I use didn't diagnose the original example. Writing unit tests and compiling them with both unsigned and signed char could help catch bugs like this at the automated-testing stage (see the sketch below).
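
For example, a small test like the following could be built twice, e.g. with GCC/Clang's -funsigned-char and -fsigned-char, so that both signedness assumptions are exercised (a sketch; should_escape_control_char is the function from the question):

#include <cassert>

bool should_escape_control_char(char ch);  // the function under test

int main() {
    assert(should_escape_control_char('\x00'));       // NUL is a control char
    assert(should_escape_control_char('\x1f'));       // last C0 control char
    assert(!should_escape_control_char(' '));         // space is not
    assert(!should_escape_control_char((char)0xc3));  // first byte of UTF-8 "é"
    // The last assertion fails in the -fsigned-char build, exposing the bug.
}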


When using unsigned numbers, it is safer to compare them with explicitly unsigned operands instead of letting a signed operand be implicitly converted. So, given that char was assumed to be unsigned:

bool should_escape_control_char(char ch) {
    return ch < 0x20u;  // control chars are 0x00 through 0x1F
//                  ^
}

In this case, if the assumption about the signedness of char turns out to be wrong, at least some compilers will warn when char is signed and warnings are enabled:

warning: comparison of integer expressions of different signedness: 'char' and 'unsigned int' [-Wsign-compare]

Instead of relying on magic numbers, it would be better to use std::iscntrl from the standard library:

#include <cctype>  // the locale-aware overload of std::iscntrl lives in <locale>

bool is_control_c0(unsigned char ch) {
    return std::iscntrl(ch
        // pass a std::locale here if not using the currently active one
    );
}

Note that a function accepting a single narrow char - i.e. a code unit - cannot match all control code points in UTF-8 because C1 control codes are encoded as two code units.
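
As an illustration of what a string-level check could look like (C1 controls U+0080 through U+009F are encoded in UTF-8 as 0xC2 followed by 0x80–0x9F; the function name is made up for this sketch):

#include <cstddef>
#include <string_view>

// Returns true if the UTF-8 input contains a C0 control (0x00-0x1F), DEL (0x7F),
// or a C1 control (U+0080-U+009F, encoded as 0xC2 followed by 0x80-0x9F).
bool contains_control_utf8(std::string_view utf8) {
    for (std::size_t i = 0; i < utf8.size(); ++i) {
        unsigned char b = utf8[i];
        if (b < 0x20 || b == 0x7f)
            return true;
        if (b == 0xc2 && i + 1 < utf8.size()) {
            unsigned char next = utf8[i + 1];
            if (next >= 0x80 && next <= 0x9f)
                return true;
        }
    }
    return false;
}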

eerorika
  • Given "a naïve JSON encoder was encoding `"é"` as `"\u00c3\u00a9"`" from the question, I'm not certain that all the problems in that "naïve JSON encoder" can be fully addressed with `iscntrl()`. I suspect there's a lot more issues, hence the desire for better compiler diagnostics. – Andrew Henle Feb 19 '21 at 16:55
  • " are technically encoding " -> "are encoding". I'm not sure there is anything "technically" about it. – Bathsheba Feb 19 '21 at 16:56
  • @Bathsheba "technically" is there because there are *many* encodings where that assumption applies, and *many* use cases where one doesn't need to care about encodings where it doesn't apply. – eerorika Feb 19 '21 at 16:59
  • I'm asking for tools that warn on the original (buggy) code, not tools that *would* warn if the code had been written differently. If I had a time machine to go back and tell the programmer to write it differently, I would just tell them to add the cast in the first place. – Quuxplusone Feb 19 '21 at 17:59
  • @Quuxplusone Like I said in the answer: Unit tests. I simply showed a way of writing programs that is safer when making assumptions about signedness. – eerorika Feb 19 '21 at 18:16
  • 1) you cannot use `iscntrl` for this because it is locale-specific. JSON is not about locales but Unicode code points. 2) the problem with `iscntrl` is still that of the signedness. But it is even harder to make the compiler complain about that. – Antti Haapala -- Слава Україні Feb 19 '21 at 18:37
  • If I had a time machine I'd attempt to convince IBM that EBCDIC was a poor choice. – Bathsheba Feb 19 '21 at 18:39
  • @AnttiHaapala You could use `std::iscntrl(uchar, std::locale("en_US.UTF8"))` for example. – eerorika Feb 19 '21 at 18:41
  • but that won't work quite properly because the *chars* are not necessarily utf-8 codepoints. – Antti Haapala -- Слава Україні Feb 19 '21 at 18:42
  • This is a good answer. Have an upvote. The fact that it doesn’t necessarily target the OP is not an issue. – Bathsheba Feb 19 '21 at 18:42
  • @AnttiHaapala True, it can only work for C0 control codes. C1 codes would be encoded in two code units that aren't individually control code points. A function that has to match all control codes in UTF-8 would need to deal with strings. – eerorika Feb 19 '21 at 18:52
  • hmm I am actually at a loss about how these C++ is-functions work. (unsigned char)0x85 => bad cast, signed char 0x85 => 0, wchar_t 0x85 => 1... – Antti Haapala -- Слава Україні Feb 19 '21 at 18:54