It is not clear to me how to write portable C code that uses the wide-character API. Consider this example:

#include <locale.h>
#include <wchar.h>
#include <wctype.h>
int main(void)
{
  setlocale(LC_CTYPE, "C.UTF-8");
  wchar_t wc = L'ÿ';
  if (iswlower(wc)) return 0;
  return 1;
}

Compiling it with gcc-6.3.0 and the -Wconversion option gives this warning:

test.c: In function 'main':
test.c:9:16: warning: conversion to 'wint_t {aka unsigned int}' from 'wchar_t {aka int}' may change the sign of the result [-Wsign-conversion]
if (iswlower(wc)) return 0;
             ^

To get rid of this warning, we can cast to wint_t, as in iswlower((wint_t)wc), but this is not portable. The following example demonstrates why.

#include <stdio.h>

/* this is our hypothetical implementation */
typedef signed int wint_t;
typedef signed short wchar_t;
#define WEOF ((wint_t)0xffffffff)

void f(wint_t wc)
{
    if (wc==WEOF)
      printf("BUG. Valid character recognized as WEOF. This is due to integer promotion. How to avoid it?\n");
}
int main(void)
{
    wchar_t wc = (wchar_t)0xffff;
    f((wint_t)wc);
    return 0;
}

My question is: how can I make this example portable and at the same time avoid the gcc warning?

Igor Liferenko
  • Why do you ask "How to avoid integer promotion in C?" in the question and then change it to "how to make this example portable, and at the same time avoid the gcc warning" in the content? The main question can't be solved, because integer promotion always occurs for types narrower than int – phuclv Mar 28 '17 at 05:50
  • This is similar to the situation as when `char` is signed and `(char)0xff` is promoted to an `int` that is equal to `EOF`. However, I'm not certain, but I think that the standard would require an implementation's `wchar_t` to either be larger than 16 bits or unsigned if `0xffff` were a valid code point in any of the supported locales. – Michael Burr Mar 28 '17 at 06:32
  • There is no problem here, the value of wchar_t wc is `-1` and the value of wint_t wc is `-1`. The explanation in the printf in the function f is not correct. The integer promotion (using sign extension) preserved the value. – 2501 Mar 28 '17 at 07:39
  • @2501 `WEOF` is `0xffffffff`, and `wc` is `0xffff`. They are different. The program does not recognize this. That is the problem. – Igor Liferenko Mar 28 '17 at 07:50
  • @LưuVĩnhPhúc Because the only way to achieve this is to avoid integer promotion (IMHO). And this is the question. – Igor Liferenko Mar 28 '17 at 07:54
  • @IgorLiferenko Representation is not the same as value. The same value can be represented differently. This is physically unavoidable due to different widths of types. You're basically arguing that the code doesn't preserve representations when you're operating with values. And this is true, it doesn't preserve representations, and there is no guarantee in C that it should. But it does preserve the value. – 2501 Mar 28 '17 at 07:58
  • @LưuVĩnhPhúc I just cannot think of a better formulation of the question. – Igor Liferenko Mar 28 '17 at 08:20

1 Answer

To keep things simple, I'm going to assume that the platform/implementation I'm discussing has the following characteristics:

  • two's complement integer types
  • int is 32 bits
  • short is 16 bits

I'm also going to use C99 as a reference just because it's what I have open.

The standard says the following must be true about these types/macros:

  • wint_t must be able to have at least one value that does not correspond to any member of the extended character set (7.24.1/2)
  • WEOF has a value that does not correspond to any member of the extended character set (7.24.1/3)
  • wchar_t can represent all values of the largest extended character set (7.17/2)

Keep in mind that by the C standard's definition of "value", the value of (short int) 0xffff is the same as the value of (int) 0xffffffff; that is, they both have the value -1 (given the assumptions stated at the beginning of this answer). This is made clear by the standard's description of the integer promotions (6.3.1.1):

If an int can represent all values of the original type, the value is converted to an int; otherwise, it is converted to an unsigned int. These are called the integer promotions. All other types are unchanged by the integer promotions.

The integer promotions preserve value including sign.
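As a rough illustration of the value/representation distinction, here is a minimal sketch. The helper names `widen` and `bits_of` are made up for this example, and it uses the fixed-width types from `<stdint.h>` so that the 16-bit and 32-bit assumptions above hold by construction.

```c
#include <stdint.h>

/* Hypothetical helper: widens a 16-bit signed value to 32 bits.
   The conversion preserves the value, not the bit pattern. */
int32_t widen(int16_t narrow)
{
    return narrow;
}

/* Hypothetical helper: exposes the two's complement bit pattern of a
   32-bit value. Conversion to an unsigned type is well defined (modulo 2^32). */
uint32_t bits_of(int32_t value)
{
    return (uint32_t)value;
}
```

With these, `widen(-1)` is still -1, while `bits_of(widen(-1))` is 0xffffffff: the value survives the widening, and the 16-bit pattern 0xffff does not.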

Combining these elements, I believe that if WEOF has the value -1, then no member of an extended character set can have the value -1. I think this means that in your example implementation, either wchar_t would have to be unsigned (if it remained a 16-bit type) or (wchar_t) 0xffff could not be a valid character.

But there's another alternative that I originally forgot, and it is probably the best solution for your example implementation: the standard states in a footnote that the "value of the macro WEOF may differ from that of EOF and need not be negative". So your implementation's problem can be fixed by defining WEOF as INT_MAX, for example. That way it cannot have the same value as any wchar_t.
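A sketch of what that could look like for the hypothetical implementation in the question. The names `hyp_wchar_t`, `hyp_wint_t`, `HYP_WEOF`, `is_weof` and `count_collisions` are invented here so that the example does not clash with the real `<wchar.h>` definitions, and it assumes `short` is narrower than `int`.

```c
#include <limits.h>

typedef short hyp_wchar_t;   /* hypothetical 16-bit signed wchar_t */
typedef int   hyp_wint_t;    /* hypothetical 32-bit signed wint_t */
#define HYP_WEOF INT_MAX     /* outside the range of hyp_wchar_t */

/* Returns nonzero only for the end-of-file marker. */
int is_weof(hyp_wint_t wc)
{
    return wc == HYP_WEOF;
}

/* Counts how many hyp_wchar_t values compare equal to HYP_WEOF after
   the implicit conversion to hyp_wint_t. Assumes short is narrower
   than int, so the loop counter cannot overflow. */
int count_collisions(void)
{
    int collisions = 0;
    for (int v = SHRT_MIN; v <= SHRT_MAX; ++v) {
        hyp_wchar_t wc = (hyp_wchar_t)v;
        if (is_weof(wc))
            ++collisions;
    }
    return collisions;
}
```

Under that assumption no character value, including the one with bit pattern 0xffff, is mistaken for the end-of-file marker.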

A WEOF value that overlaps with a valid character value is a problem that I suppose might occur in real implementations (even if the standard seems to prohibit it), and it's similar to issues that have been raised about EOF possibly having the same value as some valid signed char value.
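For comparison, the usual way around the narrow-character version of this problem is to convert to `unsigned char` before calling the `<ctype.h>` functions, which are only defined for values representable as `unsigned char` or for `EOF`. A minimal sketch, with `is_lower_char` being a made-up wrapper name:

```c
#include <ctype.h>

/* Hypothetical wrapper: classifies a plain char safely even when char
   is signed. The cast maps a negative char to a value in the range
   0..UCHAR_MAX, so it can no longer be confused with EOF. */
int is_lower_char(char c)
{
    return islower((unsigned char)c) != 0;
}
```

There is no equally simple "unsigned wchar_t" conversion to reach for in the wide-character case, which is why the choice of WEOF matters there.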

It might be of interest that for most (all?) functions that can return WEOF to indicate some sort of problem, the standard requires that the function set some additional indication about the error or condition (for example, setting errno to a particular value, or setting the end-of-file indicator on a stream).
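A hedged sketch of how calling code might rely on those extra indications rather than on the WEOF comparison alone. The function name `read_wide` and its return convention are made up for this example, and it assumes the stream's end-of-file and error indicators were clear before the call.

```c
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: reads one wide character from stream.
   Returns 1 and stores the character in *out on success,
   0 at end of file, and -1 on a read or encoding error. */
int read_wide(FILE *stream, wchar_t *out)
{
    wint_t r = fgetwc(stream);
    if (r == WEOF) {
        if (feof(stream))
            return 0;
        if (ferror(stream))
            return -1;
        /* Neither indicator is set: treat r as a real character that
           merely compares equal to WEOF on an unusual implementation. */
    }
    *out = (wchar_t)r;
    return 1;
}
```

A caller would loop while `read_wide` returns 1 and inspect the final return value to tell end of file from an error.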

Another thing to note is that, as I understand it, 0xffff is a non-character in UCS-2 and UTF-16 (I have no idea about any other 16-bit encodings that might exist).

Michael Burr
  • to be portable, we cannot guarantee that `wchar_t` values will be interpreted as UCS-2 or UTF-16 – Igor Liferenko Mar 28 '17 at 07:58
  • `0xffff` *is* a valid character, because it can be represented by `wchar_t` in this example – Igor Liferenko Mar 28 '17 at 08:00
  • the quote you provided does not prevent `char c` from storing character code 255, however, so it is not applicable here – Igor Liferenko Mar 28 '17 at 08:03
  • @IgorLiferenko: there's an aspect that I forgot about that I added to the answer - I think it's probably a reasonable assumption that an implementation that needed to support 16-bit signed `wchar_t` might take in order to avoid collision with `WEOF`. – Michael Burr Mar 28 '17 at 08:04
  • I don't understand what you're saying about `char c` being set to 255. Also note that the fact that `0xffff` can be represented by the `wchar_t` type doesn't mean that `0xffff` is a valid character in the extended character set. What I was saying is that if `0xffff` is a valid character then either `wchar_t` needs to be unsigned or (after my update) that `WEOF` can't be `-1`. – Michael Burr Mar 28 '17 at 08:08
  • It is by definition that `0xffff` is a valid character in the extended character set of this hypothetical implementation. As for `char`, the situation is similar - `char c = (char)0xff;` and `#define WEOF (-1)` - the same kind of collision. We get around this by casting to `(unsigned char)`, although the same logic cannot be applied with `wchar_t`. So, your edit explained the way out of this. – Igor Liferenko Mar 28 '17 at 08:17