I need to process some Win-1251-encoded text (8-bit encoding, uses some of 128..255 for Cyrillic). As far as I can tell, C was created with 7-bit ASCII in mind, no explicit support for single-byte chars above 127. So I have several questions:

  • Which is the more proper type for this text: char[] or unsigned char[]?
  • If I use unsigned char[] with built-in functions (strlen, strcmp), the compiler warns about implicit casts to char*. Can such a cast break something? Should I re-implement some of the functions to support unsigned char strings explicitly?
Alex
  • For your purposes, check whether your compiler uses signed or unsigned values for char, and check every compiler you may use. Most compilers also have a flag to change the signedness of char. – Giacomo Catenazzi Oct 22 '20 at 07:22
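
The check suggested in the comment above can be sketched in code. This is only a minimal illustration: `plain_char_is_signed` and `byte_value` are hypothetical helper names, not standard functions, and the sketch assumes nothing beyond `<limits.h>`.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if plain char is signed on this
   implementation, 0 otherwise. CHAR_MIN is 0 when char is unsigned
   and negative when it is signed. */
int plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}

/* Hypothetical helper: reads byte i of a char buffer as a value in
   0..UCHAR_MAX, whatever the signedness of plain char. */
unsigned byte_value(const char *s, size_t i)
{
    assert(s != NULL);
    return (unsigned char)s[i];
}
```

With a helper like `byte_value`, a Windows-1251 byte such as 0xC0 reads back as 192 on both kinds of compiler, so code that needs the numeric value does not have to depend on a compiler flag.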

1 Answer

C has three distinct character types: signed char, unsigned char, and char, the last of which may be either signed or unsigned. For strings, you should just use char, since that will play nice with the built-in string functions such as strlen and strcmp. Those functions also work fine on characters with numeric values greater than 127. You should have no problems with your case using char.
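
As a rough sketch of the point above, the standard string functions can be called on Windows-1251 bytes held in `char`, and data that already lives in an `unsigned char` buffer can be passed through an explicit cast. The wrapper names `cp1251_length`, `cp1251_byte_compare`, and `ustrlen` are made up for this illustration.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical wrapper: length in bytes of a NUL-terminated
   Windows-1251 string. strlen only looks for the 0 byte, so bytes
   in 128..255 are counted like any others. */
size_t cp1251_length(const char *s)
{
    assert(s != NULL);
    return strlen(s);
}

/* Hypothetical wrapper: byte-wise ordering. strcmp compares the
   bytes as unsigned char, so the result does not depend on whether
   plain char is signed. This is byte order, not alphabetical order. */
int cp1251_byte_compare(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return strcmp(a, b);
}

/* Hypothetical wrapper for data stored as unsigned char: the
   explicit cast silences the pointer-type warning and is safe,
   because character types may alias each other. */
size_t ustrlen(const unsigned char *s)
{
    assert(s != NULL);
    return strlen((const char *)s);
}
```

For instance, the Windows-1251 encoding of "Привет" is the six bytes `"\xCF\xF0\xE8\xE2\xE5\xF2"`, and both `cp1251_length` and `ustrlen` report 6 for it. No re-implementation of `strlen` or `strcmp` is needed for this.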

Carl Norum
  • `char *` is okay for `strlen`, but others, like `isspace`, take `unsigned char` values. – Eric Postpischil Oct 21 '20 at 22:25
  • @EricPostpischil: no. `isspace` takes `int`. In general, single characters are passed as `int`, for compatibility with old C. – Giacomo Catenazzi Oct 22 '20 at 07:10
  • Yeah. Usually you can use the standard string functions regardless of the signedness of char (on POSIX systems, and many others, where *char is a single byte*). But this holds only for functions that do not look at the semantics of the characters; it does not cover, e.g., sorting or checking for alphanumeric characters. – Giacomo Catenazzi Oct 22 '20 at 07:28
  • @GiacomoCatenazzi: As I wrote, `isspace`, and the other `<ctype.h>` functions, take an `unsigned char` **value**. The **type** of the argument is of course `int`, but the value in it should be non-negative or `EOF`, per C 2018 7.4 1: “In all cases the argument is an `int`, the value of which shall be representable as an `unsigned char` or shall equal the value of the macro `EOF`. If the argument has any other value, the behavior is undefined.” If you have an array `x` of `char`, and `char` is signed, passing `x[i]` to `isspace` can pass a negative value, and the behavior is not defined. – Eric Postpischil Oct 22 '20 at 11:21
  • @GiacomoCatenazzi: There are also problems in the other direction. For example, `fgetc` returns an `unsigned char` converted to an `int`. If this value is assigned to a `char`, it is converted. If `char` is signed and the `unsigned char` value is not representable in `char`, the behavior is not portable: the C standard says the result of the conversion is implementation-defined or an implementation-defined signal is raised. – Eric Postpischil Oct 22 '20 at 11:24
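
The two pitfalls raised in these comments can be sketched as follows. This is an illustration only; `count_spaces` and `read_bytes` are hypothetical names, and the sketch assumes the default "C" locale.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

/* Hypothetical helper: counts whitespace bytes in a char string.
   The cast to unsigned char keeps the argument of isspace in the
   range 0..UCHAR_MAX even when plain char is signed. */
size_t count_spaces(const char *s)
{
    size_t n = 0;
    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        if (isspace((unsigned char)*s))
            ++n;
    }
    return n;
}

/* Hypothetical helper: reads at most cap bytes from a stream.
   The result of fgetc is kept in an int so that EOF stays distinct
   from the byte 0xFF, and each byte is stored as unsigned char, so
   no out-of-range conversion to a signed char takes place. */
size_t read_bytes(FILE *in, unsigned char *buf, size_t cap)
{
    size_t n = 0;
    int c;
    assert(in != NULL && buf != NULL);
    while (n < cap && (c = fgetc(in)) != EOF)
        buf[n++] = (unsigned char)c;
    return n;
}
```

The `int` variable matters for this encoding in particular: in Windows-1251 the byte 0xFF is the letter "я", and storing the result of `fgetc` in a signed `char` before comparing it with `EOF` can make that letter look like end of file.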