
Consider the following code:

#include <stdio.h>
#include <locale.h>

int main(void)
{
    char test[100];

    printf("WITHOUT LOCALE: á, é, í, ó, ú, ü, ñ, ¿, ¡\n");

    setlocale(LC_CTYPE, "Spanish");

    printf("WITH LOCALE: á, é, í, ó, ú, ü, ñ, ¿, ¡\n");

    fgets(test, sizeof test, stdin);

    printf("WITH FGETS AND LOCALE: %s\n", test);
    return 0;
}

And the following input for fgets:

á, é, í, ó, ú, ü, ñ, ¿, ¡

I'd expect it to support the special characters according to the locale that has been set up beforehand. However, this is the output:

WITHOUT LOCALE: ß, Ú, Ý, ¾, ·, ³, ±, ┐, í
WITH LOCALE: á, é, í, ó, ú, ü, ñ, ¿, ¡
WITH FGETS AND LOCALE:  , ', ¡, ¢, £, ?, ¤, ¨, ­

Any idea about what could be happening?

Kurolox
1 Answer


As I am repeatedly encountering questions like these in my 9-to-5 work, I came up with a side-by-side table of common 8-bit encodings.

Using that table, it appears that:

  • your editor saved the source in CP-1252 (where e.g. 'ó' -> 0xf3);
  • the first output line is those bytes interpreted as (DOS) CP-850 (0xf3 -> '¾');
  • the second line (after setlocale()) is the CP-1252 encoding (0xf3 -> 'ó');
  • the third line is input read as CP-850 and displayed as CP-1252 ('ó' -> 0xa2 -> '¢').

(I assumed a Windows platform -- CP-1252 -- as non-Windows platforms would not come up with CP-850 unless forced to at gunpoint. The source encoding could also be ISO 8859-1 / Western European, or ISO 8859-9 / Turkish, impossible to tell apart with the given character set. It could not be ISO 8859-15, as that would have turned 'ñ' into '€', not '¤'. It could not be any other ISO 8859 encoding, as only -1, -9 and -15 turn '¿' into '┐'.)

Note that the interpretation of non-ASCII-7 characters in C source code is implementation-defined, so you have to make sure that your editor, the terminal (if any), and the compiler agree on the encoding used. If at all possible, set your environment to use Unicode (UTF-8 being the most practical) throughout, to avoid exactly this kind of problem. I also recommend using octal escapes for anything non-ASCII-7 in your source, as you don't know what encoding settings others will use when feeding your source to their editors / compilers.

DevSolar
  • Thanks for the answer! It's been insightful. However, I'm not sure why the third print is displayed as Latin (forgive me, I'm not very experienced at programming) – Kurolox Nov 28 '17 at 12:04
  • @Kurolox: There are several settings that can come into play here. Your (unknown) editor assumes an encoding, possibly even one for display and another for saving. Your terminal (unknown, but from CP-850 playing a role here probably DOS box) has an encoding setting as well, possibly coming into play when running the (DOS?) editor as well as when displaying output and accepting input. Your compiler has an encoding setting when interpreting the source file. Without knowing any of these, it is a bit hard to say exactly what went "wrong". (It's not "wrong", every part does exactly what it is... – DevSolar Nov 28 '17 at 12:18
  • ...told to do, it's just not what you expected.) Generally, setting everything to a "catch-all" encoding all the time is the best way to go, and this "catch-all" encoding is UTF-8. Scrap every part of the toolchain that is not able to support setting to UTF-8. Since you don't know what editor / compiler *others* will be using on your source, I additionally recommend using octal escapes for everything in your source not ASCII-7. – DevSolar Nov 28 '17 at 12:21
  • @Kurolox: Note I edited my "reasoning" bullet points, I had a think-o in there. – DevSolar Nov 28 '17 at 12:37