I need to convert Latin ISO-8859-1 characters to UTF-8 (and perform the reverse operation too).

I first used the answer to "Is there a way to convert from UTF8 to iso-8859-1?" to perform the operation, and it works.

Now I want to use libiconv, which provides all the conversion mechanisms and should help keep my code simpler.

I have followed the example provided here: https://www.lemoda.net/c/iconv-example/iconv-example.html

and I wrote a method that looks like this:

char *iconvISO2UTF8(char *iso) {
    iconv_t iconvDesc = iconv_open ("ISO-8859-1", "UTF-8//TRANSLIT//IGNORE");

    if (iconvDesc == (iconv_t) - 1) {
        /* Something went wrong.  */
        if (errno == EINVAL)
            fprintf(stderr, "conversion from '%s' to '%s' not available", "ISO-8859-1", "UTF-8");
        else
            fprintf(stderr, "LibIcon initialization failure");          

        return NULL;
    }

    size_t iconv_value;
    char * utf8;
    size_t len;
    size_t utf8len; 
    char * utf8start;

    int len_start;


    len = strlen (iso);
    if (! len) {        
        fprintf(stderr, "iconvISO2UTF8: input String is empty.");           
        return NULL;
    }

    /* Assign enough space to put the UTF-8. */
    utf8len = 2 * len;
    utf8 = calloc (utf8len, sizeof (char));
    if (! utf8) {
        fprintf(stderr, "iconvISO2UTF8: Calloc failed.");           
        return NULL;
    }
    /* Keep track of the variables. */
    utf8start = utf8;
    len_start = len;

    iconv_value = iconv (iconvDesc, & iso, & len, & utf8, & utf8len);
    /* Handle failures. */
    if (iconv_value == (size_t) - 1) {      
        switch (errno) {
                /* See "man 3 iconv" for an explanation. */
            case EILSEQ:
                fprintf(stderr, "iconv failed: Invalid multibyte sequence, in string '%s', length %d, out string '%s', length %d\n", iso, (int) len, utf8start, (int) utf8len);             
                break;
            case EINVAL:
                fprintf(stderr, "iconv failed: Incomplete multibyte sequence, in string '%s', length %d, out string '%s', length %d\n", iso, (int) len, utf8start, (int) utf8len);              
                break;
            case E2BIG:
                fprintf(stderr, "iconv failed: No more room, in string '%s', length %d, out string '%s', length %d\n", iso, (int)  len, utf8start, (int) utf8len);                              
                break;
            default:
                fprintf(stderr, "iconv failed, in string '%s', length %d, out string '%s', length %d\n", iso, (int) len, utf8start, (int) utf8len);                             
        }
        return NULL;
    }


    if(iconv_close (iconvDesc) != 0) {
        fprintf(stderr, "libicon close failed: %s", strerror (errno));          
    }

    return utf8start;

}

When I call this function with plain ASCII characters, like "abracadabra", iconv works. But as soon as I pass it accented characters, like "éàèüöä", the iconv() call fails with an EILSEQ error:

iconv failed: Invalid multibyte sequence, in string 'éàèüöä', length 6, out string '', length 12

Here is a sample main program that crashes when stored in a source file encoded in ISO-8859-1 and compiled on a Linux system with ISO-8859-1 as the default charset:

int main(int argc, char **argv) {
    char *iso1 = "abracadabra";
    char *utf = iconvISO2UTF8(iso1);
    puts(utf);
    free(utf);

    char *iso2 = "éàèüöä";
    utf = iconvISO2UTF8(iso2);
    puts(utf);
    free(utf);
}

Is it possible to run this kind of conversion with iconv? If so, what's wrong with this code?

Guillaume
  • What is `iso`, i.e. where does it come from? Are you sure it's in the proper encoding? There shouldn't be any multibyte characters in a Latin-1 string. – unwind Sep 17 '18 at 09:23
  • `iso` is a `char *` that comes from the program (itself encoded in ISO-8859-1) and that must be converted to UTF-8 to be used in another library (jansson) that handles JSON. I have edited my question. – Guillaume Sep 17 '18 at 09:26
  • The compiler’s *source* character set, which is the encoding your source file is saved in, is not necessarily the same as the *execution* character set, nor the character set string literals are encoded in. If you want to encode a string constant in ISO-8859-1, you can use escapes such as `"\xe2\xe9\xef\xf8\xf9\xfd"`. This will work in any source encoding (and UTF-8 is the only encoding some compilers, including clang, support). You could also `#define EACUTE "\xe9"` and then take advantage of the preprocessor to write `"fianc" EACUTE`. – Davislor Sep 17 '18 at 10:32
  • In C11, you can also write `u8"Üñìçõðæ"` to get UTF-8. If your source encoding does not support a given Unicode codepoint, you can use `\u` or `\U` escape codes in any source encoding. – Davislor Sep 17 '18 at 10:38
  • I strongly recommend saving all C source files as UTF-8 with a byte order mark. (Some versions of MSVC cannot compile UTF-8 without the BOM, and some other compilers cannot compile anything but UTF-8 or ASCII, so that is the only thing that just works on every compiler I need to use.) Saving in a different encoding is *not* a safe, portable, or robust way to specify the encoding of string literals, and it *will* break your program in some toolchains. For example, when you posted here, your fragment was converted to UTF-8. – Davislor Sep 17 '18 at 10:44
  • Sidenote: C does not support _methods_. – too honest for this site Sep 17 '18 at 11:39
  • https://softwareengineering.stackexchange.com/questions/20909/method-vs-function-vs-procedure @Davislor, sure, this is a best practice; however, I'm working with a huge legacy code base, so this risky migration (hard-coded buffer lengths all over the place) won't happen. – Guillaume Sep 17 '18 at 11:49
  • @Guillaume That makes perfect sense. Pragmatism over dogmatism! If the code would silently break when the source execution set is not Latin-1, at least consider writing your strings with portable and obvious `\x` escapes. Your project is at high risk of bit rot. – Davislor Sep 17 '18 at 18:10
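The escape-based approach suggested in the comments can be sketched as follows. This is only an illustration: the macro names `EACUTE` and `AGRAVE`, the array `latin1_word`, and the helper `is_pure_ascii` are hypothetical names, not part of the question's code. Splitting the literal right after the escape also matters, because a hex escape absorbs every hex digit that follows it, so `"\xe9e"` would not mean "byte 0xE9 followed by e".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical macro names: single Latin-1 bytes written as hex escapes,
 * so the bytes do not depend on the encoding the source file is saved in. */
#define EACUTE "\xe9"
#define AGRAVE "\xe0"

/* Adjacent string literals are concatenated after escapes are processed,
 * so this is 'f','i','a','n','c',0xE9,'e' followed by the terminating NUL. */
const char latin1_word[] = "fianc" EACUTE "e";

/* Returns 1 if every byte of s is 7-bit ASCII, 0 otherwise.
 * A Latin-1 string with accented characters contains bytes above 0x7F. */
int is_pure_ascii(const char *s)
{
    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        if ((unsigned char)*s > 0x7F)
            return 0;
    }
    return 1;
}
```

A check like `is_pure_ascii` is one quick way to see whether a literal really contains the single high bytes expected of Latin-1 before handing it to a converter.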

1 Answer

Please read the iconv_open(3) manual page carefully:

iconv_t iconv_open(const char *tocode, const char *fromcode);

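The destination encoding comes first and the source encoding second. If you're converting to UTF-8 from ISO 8859-1, then this call has its arguments the wrong way round:

iconv_t iconvDesc = iconv_open ("ISO-8859-1", "UTF-8//TRANSLIT//IGNORE");

As written, it asks iconv to read the input as UTF-8, and the lone high bytes of a Latin-1 string are not valid UTF-8, hence the EILSEQ. It should say

iconv_t iconvDesc = iconv_open ("UTF-8//TRANSLIT//IGNORE", "ISO-8859-1");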
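
Since the question also asks about the reverse operation, both directions can share one helper once the argument order is right. The following is a minimal sketch, assuming glibc's iconv on Linux; `convert_string`, `iso_to_utf8` and `utf8_to_iso` are hypothetical names, not from the question. Unlike the question's function, it reserves room for the terminating NUL and closes the descriptor and frees the buffer on the error path. It leaves out `//TRANSLIT//IGNORE` to keep the output size bound simple.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: converts a NUL-terminated string from `fromcode`
 * to `tocode`. Returns a heap-allocated string the caller must free(),
 * or NULL on failure. */
static char *convert_string(const char *tocode, const char *fromcode,
                            const char *input)
{
    assert(tocode != NULL && fromcode != NULL && input != NULL);

    /* Destination encoding first, source encoding second. */
    iconv_t cd = iconv_open(tocode, fromcode);
    if (cd == (iconv_t)-1) {
        perror("iconv_open");
        return NULL;
    }

    size_t in_left = strlen(input);
    /* Assumption: 4 output bytes per input byte is enough for the two
     * encodings used here; the extra byte holds the terminating NUL. */
    size_t out_size = 4 * in_left + 1;
    size_t out_left = out_size - 1;
    char *out = calloc(out_size, 1);
    if (out == NULL) {
        iconv_close(cd);
        return NULL;
    }

    /* iconv() advances both pointers, so it works on copies and the
     * start of the output buffer stays available in `out`. */
    char *in_ptr = (char *)input;
    char *out_ptr = out;
    size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    int saved_errno = errno;
    iconv_close(cd);

    if (rc == (size_t)-1) {
        fprintf(stderr, "iconv: %s\n", strerror(saved_errno));
        free(out);
        return NULL;
    }
    *out_ptr = '\0';
    return out;
}

/* ISO-8859-1 -> UTF-8 */
char *iso_to_utf8(const char *iso)
{
    return convert_string("UTF-8", "ISO-8859-1", iso);
}

/* UTF-8 -> ISO-8859-1; fails if a character has no Latin-1 equivalent. */
char *utf8_to_iso(const char *utf8)
{
    return convert_string("ISO-8859-1", "UTF-8", utf8);
}
```

Callers should check the result for NULL before passing it to `puts` or `free`-ing it, which the sample `main` in the question does not do.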
  • Ahem, it was obviously a PBCK case... In my mind it was convert(from, to), and I didn't pay attention to the argument order after that! – Guillaume Sep 17 '18 at 09:45