
I'm using iconv's transliteration feature to convert a Unicode string to its nearest ASCII equivalent. However, the string contains some symbols that have no ASCII equivalent. I want to keep such symbols in the output instead of dropping them.

Currently, here's what I am doing:

iconv_t cd = iconv_open("ASCII//IGNORE//TRANSLIT", "UTF-8");
const char *utf8 = "ç ß ∑ a";

char* in = const_cast<char*>(utf8);
size_t in_bytes = strlen(in);

char buf[BUFSIZ] = {};
char* out = buf;
size_t out_bytes = sizeof(buf);

iconv(cd, &in, &in_bytes, &out, &out_bytes);

printf("%s", buf);

// prints 
c ss  a

How do I configure iconv to produce an output like the following:

c ss ∑ a

If this is not possible with iconv, is there another way to achieve this programmatically?

Saxtheowl
jeffreyveon
  • From the doc: "The iconv function converts one multibyte character at a time" – stark Oct 02 '19 at 16:06
  • That seems like such a weird thing to do though :D What are you going to use this interesting function for? – Ry- Oct 02 '19 at 16:07
  • Well, first of all the command you posted does not produce that output on my machine, but rather it errors out (maybe remove the `//IGNORE`?). Secondly, `iconv` is just a simple command line utility, in a C program you *should* be able to just try and translate each Unicode code-point by itself and see the result. What did you write that didn't work? You should add the relevant C code. – Marco Bonelli Oct 02 '19 at 16:11
  • I've added the actual code. – jeffreyveon Oct 02 '19 at 16:53

1 Answer


iconv does not support the conversion behaviour you want out of the box, because it is quite an odd behaviour: if it's OK to have a ∑ in the output, why would it not be OK to have a ß in the output?

Anyway, you can implement this conversion through a function of your own, that uses iconv, as follows:

  1. Allocate two conversion descriptors:
    iconv_t cd0 = iconv_open("UTF-8", "UTF-8");
    iconv_t cd1 = iconv_open("ASCII//TRANSLIT", "UTF-8");
    
  2. Use a loop that converts part of the string repeatedly, through iconv() with cd1. When the call fails with errno == EILSEQ, you know that it's because of a character that cannot be transliterated to ASCII.
  3. At this point, use an iconv() call with cd0 to copy one and only one character. You do this by calling iconv() with an input byte count of 1 (in_bytes in the question's code); if that fails, retry with 2, and so on up to 4, the maximum length of a UTF-8 character. (If all of these fail, you must have invalid input; your best bet is to skip one input byte and leave a single '?' in the output.)
  4. After the no-op conversion of a single character, go back to step 2.
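
The steps above could be sketched roughly as follows. This is only an illustration, not tested against every iconv implementation: translit_keep is a hypothetical helper name, the 256-byte buffer is an arbitrary choice, and the sketch assumes that your iconv reports an untransliterable character with errno == EILSEQ. Some implementations may substitute a '?' instead of failing, in which case the fallback in step 3 is never reached.

```cpp
#include <cerrno>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <iconv.h>

// Hypothetical helper (not part of iconv): transliterate UTF-8 to ASCII where
// possible, and copy characters that cannot be transliterated unchanged.
std::string translit_keep(const std::string& utf8) {
    // Step 1: one descriptor for the no-op copy, one for transliteration.
    iconv_t cd0 = iconv_open("UTF-8", "UTF-8");
    iconv_t cd1 = iconv_open("ASCII//TRANSLIT", "UTF-8");
    if (cd0 == (iconv_t)-1 || cd1 == (iconv_t)-1) {
        if (cd0 != (iconv_t)-1) iconv_close(cd0);
        if (cd1 != (iconv_t)-1) iconv_close(cd1);
        throw std::runtime_error("iconv_open failed");
    }

    std::string input = utf8;          // iconv() wants a non-const char*
    char* in = &input[0];
    size_t in_left = input.size();
    std::string result;
    char buf[256];                     // arbitrary scratch buffer size

    while (in_left > 0) {
        // Step 2: transliterate as much as possible with cd1.
        char* out = buf;
        size_t out_left = sizeof buf;
        size_t rc = iconv(cd1, &in, &in_left, &out, &out_left);
        int err = errno;               // save errno before anything else runs
        result.append(buf, out - buf); // keep whatever was converted so far
        if (rc != (size_t)-1) break;   // all input consumed
        if (err == E2BIG) continue;    // scratch buffer full, go around again

        // Step 3: cd1 stopped at a character it cannot handle (EILSEQ, or
        // EINVAL for a truncated tail). Copy exactly one character with cd0,
        // trying input lengths of 1, 2, 3 and 4 bytes.
        bool copied = false;
        for (size_t n = 1; n <= 4 && n <= in_left && !copied; ++n) {
            iconv(cd0, nullptr, nullptr, nullptr, nullptr);  // reset state
            char* in1 = in;
            size_t in1_left = n;
            out = buf;
            out_left = sizeof buf;
            if (iconv(cd0, &in1, &in1_left, &out, &out_left) != (size_t)-1
                && in1_left == 0) {
                result.append(buf, out - buf);
                in += n;
                in_left -= n;
                copied = true;
            }
        }
        if (!copied) {                 // invalid input: skip one byte
            result += '?';
            ++in;
            --in_left;
        }
        // Step 4: loop back to step 2 for the rest of the string.
    }

    iconv_close(cd0);
    iconv_close(cd1);
    return result;
}
```

You would call it as translit_keep("ç ß ∑ a") and print the returned string. What the transliterated parts look like depends on your iconv implementation and locale, so check the result on your own system.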
Bruno Haible