I want to convert Windows-1252 text to UTF-8. The Windows-1252 text can contain invalid bytes (e.g. 0x90). I need to replace invalid bytes with a replacement character ('?').

Example: "a[0x90]b" (Windows-1252) -> "a?b" (UTF-8)

I tried 'UTF-8//TRANSLIT', but iconv() stops with an error ('Invalid or incomplete multibyte or wide character'). 'UTF-8//IGNORE' does work, but removing invalid characters is not what I want.

I use iconv from the standard C library in C++.

Maybe someone can give me a hint.

Code used:

#include <array>
#include <cassert>
#include <cstdio>
#include <cstring>
#include <iconv.h>

int main()
{
    //char* res=std::setlocale(LC_ALL, "de_DE"); //no effect

    const iconv_t iconv_handle = iconv_open("UTF-8//TRANSLIT", "WINDOWS-1252");
    assert(iconv_handle != (iconv_t)-1);

    // The literal is split so that 'b' is not parsed as part of the \x90 hex escape.
    const char input_text[] = "a\x90" "b"; //'a', 'invalid byte', 'b'
    std::array<char, sizeof(input_text) * 4> utf8_result_buffer{};

    char* in = (char*)input_text;
    char* out = utf8_result_buffer.data();
    size_t srclen = strlen(input_text);
    size_t outbytesleft = utf8_result_buffer.size();

    const size_t iconv_res = iconv(iconv_handle, &in, &srclen, &out, &outbytesleft);
    if (iconv_res == (size_t)-1)
    {
        perror("iconv");
    }

    //result: utf8_result_buffer: "a" and 'Invalid or incomplete multibyte or wide character' error
    iconv_close(iconv_handle);
}
pulp
  • Go through the input string and replace invalid characters with `?` before passing it to `iconv`. – Paul Sanders May 28 '21 at 15:55
  • Do a pre-pass and replace the `undefined` characters with `?` see https://en.wikipedia.org/wiki/Windows-1252 – Richard Critten May 28 '21 at 15:56
  • @PaulSanders: I chose Windows-1252 to simplify the question. In the real-world scenario, the input codepage/encoding is chosen by the user from all available codepages/encodings. – pulp May 28 '21 at 16:20
  • `//TRANSLIT` is the *documented* way to do what you want, but if it is not working for UTF-8 then it is probably a bug in iconv. Searching online, I see a few other references to errors when using `UTF-8//TRANSLIT` – Remy Lebeau May 28 '21 at 17:10
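
The pre-pass suggested in the comments above could look roughly like the following sketch. It assumes the input really is Windows-1252, where the five byte values 0x81, 0x8D, 0x8F, 0x90, and 0x9D are undefined. The helper names `is_undefined_cp1252` and `replace_undefined_cp1252` are made up for illustration and are not part of iconv. It does not cover the case where the user picks an arbitrary source encoding.

```cpp
#include <string>

// Hypothetical helper: true for the five byte values that
// Windows-1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D).
bool is_undefined_cp1252(unsigned char byte)
{
    return byte == 0x81 || byte == 0x8D || byte == 0x8F
        || byte == 0x90 || byte == 0x9D;
}

// Hypothetical helper: returns a copy of the input in which every
// undefined Windows-1252 byte is replaced by `replacement`.
// Assumption: the input is Windows-1252; other encodings need their own check.
std::string replace_undefined_cp1252(const std::string& input, char replacement = '?')
{
    std::string result = input;
    for (char& c : result)
    {
        // Cast first: plain char may be signed, so 0x90 could compare as negative.
        if (is_undefined_cp1252(static_cast<unsigned char>(c)))
        {
            c = replacement;
        }
    }
    return result;
}
```

Calling `replace_undefined_cp1252` on the bytes 'a', 0x90, 'b' yields "a?b". Because every remaining byte is defined in Windows-1252, the result should then convert with `iconv()` without hitting the invalid-sequence error.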

0 Answers