
I'm currently working on a project that uses Hunspell inside Node. The goal is cross-platform spell-checking that handles encodings properly (node-spellchecker).

I have to use arbitrary dictionaries with different encodings. Most have SET UTF-8 in the *.aff file, but other dictionaries have encodings like SET ISO8859-1. I get UTF-8 from Node, but I need to convert it into the dictionary's encoding. Then I need to convert in the reverse direction to handle suggestions.

On Linux, I can use iconv for the conversion, but I don't have that on the Windows side of things. I'd rather not require UTF-8 dictionaries (that approach does work, though).

Any suggestions or hints on where to start would be greatly appreciated. WideCharToMultiByte covers one step, but I couldn't find a MultiByteToMultiByte like I would expect.

Things I Have

const char *from_encoding_name = "UTF-8"; // This can be swapped
const char *to_encoding_name = "ISO8859-1"; // This can be swapped
const char *word = /* möchtzn encoded in UTF-8 */;

Things I Want

const char *dictionaryWord = /* möchtzn encoded in ISO-8859-1 */;

Thank you.

dmoonfire
  • Why not use MultiByteToWideChar to convert from UTF-8, then WideCharToMultiByte to convert to the encoding you want? – Michael Chourdakis Dec 19 '18 at 05:32
  • `MultiByteToMultiByte` is a useless function if one of the code pages isn't UTF-8, as all other transforms are lossy. I can't imagine anyone thought it was a good idea to implement a function when pretty much anything it enabled would actually damage data. – Chris Becke Dec 19 '18 at 15:14
  • 1
    @Chris that is not the fault of the function itself. Just because most ANSI charsets support only the small subset of the full Unicode repertoire, and converting between charsets *may* be lossy, does not mean such a function is useless, it just means you have to be careful with it, and accept that not all data will convert without loss. For instance, converting from UTF-8 to ISO-8859-1 is perfectly fine as long as the UTF-8 data uses only codepoints that ISO-8859-1 supports. – Remy Lebeau Dec 19 '18 at 18:39

2 Answers


I don't think an analog of MultiByteToMultiByte exists in the WinAPI. I'd use two calls: MultiByteToWideChar and then WideCharToMultiByte.

BTW, I looked into the sources of the .NET method Encoding.Convert, and there the conversion is also done through UTF-16.
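
For illustration, an untested sketch of that two-call pivot (the helper name Utf8ToCodePage is mine, and error handling is kept minimal):

#include <windows.h>
#include <string>
#include <stdexcept>

// Sketch: convert UTF-8 bytes to a target ANSI codepage (e.g. 28591 for
// ISO-8859-1) by pivoting through UTF-16. The result is lossy if the
// target codepage cannot represent every character in the input.
std::string Utf8ToCodePage(const std::string &utf8, UINT targetCp)
{
    if (utf8.empty()) return std::string();

    // Step 1: UTF-8 -> UTF-16. The first call only queries the length.
    int wlen = MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), NULL, 0);
    if (wlen == 0) throw std::runtime_error("MultiByteToWideChar failed");
    std::wstring wide(wlen, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), &wide[0], wlen);

    // Step 2: UTF-16 -> target codepage, same query-then-convert pattern.
    int alen = WideCharToMultiByte(targetCp, 0, wide.data(), wlen, NULL, 0, NULL, NULL);
    if (alen == 0) throw std::runtime_error("WideCharToMultiByte failed");
    std::string out(alen, '\0');
    WideCharToMultiByte(targetCp, 0, wide.data(), wlen, &out[0], alen, NULL, NULL);
    return out;
}

Going back the other way, swap the codepages: MultiByteToWideChar(28591, ...) first, then WideCharToMultiByte(CP_UTF8, ...).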

Boris Nikitin

FYI, iconv.exe is available for Windows; you just have to install it manually. Or you can embed libiconv directly in your project.

That being said, what you are asking for can be done using Microsoft APIs:

  1. the Win32 MultiByteToWideChar() and WideCharToMultiByte() functions. First decode the UTF-8 input to UTF-16 using MultiByteToWideChar(CP_UTF8), then encode the UTF-16 to ISO-8859-1 using WideCharToMultiByte(28591) (or whatever target codepage you need). Just swap the codepages when going back the other way.

  2. the IMultiLanguage::ConvertString() method, or the IMultiLanguage::CreateConvertCharset() and IMLangConvertCharset::DoConversion() methods. These can convert input from one codepage directly to another.

You can use any of these to implement your own MultiByteToMultiByte() wrapper function.
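
For illustration, a rough, untested sketch of such a wrapper built on option 2 (the helper name ConvertCodePage is mine; it assumes COM was already initialized with CoInitializeEx, and it simply over-allocates the output buffer rather than querying the exact size):

#include <windows.h>
#include <mlang.h>
#include <string>
#include <stdexcept>

#pragma comment(lib, "ole32.lib") // CoCreateInstance
#pragma comment(lib, "uuid.lib")  // CLSID_CMultiLanguage / IID_IMultiLanguage

// Sketch: convert bytes from one codepage to another in a single step
// via IMultiLanguage::ConvertString. 65001 = UTF-8, 28591 = ISO-8859-1.
// Assumes CoInitializeEx() has already been called on this thread.
std::string ConvertCodePage(const std::string &src, DWORD fromCp, DWORD toCp)
{
    IMultiLanguage *pML = NULL;
    HRESULT hr = CoCreateInstance(CLSID_CMultiLanguage, NULL,
                                  CLSCTX_INPROC_SERVER,
                                  IID_IMultiLanguage, (void**)&pML);
    if (FAILED(hr)) throw std::runtime_error("CoCreateInstance failed");

    DWORD mode = 0;                  // conversion context, in/out
    UINT srcSize = (UINT)src.size(); // in: source size in bytes
    // Simplification: assume the output needs at most 4 bytes per input byte.
    std::string dst(src.size() * 4 + 4, '\0');
    UINT dstSize = (UINT)dst.size(); // in: buffer size; out: bytes written

    hr = pML->ConvertString(&mode, fromCp, toCp,
                            (BYTE*)src.data(), &srcSize,
                            (BYTE*)&dst[0], &dstSize);
    pML->Release();
    if (FAILED(hr)) throw std::runtime_error("ConvertString failed");

    dst.resize(dstSize);
    return dst;
}

For example, ConvertCodePage(word, 65001, 28591) would convert UTF-8 to ISO-8859-1 for the dictionary, and swapping the two codepage arguments converts the suggestions back.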

Remy Lebeau