-1

I'm looking for a way to convert wchar_t to multi-bytes char, without using wctomb or any ready-made routine. I have to do that in C, not C++, and the interoperability doesn't matter here.

My goal is to print a wchar byte by byte using the write syscall. For example, the 'é' character is equivalent to 0xe9 encoded into a wchar, and is equivalent to ff ff ff c3 ff ff ff a9 in its multi-byte form. How can I switch from one form to the other?

Thanks in advance.

  • 2
    `I'm looking for a way to convert wchar_t to multi-bytes char, without using wctomb` Do you know the encoding used to store `wchar_t` and the encoding used for multibyte string? – KamilCuk Jan 12 '21 at 11:07
  • No, I don't know the encoding used, how can I find that? I'm on Debian, using gcc. – Kap Merang Jan 12 '21 at 11:11
  • `using gcc` Read its documentation. `or any ready-made routine` Would be extremely hard. – KamilCuk Jan 12 '21 at 11:15
  • It's for a school project, so I guess it's not that hard once you know the trick. – Kap Merang Jan 12 '21 at 11:27
  • It very much sounds as if you are expected to do a UTF-32 (aka UCS-4) to UTF-8 conversion. That's straightforward to implement. – Codo Jan 12 '21 at 12:37

1 Answer

0

I'm looking for a way to convert wchar_t to multi-bytes char, without using wctomb or any ready-made routine

This is the same as conversion between any two encodings. First determine the encoding used to encode characters in source and destination, then translate characters from one encoding to another.

So first wchar_t: its encoding is (or should be) constant and determined by your compiler and environment, so read about your environment and about your compiler. You specified Debian, using gcc, so read the gcc documentation; nowadays on Linux, wchar_t is meant to represent one UCS-4 "character". Note that on Windows, wchar_t is UTF-16.

Then determine the destination encoding, the encoding of the multi-byte string: it depends on the locale. Read and parse the LC_CTYPE locale category; you might want to read about the POSIX locale and about locale naming. Then, because of "without using any ready-made routine", in the sad case where the locale name doesn't specify a codeset, you have to write your own platform-specific parser for the locale-specific files and infer the default character encoding for the current locale (I am not really sure how that works exactly; you have to find the locale's language category). Pages like man 7 locale and man 7 charsets look like a good read.

Then, after determining the source and destination encodings, you need to write a routine that translates one encoding into the other. Because of "without using any ready-made routine" you don't want to use iconv, which means you have to write it yourself. That comes down to reading the specification of both encodings, learning which characters are represented by which code points in each, and then deciding how to translate each and every code point from one encoding to the other.

All in all, other projects' source code, like the glibc source code, libiconv, or libunistring, might be a source of inspiration.

It's for a school project, so I guess is not that hard once you know the trick.

Most probably the multibyte encoding is UTF-8; Unicode dominates today's world. As such, you'll want to research how to convert UTF-32 to UTF-8, which is actually a simple routine.

KamilCuk