UnicodeString to char* (UTF-8)

Question

I am using the ICU library in C++ on OS X. All of my strings are UnicodeStrings, but I need to use system calls like fopen, fread and so forth. These functions take const char* or char* as arguments. I have read that OS X supports UTF-8 internally, so that all I need to do is convert my UnicodeString to UTF-8, but I don't know how to do that.

UnicodeString has a toUTF8() member function, but it returns a ByteSink. I've also found these examples: http://source.icu-project.org/repos/icu/icu/trunk/source/samples/ucnv/convsamp.cpp and read about using a converter, but I'm still confused. Any help would be much appreciated.

score 7 · Accepted Answer · answered Jun 30 '10 at 17:31

Call UnicodeString::extract(...) to extract into a char*. Pass NULL for the converter to get the default converter, which uses the charset your OS is using.

answered by Steven R. Loomis

Thank you! That does work. I'm not sure about the destCapacity argument and the length of the UnicodeString. This code works: http://codepad.org/blaSP0ex but you'll notice I double the .length() of the UnicodeString manually to make up for the multibyte string. How can I make sure there is enough space in my char* dest? – zfedsa Jun 30 '10 at 18:55
http://icu-project.org/apiref/icu4c/classUnicodeString.html#125255f27efd817e38806d76d9567345 It will return the length needed for the output string and a U_BUFFER_OVERFLOW_ERROR in status if there wasn't enough space. See http://userguide.icu-project.org/strings#TOC-Using-C-Strings:-NUL-Terminated-vs%2e – Steven R. Loomis Jul 01 '10 at 00:04
Thank you. The documentation says that it's best to guess the size and if there's a buffer overflow error, then to call the extract function again with the length returned from the first call. I do that here: http://codepad.org/nyp5yJWB but the second call still fails, even though I provide it with the correct length returned from the first extract call. What am I doing wrong? – zfedsa Jul 01 '10 at 14:37
I should have used delete[] instead of delete, and I don't think I need sizeof (I usually write C), but those are minor details. – zfedsa Jul 01 '10 at 15:03
That's right, but you need to reset the error code after the failure. ICU functions just exit if the error is already set. http://userguide.icu-project.org/design#TOC-Error-Handling – Steven R. Loomis Jul 01 '10 at 21:46
Thank you! Everything works now. I don't mean to keep pestering you, but it just seems like you're the only one who knows anything about ICU. – zfedsa Jul 02 '10 at 08:58
You might want to amend this answer; while correct at the time of writing, since ICU 4.2 we have more comfortable solutions (as pointed out by the other two answers). FWIW, I found this answer while looking for a *legacy* solution, so have an upvote. ;-) – DevSolar May 20 '16 at 13:00
extract didn't work for me; I always got some error. I needed to log a value to a file, so I used charAt(index): log_statement_orig.charAt(0)<< "\n"; log_statement_orig.charAt(1)<< "\n"; ... It seems stupid, but it helped. I pasted the output integers into a Unicode converter (decimal to Unicode) and got the value of the string. Building my project alone takes 30 minutes, and after a few errors I decided to go this way. Of course it was only for debugging, not usable in production code. – fresko Apr 16 '20 at 11:16
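
The guess-then-retry approach discussed in the comments on this answer, including the status reset that fixed the failing second call, can be sketched roughly as follows. So that the sketch builds without ICU installed, `MockStatus` and `mockExtract` are hypothetical stand-ins (not ICU names) for `UErrorCode` and `UnicodeString::extract(dest, destCapacity, NULL, status)`. With ICU itself, the reset would be `status = U_ZERO_ERROR` after seeing `U_BUFFER_OVERFLOW_ERROR`.

```cpp
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical stand-in for ICU's UErrorCode (not the real ICU enum).
enum MockStatus { MOCK_ZERO_ERROR, MOCK_BUFFER_OVERFLOW_ERROR };

// Hypothetical stand-in for UnicodeString::extract(dest, destCapacity, NULL, status).
// It mimics two conventions described in the comments: it does nothing if the
// status already holds a failure, and on overflow it still returns the length
// that would have been needed.
std::int32_t mockExtract(const std::string& source, char* dest,
                         std::int32_t destCapacity, MockStatus& status) {
    if (status != MOCK_ZERO_ERROR) {
        return 0;  // an earlier failure is still set: exit without doing work
    }
    const std::int32_t needed = static_cast<std::int32_t>(source.size());
    if (needed > destCapacity) {
        status = MOCK_BUFFER_OVERFLOW_ERROR;
        return needed;  // tells the caller how large the buffer must be
    }
    std::memcpy(dest, source.data(), source.size());
    return needed;
}

// Guess a buffer size; on overflow, reset the status, grow the buffer to the
// reported length, and call again.
std::string extractWithRetry(const std::string& source, std::int32_t guess) {
    std::vector<char> buffer(guess > 0 ? static_cast<std::size_t>(guess) : 1);
    MockStatus status = MOCK_ZERO_ERROR;
    std::int32_t length = mockExtract(source, buffer.data(),
                                      static_cast<std::int32_t>(buffer.size()),
                                      status);
    if (status == MOCK_BUFFER_OVERFLOW_ERROR) {
        status = MOCK_ZERO_ERROR;  // without this reset the second call exits early
        buffer.resize(static_cast<std::size_t>(length));
        length = mockExtract(source, buffer.data(),
                             static_cast<std::int32_t>(buffer.size()), status);
    }
    if (status != MOCK_ZERO_ERROR) {
        return std::string();  // some other failure: return an empty string
    }
    return std::string(buffer.data(), static_cast<std::size_t>(length));
}
```

Calling `extractWithRetry(text, 4)` on a longer string takes the overflow path once and then succeeds, which is the behaviour the second codepad attempt was missing.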

score 4 · Answer 2 · answered Apr 06 '14 at 05:58

The UTF-8 chapter of the ICU User Guide describes the available methods for doing this.

The simplest way to use UTF-8 strings in UTF-16 APIs is via the C++ icu::UnicodeString methods fromUTF8(const StringPiece &utf8) and toUTF8String(StringClass &result). There is also toUTF8(ByteSink &sink).

extract() is no longer the preferred approach:

Note: icu::UnicodeString has constructors, setTo() and extract() methods which take either a converter object or a charset name. These can be used for UTF-8, but are not as efficient or convenient as the fromUTF8()/toUTF8()/toUTF8String() methods mentioned above.

score 3 · Answer 3 · edited Oct 24 '13 at 13:40

This will work:

// toUTF8String() appends the UTF-8 form of uStr to utf8; utf8.c_str() then gives a const char*.
std::string utf8;
uStr.toUTF8String(utf8);
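
To connect this back to the question's fopen and fread calls, here is a minimal sketch of what could follow once the UTF-8 bytes are in a `std::string`. The helper names `openUtf8Path` and `readAll` are made up for illustration, and the sketch assumes a platform such as OS X where `fopen` accepts a UTF-8 encoded path. The ICU conversion step is shown only in a comment so the block compiles without ICU.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: opens a file whose path is already UTF-8 encoded.
// With ICU, the bytes would come from the snippet above:
//     std::string utf8Path;
//     uStr.toUTF8String(utf8Path);   // uStr is an icu::UnicodeString
std::FILE* openUtf8Path(const std::string& utf8Path, const char* mode) {
    // c_str() yields the NUL-terminated const char* that fopen expects.
    return std::fopen(utf8Path.c_str(), mode);
}

// Hypothetical helper: reads the rest of an open file into a std::string
// using fread in fixed-size chunks.
std::string readAll(std::FILE* file) {
    std::string contents;
    char chunk[4096];
    std::size_t count = 0;
    while ((count = std::fread(chunk, 1, sizeof chunk, file)) > 0) {
        contents.append(chunk, count);
    }
    return contents;
}
```

The `std::string` owns its bytes, so the pointer from `c_str()` stays valid for as long as the string is alive and unmodified. No manual buffer sizing or `delete[]` is needed.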

edited Oct 24 '13 at 13:40 by Bucket

answered Oct 23 '13 at 23:54 by gsf

This is an old post, as you can see. Since then I have been working in `go` and `java`. The `std::string` should be taken care of, but I do not remember what the `icu` ownership was for `uStr` – gsf Aug 09 '18 at 16:50
@Johnny_D, `std::string` always owns its own array of char. So don't worry about `std::string utf8;`: it will destroy its own copy of the string data. – Mister_Jesus Sep 26 '19 at 16:16
