9

I am using the ICU library in C++ on OS X. All of my strings are UnicodeStrings, but I need to use system calls like fopen, fread and so forth. These functions take const char* or char* as arguments. I have read that OS X supports UTF-8 internally, so that all I need to do is convert my UnicodeString to UTF-8, but I don't know how to do that.

UnicodeString has a toUTF8() member function, but it returns a ByteSink. I've also found these examples: http://source.icu-project.org/repos/icu/icu/trunk/source/samples/ucnv/convsamp.cpp and read about using a converter, but I'm still confused. Any help would be much appreciated.

afuzzyllama
zfedsa

3 Answers

7

Call UnicodeString::extract(...) to extract into a char*. Pass NULL for the converter to get the default converter, which converts to the charset your OS is using.

Steven R. Loomis
  • Thank you! That does work. I'm not sure about the destCapacity argument and the length of the UnicodeString. This code works: http://codepad.org/blaSP0ex but you'll notice I double the .length() of the UnicodeString manually to make up for the multibyte string. How can I make sure there is enough space in my char* dest? – zfedsa Jun 30 '10 at 18:55
  • http://icu-project.org/apiref/icu4c/classUnicodeString.html#125255f27efd817e38806d76d9567345 It will return the length needed for the output string and a U_BUFFER_OVERFLOW_ERROR in status if there wasn't enough space. See http://userguide.icu-project.org/strings#TOC-Using-C-Strings:-NUL-Terminated-vs%2e – Steven R. Loomis Jul 01 '10 at 00:04
  • Thank you. The documentation says that it's best to guess the size and if there's a buffer overflow error, then to call the extract function again with the length returned from the first call. I do that here: http://codepad.org/nyp5yJWB but the second call still fails, even though I provide it with the correct length returned from the first extract call. What am I doing wrong? – zfedsa Jul 01 '10 at 14:37
  • I forgot delete[] instead of delete, and I don't think I need sizeof (I use C usually), but those are minor details. – zfedsa Jul 01 '10 at 15:03
  • That's right, but you need to reset the error code after the failure. ICU functions just exit if the error is already set. http://userguide.icu-project.org/design#TOC-Error-Handling – Steven R. Loomis Jul 01 '10 at 21:46
  • Thank you! Everything works now. I don't mean to keep pestering you, but it just seems like you're the only one who knows anything about ICU. – zfedsa Jul 02 '10 at 08:58
  • You might want to amend this answer; while correct at the time of writing, since ICU 4.2 we have more comfortable solutions (as pointed out by the other two answers). FWIW, I found this answer while looking for a *legacy* solution, so have an upvote. ;-) – DevSolar May 20 '16 at 13:00
  • extract didn't work for me; I always got some error. I needed to log a value to a file, so I used charAt(index): log_statement_orig.charAt(0)<< "\n"; log_statement_orig.charAt(1)<< "\n"; ... It seems stupid, but it helped. I pasted the output integers into a Unicode converter (decimal to Unicode) and got the value of the string. Just building my project takes 30 minutes, and after a few errors I decided to go this way. Of course it was only for debugging, not usable in production code. – fresko Apr 16 '20 at 11:16
4

The UTF-8 chapter of the ICU User Guide describes the methods for doing this:

The simplest way to use UTF-8 strings in UTF-16 APIs is via the C++ icu::UnicodeString methods fromUTF8(const StringPiece &utf8) and toUTF8String(StringClass &result). There is also toUTF8(ByteSink &sink).

extract() is no longer the preferred approach:

Note: icu::UnicodeString has constructors, setTo() and extract() methods which take either a converter object or a charset name. These can be used for UTF-8, but are not as efficient or convenient as the fromUTF8()/toUTF8()/toUTF8String() methods mentioned above.

Map X
3

This will work:

std::string utf8;
uStr.toUTF8String(utf8);
Bucket
gsf
  • This is an old post, as you can see. Since then I have been working in `go` and `java`. The `std::string` takes care of itself, but I do not remember what the `icu` ownership rules were for `uStr` – gsf Aug 09 '18 at 16:50
  • @Johnny_D, a `std::string` always owns its own array of char. So don't worry about `std::string utf8;`: its destructor frees that copy of the string data. – Mister_Jesus Sep 26 '19 at 16:16