I need to handle Unicode strings in C. I have heard that ICU is the appropriate library to use, but I am not having any luck getting started.

So my question: can anyone provide a link to a good beginner's tutorial on using Unicode strings with ICU in C?

P.S. I have installed libicu44 (under Ubuntu 11.04).

jsj
  • What do you need to do with unicode strings? If you just need to store them and spit them back out, you don't need any library whatsoever. ICU is useful for things like changing normalization forms, capitalization, line breaking, etc. – R.. GitHub STOP HELPING ICE Sep 03 '11 at 16:50
  • @R I need to take arbitrary input from a file or the console and insert characters such as ā á ǎ à ō ó ǒ ò ē é ě è ī í ǐ ì ū ú ǔ ù ǖ ǘ ǚ ǜ ü Ā Á Ǎ À Ō Ó Ǒ Ò Ē É Ě È, along the lines of someString[5] = "ǜ"; – jsj Sep 03 '11 at 18:27
  • This probably doesn't need ICU, just some simple `memmove` and `memcpy`... – R.. GitHub STOP HELPING ICE Sep 03 '11 at 18:37
  • @R wouldn't I end up with mixed encoding? How would printf (for example) know not to print "ní hǎo" as "n/123/123/123 h/123/123/o"? printf("Ā Á Ǎ"); works fine, but I have never managed to get mixed characters to print. – jsj Sep 04 '11 at 03:38
  • `printf` works with whatever bytes you feed it; it does not care about character encoding except in the format string, which is required to be valid in the locale's encoding. In any case you should be using a locale with UTF-8 as the encoding... This is 2011 not 1991. – R.. GitHub STOP HELPING ICE Sep 04 '11 at 03:43
  • @R lol I've been compiling with -ansi so I guess I haven't even made it to 1991. But anyway, I think I need to learn more about the basics of this... are you saying that with a UTF-8 locale (as I have, of course) I could declare char ni = "你"; I was under the naive impression that a char was one byte. And further, ("你好吗?"[0] == "你")? – jsj Sep 04 '11 at 03:51
  • A `char` is one byte, yes (that's the definition of byte in C), but an array of `char`s can hold all the bytes of one or more UTF-8 characters. If you always work with strings instead of individual characters, the transition to unicode will come **a lot** easier! This advice stands regardless of what encoding, libraries, etc. you're using. – R.. GitHub STOP HELPING ICE Sep 04 '11 at 03:53

1 Answer

ICU Reference Documentation

Tutorial

Sadique