6

I'm curious how strlen counts Unicode characters that span multiple bytes in C.

Does it count each byte, or each character (which can consist of several bytes), until the first '\0'?

Horse SMith
  • strlen works with bytes. In encodings such as UTF-16 and UTF-32, many characters contain a 0x00 byte, so 1) strlen is useless for strings in those encodings, and 2) there are functions for working with multibyte and wide characters instead: you might want to use one of them, such as wcsnlen() (declared in wchar.h) or, on Microsoft's C runtime, _mbstrnlen(). – user3629249 Nov 23 '14 at 10:39

2 Answers

7

strlen() counts the number of bytes until a \0 is encountered. This holds true for all strings.

For Unicode, note that the return value of strlen() can be cut short when an encoded character other than the null terminator contains a \0 byte. With UTF-8 this is not a problem, because no character other than U+0000 is encoded with a \0 byte, but the same does not hold for other encodings such as UTF-16 or UTF-32.

Yu Hao
  • Are you sure there can be conflicts in unicode strings with the \0 character? Will make a new and related question! – Horse SMith Nov 23 '14 at 08:35
  • 5
    It depends on the code set. If you're using UTF-16, then a character such as U+00FF (ÿ) will consist of a null byte and a 0xFF byte (in one or the other order, depending on endianness: UTF-16LE or UTF-16BE), and the null byte will stop `strlen()` in its tracks. With UTF-32, the problem occurs with every Unicode character since the maximum value is U+10FFFF, which means there's at least one zero byte in every possible 4-byte Unicode value. UTF-8 carefully avoids this problem; the only time a zero byte shows up is when the character is U+0000. – Jonathan Leffler Nov 23 '14 at 08:40
3

strlen only applies to strings, that is, null-terminated arrays of char. All multibyte encodings that are permitted inside strings have the property that they contain no internal null bytes, so strlen and other str functions such as strcat work fine.

If by "unicode" you mean arrays of wchar_t, then these can contain null bytes, but again this is not a problem: none of the wchar_t elements before the terminator is itself null. You shouldn't apply the str functions to such arrays, though, because they are not defined for them.

Jens Gustedt