How to count length of string in bytes with unicode characters of more than 1 byte?

Question

Because a string in C can contain Unicode characters that take several bytes, where one of those bytes may be a terminating \0 character, I don't think strlen works well for counting how many bytes there are in such a string.

How do I count the length in bytes of such a string properly? I'm not the one allocating the memory for it; I use the member char d_name[256] of struct dirent from the header dirent.h. Is there any way to see how long the names are besides just copying the entire 256 bytes? What if I couldn't simply copy the 256 bytes?

As I said in [your previous question](http://stackoverflow.com/a/27087022/1009479), this is not a problem with UTF-8, so what encoding are you using? — Yu Hao, Nov 23 '14 at 09:19
@YuHao I think I made it somewhat clearer here, when I said where I get the string from. — Horse SMith, Nov 23 '14 at 09:34
You're misunderstanding Unicode and Unicode encodings like UTF-8, UTF-16 and UTF-32. Read [Joel on Software's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)](http://www.joelonsoftware.com/articles/Unicode.html) and [Unicode, UTF-8 and character encodings: What every developer should know](http://www.teknically-speaking.com/2014/02/unicode-utf-8-and-character-encodings_23.html). There are no "Unicode strings", only strings encoded in some Unicode encoding — phuclv, Nov 23 '14 at 09:48

score 3 · Accepted Answer · answered Nov 23 '14 at 09:26

What do you mean by Unicode? If it's UTF-8 (dirent.h is part of the POSIX API, so UTF-8 is the likely encoding), the string can't contain a '\0' byte in the middle: every byte of a multi-byte UTF-8 sequence has its high bit set, so none of them is zero. strlen will do exactly what you need. If you are using some non-standard version of dirent (maybe some strange port for Windows) with UTF-16, you can use the appropriate wide-character string functions instead.
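
To illustrate, here is a minimal sketch, assuming a POSIX system and UTF-8 encoded names. The helper names (utf8_byte_length, utf8_code_points, print_name_lengths) are made up for this example and are not part of dirent.h. The first shows that strlen already gives the byte count, the second counts code points in case that is what you were after, and the third applies strlen to d_name while iterating over a directory.

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: byte length of a null-terminated UTF-8 string.
 * strlen counts bytes up to the first 0x00 byte, and a multi-byte UTF-8
 * sequence never contains 0x00, so this is simply strlen. */
size_t utf8_byte_length(const char *s)
{
    return strlen(s);
}

/* Hypothetical helper: number of code points in a valid UTF-8 string.
 * Continuation bytes have the bit pattern 10xxxxxx, so count every byte
 * that is NOT a continuation byte. */
size_t utf8_code_points(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    }
    return n;
}

/* Hypothetical helper: print every entry name in `path` with its length
 * in bytes. Returns the number of entries seen, or -1 if the directory
 * could not be opened. */
long print_name_lengths(const char *path)
{
    DIR *dir = opendir(path);
    if (dir == NULL)
        return -1;

    long count = 0;
    struct dirent *entry;
    while ((entry = readdir(dir)) != NULL) {
        /* d_name is null-terminated, so there is no need to copy or
         * scan the whole array. */
        printf("%s: %zu bytes\n", entry->d_name, strlen(entry->d_name));
        ++count;
    }
    closedir(dir);
    return count;
}
```

Calling print_name_lengths(".") from main lists the current directory. A name such as "é" (bytes 0xC3 0xA9) is reported as 2 bytes even though it is a single character.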
