26

The wording of the documentation varies between VS 2008 and VS 2010:


VS 2008 Documentation

Internally, the text is stored as a readonly collection of Char objects, each of which represents one Unicode character encoded in UTF-16. ... The length of a string represents the number of characters regardless of whether the characters are formed from Unicode surrogate pairs or not. To access the individual Unicode code points in a string, use the StringInfo object. - http://msdn.microsoft.com/en-us/library/ms228362%28v=vs.90%29.aspx


VS 2010 Documentation

Internally, the text is stored as a sequential read-only collection of Char objects. ... The Length property of a string represents the number of Char objects it contains, not the number of Unicode characters. To access the individual Unicode code points in a string, use the StringInfo object. - http://msdn.microsoft.com/en-us/library/ms228362%28v=VS.100%29.aspx

The language used in both cases doesn't clearly differentiate between "character", "Unicode character", "Char class", "Unicode surrogate pair", and "Unicode code point".

The language in the VS2008 documentation stating that "the length of a string represents the number of characters regardless of whether the characters are formed from Unicode surrogate pairs or not" seems to define "character" as an object that may be the result of a Unicode surrogate pair, which suggests that it may represent a 4-byte sequence rather than a 2-byte sequence. It also specifically states at the beginning that a "Char" object is encoded in UTF-16, which suggests that it could represent a surrogate pair (being 4 bytes instead of 2). I'm fairly certain that is wrong, though.

The VS2010 documentation is a little more precise. It draws a distinction between "char" and "Unicode character", but not between "Unicode character" and "Unicode code point". If a code point refers to half a surrogate pair, and a "Unicode character" represents a full pair, then the "Char" class is named incorrectly, and does not refer to a "Unicode character" at all (which they state it does not), and it's really a Unicode code point.

So are both of the following statements true? (Yes, I think.)

  1. String.Length represents the Unicode code-point length, and
  2. String.Length represents neither the Unicode character length nor what we would consider to be a true character length (number of characters that would be displayed), but rather the number of "Char" objects, which each represent a Unicode code point (not a Unicode character).
John Saunders
Triynko

3 Answers

32

String.Length does not account for surrogate pairs; however, the StringInfo.LengthInTextElements property does.

StringInfo.SubstringByTextElements is similar to String.Substring, but it operates on "text elements", such as surrogate pairs and combining character sequences, as well as normal characters. The functionality of both of these members is based on the StringInfo.ParseCombiningCharacters method, which extracts the starting index of each text element and stores them in a private array.

"The .NET Framework defines a text element as a unit of text that is displayed as a single character, that is, a grapheme. A text element can be a base character, a surrogate pair, or a combining character sequence." - http://msdn.microsoft.com/en-us/library/system.globalization.stringinfo.aspx
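As a rough illustration of the difference, here is a minimal sketch. The sample string and the class name `TextElementDemo` are made up for this example; only the `String` and `StringInfo` members come from the framework.

```csharp
using System;
using System.Globalization;

class TextElementDemo
{
    static void Main()
    {
        // Illustrative input: 'a', then U+1D11E (MUSICAL SYMBOL G CLEF, outside the BMP,
        // stored as a surrogate pair), then 'e' followed by U+0301 (combining acute accent).
        string s = "a\U0001D11Ee\u0301";

        // String.Length counts Char values (UTF-16 code units): 1 + 2 + 1 + 1.
        Console.WriteLine(s.Length); // 5

        // StringInfo counts text elements: 'a', the surrogate pair, and 'e' plus its accent.
        var info = new StringInfo(s);
        Console.WriteLine(info.LengthInTextElements); // 3

        // Taking the second text element returns the whole surrogate pair (two Char values).
        string clef = info.SubstringByTextElements(1, 1);
        Console.WriteLine(clef.Length); // 2

        // ParseCombiningCharacters returns one starting index per text element.
        int[] starts = StringInfo.ParseCombiningCharacters(s);
        Console.WriteLine(starts.Length); // 3
    }
}
```

Creating a `StringInfo` parses the string up front, so for a one-off count this is probably fine, but it may be worth caching the instance if the same string is queried repeatedly.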

Triynko
  • 2
    Pure genius. Should add a code sample so people know how to get it. I did `new StringInfo(str).LengthInTextElements;`. It works, but I'm not sure that's the best option. – gregsdennis Aug 23 '16 at 10:29
22

String.Length does not account for surrogate pairs; it only counts UTF-16 chars (i.e. Char values, which are always 2 bytes). A surrogate pair is counted as 2 chars.

Ana Betts
  • 22
    WOW. Do you realize what this means? All this time anyone has been meaning to write code to manipulate "CHARACTERS" (uppercase them, count them, swap them, etc.) using basic String and Char class methods, they've actually been writing code to manipulate "16-bit chunks of data"... and their code will BREAK upon encountering characters outside the Basic Multilingual Plane. It's no wonder software blows up and data gets garbled :( – Triynko Apr 13 '11 at 22:55
  • 7
    Yep. UTF16 in general was a dumb idea (all the disadvantages of UTF8 but none of the advantages), but it's historically what we're stuck with. "64K of characters should be enough for anyone!" – Ana Betts Apr 13 '11 at 23:11
  • 5
    Anytime you want to manipulate Unicode data, the accurate thing to do is convert from UTF-16 (or UTF-8 or whatever you are using) to UTF-32 first, do the mods on the now-fully-decoded Unicode codepoint(s) as needed, then convert back to UTF-16 (or whatever). – Remy Lebeau Apr 13 '11 at 23:49
  • 3
    What a mess. I like that SQL Server is bold enough to say screw everything outside the BMP since the characters are so uncommon and just call it UCS-2. Honestly, if you're not going to support characters outside the BMP, you may as well implement a custom fallback decoder (best-fit or replacement) to ensure you're only dealing with stuff in the BMP and eliminate all occurrences of surrogate pairs to ensure that kind of data doesn't pass through your application which doesn't support them. Filtering the input this way at the entry point to the application is simpler than rewriting the app. – Triynko Apr 14 '11 at 21:21
  • 1
    Isn't UTF-16 a variable length encoding? http://en.wikipedia.org/wiki/UTF-16/UCS-2 – Razor Oct 03 '11 at 07:13
  • Yes, the way I phrased things was a bit confusing - the UTF-16 "Char" data type is always 2 bytes. You're correct that a UTF-16 character can be > 1 "Char" – Ana Betts Oct 03 '11 at 07:21
1

I would consider both false. The second statement would be true if you had asked about the count of Unicode code points, but you asked about "length". The String's Length is the count of its elements, which are 16-bit words. Only when the string contains nothing but Unicode code points from the BMP (Basic Multilingual Plane) is the length equal to the number of Unicode characters/code points. If it contains code points from beyond the BMP or orphaned surrogates (high or low surrogates that do not appear as an ordered pair), the length is NOT equal to the number of characters/code points.

First of all, the String is a sequence of words: a word list, word array, or word stream. Its contents are 16-bit words, and that's it. Naming an element "char" or "wchar" is a sin with regard to Unicode characters. Because a Unicode character can have a code point greater than 0xFFFF, it cannot be stored in a type that is 16 bits wide, and calling that type char or wchar makes it even worse, because it can only ever hold code points up to 0xFFFF, which corresponds to the Unicode 1.0 standard, nowadays 20 years old. To store even the highest possible Unicode code point in a single data type, the type would need 21 bits; there is no such type, so we would use a 32-bit type. In fact there is a static method (of the Char class!) named ConvertToUtf32() that does just this: it can return a low ASCII code point or even the highest Unicode code point, which implies that the method can detect a surrogate pair at a given position within a String.
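To make the distinction between 16-bit elements and code points concrete, here is a small sketch. The `CodePoints.Count` helper and the sample string are hypothetical, written only for illustration; `char.IsSurrogatePair` and `char.ConvertToUtf32` are the framework methods it relies on.

```csharp
using System;

static class CodePoints
{
    // Hypothetical helper: counts Unicode code points instead of 16-bit Char values.
    // An orphaned surrogate is counted as one element on its own.
    public static int Count(string s)
    {
        int count = 0;
        for (int i = 0; i < s.Length; i++)
        {
            // A well-formed surrogate pair occupies two Char values but is one code point,
            // so skip the low surrogate.
            if (char.IsSurrogatePair(s, i))
                i++;
            count++;
        }
        return count;
    }

    static void Main()
    {
        // Illustrative input: 'a' followed by U+1D11E, which lies beyond the BMP.
        string s = "a\U0001D11E";

        Console.WriteLine(s.Length);  // 3 (16-bit words)
        Console.WriteLine(Count(s));  // 2 (code points)

        // ConvertToUtf32 reads the pair starting at index 1 and returns the full code point.
        Console.WriteLine(char.ConvertToUtf32(s, 1).ToString("X")); // 1D11E
    }
}
```

Note that `char.ConvertToUtf32(string, int)` throws an `ArgumentException` when the index points at an orphaned surrogate, so code handling untrusted input would likely want to check `char.IsSurrogatePair` first, as the helper above does.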

brighty
  • Basically, a "text element" can be a base character (which may or may not be represented as a surrogate pair) or a combining character sequence (each character of which may or may not be represented as a surrogate pair). http://msdn.microsoft.com/en-us/library/vstudio/8k5611at(v=vs.100).aspx "char" makes sense when it represents the chunk size of a variable-length encoding, especially in UTF16, where the rarity of its "surrogate pair" justifies equating the chunk size with a character. Calling an 8-bit chunk a "char" in UTF8, which frequently uses two or more chunks to form a character, is more of a stretch. – Triynko Aug 09 '13 at 06:16
  • Yes, a combining character using a diacritical character is "read" as one character, but each "used" character has its own codepoint. StringInfo will consider this. Still Length on a string just counts the word elements. This property should be renamed to ElementCount to avoid confusion. – brighty Sep 17 '13 at 15:44