
My question relates to databases (and in particular SQL Server): the official guide mentions that with "NVARCHAR/NCHAR", "2 bytes of storage per character" is used and "if a surrogate pair is needed, a character will require 4 bytes of storage." How are 4 bytes used when a surrogate pair is needed? How is that "need" communicated to SQL Server, and how would it know? I'm just not sure how this is going to work out: when I was programming, I'd explicitly define something as UTF-8, UTF-16, or UTF-32. It seems like SQL Server only accepts UTF-16 and will somehow use a surrogate pair when needed. Could someone please explain to me how this is supposed to work? Also, I'd really appreciate sources and references so I could study it more.

I tried reading about surrogate pairs, and there is quite literally nothing out there except some sources that just touch the surface and explain that "a surrogate pair is just a mechanism for representing UTF-32 characters using two UTF-16 code units".

Thank you very much and sorry about the lengthy question.

LearnByReading

1 Answer


Okay, sometimes it's best to do your own research and find the answer (even though that may take many hours over many days). Anyway, I found the answer to my question.

To put it simply: UCS-2, the predecessor of UTF-16, was a FIXED-LENGTH encoding. This means that ALL characters in UCS-2 take up exactly 2 bytes. UTF-16, introduced after UCS-2, is in contrast a variable-length encoding. This means that UTF-16, through surrogate pairing, allows characters beyond the 16-bit range to be encoded in 32 bits (two 16-bit code units) instead of 16. How is this done? There exists a range IN THE UTF-16 encoding that is reserved for pairing: the code units 0xD800 through 0xDFFF (2,048 values, split into 1,024 "high" surrogates and 1,024 "low" surrogates). Any code unit in this range is automatically assumed to be waiting for its pair: a high surrogate must be followed by a low surrogate, and the two together encode a single supplementary character.
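The pairing itself is just arithmetic on the code point. Here is a minimal Python sketch of the algorithm from the Unicode standard (the function name `to_surrogate_pair` is my own, for illustration):

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a supplementary code point (U+10000..U+10FFFF) into a
    UTF-16 high/low surrogate pair."""
    assert 0x10000 <= code_point <= 0x10FFFF
    v = code_point - 0x10000           # leaves a 20-bit value
    high = 0xD800 + (v >> 10)          # top 10 bits -> high (lead) surrogate
    low = 0xDC00 + (v & 0x3FF)         # bottom 10 bits -> low (trail) surrogate
    return high, low

# U+1F600 (an emoji outside the BMP) becomes the pair 0xD83D, 0xDE00
high, low = to_surrogate_pair(0x1F600)
print(hex(high), hex(low))             # 0xd83d 0xde00
```

Decoding goes the other way: on seeing a code unit in 0xD800-0xDBFF, a UTF-16 reader consumes the next code unit and recombines the two 10-bit halves into one code point.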

So, at this juncture, you may ask: "What happens if I have a UCS-2 encoding and my program sees a code unit in that reserved range?" The answer is simply "nothing". That range is not assigned to characters in UCS-2, and that is actually the only difference between UTF-16 and UCS-2: a UCS-2-bound program will simply not recognize UTF-16's supplementary characters; it will see two unknown 16-bit values instead of one character.
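This is also why no separate "flag" needs to be communicated to SQL Server: the code point alone determines whether a surrogate pair (and thus 4 bytes) is required. A quick Python sketch, assuming little-endian UTF-16 without a BOM (the layout SQL Server uses for N-typed columns, to my understanding):

```python
# BMP characters (U+0000..U+FFFF, minus the surrogate range) take 2 bytes
# in UTF-16; supplementary characters (U+10000 and up) take 4.
for ch in ("A", "\u00e9", "\u4f60", "\U0001F600"):
    encoded = ch.encode("utf-16-le")
    print(f"U+{ord(ch):05X} -> {len(encoded)} bytes")
```

The first three characters print as 2 bytes each; the emoji prints as 4, which is exactly the "2 bytes per character, 4 if a surrogate pair is needed" rule from the guide.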
