4

I'm working with SQL Server database; database instance encoding is "SQL_Latin1_General_CP1_CI_AS".

The following code:

UPDATE ...
SET field = CHAR(136)
WHERE...

puts in field the following symbol: ˆ

But! In Latin1 code table 127-159 codes are just not defined! How comes it inserts this symbol?

And what's more confusing, when I read this field value to string variable in C# and convert it to char, I get the code 710 instead of 136.

I tried to use encoding conversion:

var latin1Encoding = Encoding.GetEncoding("iso-8859-1");
var test = latin1Encoding.GetBytes(field); // field is a string read from db

But in this case I get code 94 which is ^ (looks similar, but it's not the same, and I need exactly same).

Otiel
  • 18,404
  • 16
  • 78
  • 126
Nat
  • 467
  • 1
  • 6
  • 12
  • 2
    If you want things to be exactly the same, I think you should use the binary collation. Conversion across different encoding is always lossy. – Todd Li Oct 25 '11 at 16:47

2 Answers2

5

But! In Latin1 code table 127-159 codes are just not defined!

In ISO-8859-1, character 136 is defined, but it's a seldom-used and largely pointless control character.

But SQL_Latin1_General_CP1_CI_AS, despite the name “Latin1”, is not ISO-8859-1. It's the Western European ANSI code page, 1252, which is similar to ISO-8859-1, but has a bunch of different symbols in the range 128–159.

Character 136 in code page 1252 is U+02C6 MODIFIER LETTER CIRCUMFLEX ACCENT, ˆ; decimal code point number 710.

in this case I get code 94 which is ^

Yes, you're asking for a conversion to ISO-8859-1, which doesn't include the character U+02C6, so you get the “best fit fallback”, which is a character that looks a bit like that one you wanted. This is usually a bad thing; many of the fallbacks chosen are highly questionable. You can change this behaviour using an EncoderFallback, for example to throw an exception instead.

bobince
  • 528,062
  • 107
  • 651
  • 834
0

Okay, there's several conversion taking place here.

  1. When you use Char(136) the number is an ASCII code, but since the number 136 is outside the standard ASCII set the character you get is the one defined by Windows-1252. That character is the circumflex.
  2. In addition to defining the encoding of non-unicode columns, the collation also establishes some rules for translating between non-unicode characters and unicode ones when attempting to store the non-unicode character in a unicode field. If no conversion is defined you'll tend to get a ?, but in this case you get the character with the unicode code-point U+02C6. The important thing to appreciate is that the collation establishes an equivalence between the characters because it was decided that they are similar/equivalent. It has nothing to do with the actual values.
  3. Finally, you used the iso-8859-1 encoding to get the numeric code of the circumflex in that encoding which is 94.
Steve Rowbotham
  • 2,868
  • 17
  • 18