0

I guess I'm missing something fundamental but I'm really confused by this one and searching has failed to find me anything.

I have the following...

byte[] bytes1;
string string1;
byte[] bytes2;

Then I do the following

bytes1 = new byte[] { 64, 55, 121, 54, 36, 72, 101, 118, 38, 40, 100, 114, 33, 110, 85, 94, 112, 80, 163, 36, 84, 103, 58, 126 };
string1 = System.Text.Encoding.UTF7.GetString(bytes1);
bytes2 = System.Text.Encoding.UTF7.GetBytes(string1);

bytes2 ends up being 54 bytes long instead of 24, and the bytes are completely different.

Now of course this code is pointless on its own, but I put it in while diagnosing why the bytes I end up with after going through Encoding.UTF7 are not the bytes I'm expecting. I've narrowed it down to this round trip being the reason my code is not giving the expected results.

Now I'm confused. I know that if I don't specify an encoding, the result of GetBytes on a string can't be relied on to be any particular set of bytes, but I am specifying an encoding and I'm still getting this difference.
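To make the symptom concrete, here is a minimal console sketch of the round trip described above (the byte values are the ones from my snippet; the final decode-back comparison is an extra diagnostic, not part of my original code):

// Minimal repro sketch (assumes a plain console app).
byte[] bytes1 = new byte[] { 64, 55, 121, 54, 36, 72, 101, 118, 38, 40, 100, 114,
                             33, 110, 85, 94, 112, 80, 163, 36, 84, 103, 58, 126 };

string string1 = System.Text.Encoding.UTF7.GetString(bytes1);
byte[] bytes2 = System.Text.Encoding.UTF7.GetBytes(string1);

System.Console.WriteLine(bytes1.Length);   // 24
System.Console.WriteLine(bytes2.Length);   // 54
// The text survives the round trip even though the bytes do not:
System.Console.WriteLine(string1 == System.Text.Encoding.UTF7.GetString(bytes2));   // True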

Can anyone enlighten me to what I'm missing?

EDIT: Conclusion is that it's not UTF7. The original byte array is being written to a varbinary in a database by an application I'm programming in a high level language. I have no control of how the original strings are being encoded to varbinaries in that language. I'm trying to read them and handle them in a small C# add-on to the main app which is where I hit this problem. Other encodings I've tried also don't give the right results.

RosieC
  • Whatever those byte values might represent, it is not properly encoded utf7. Garbage in, garbage out. – Hans Passant Nov 10 '14 at 12:50
  • The bytes are what is generated by an application that is written in a high level language that does not have the option to choose encoding when it writes to the varbinary field in MSSQL. – RosieC Nov 10 '14 at 13:00
  • The only encoding that I have found that correctly represents the string (when using GetString) that application is expecting me to read is UTF7. Is the problem that these characters are not valid in UTF7, if so why does GetString show them correctly? – RosieC Nov 10 '14 at 13:01
  • If you found this in a dbase then you can assume with 99.9% confidence that it is *not* utf7. There's a programmer somewhere that can give you an exact answer. You won't find him here, you'll have to pick up the phone. – Hans Passant Nov 10 '14 at 13:16
  • Unfortunately I'm the programmer of the other application. However the high level language does not give a choice of encoding :( I guess I need to try and get on to support of the language but I really bet they can't tell me either (from past experience). – RosieC Nov 10 '14 at 13:23

3 Answers

2

What you're seeing is two different ways of encoding the same text in UTF-7.

Your original text is:

@7y6$Hev&(dr!nU^pP£$Tg:~

The ASCII version of bytes2 is

+AEA-7y6+ACQ-Hev+ACY-(dr+ACE-nU+AF4-pP+AKMAJA-Tg:+AH4-

In other words, it's encoding everything other than A-Z, a-z, 0-9 as +A...-. That's unnecessary, but I suspect it's valid.

From the UTF-7 wikipedia entry:

Some characters can be represented directly as single ASCII bytes. The first group is known as "direct characters" and contains 62 alphanumeric characters and 9 symbols: ' ( ) , - . / : ?. The direct characters are safe to include literally. The other main group, known as "optional direct characters", contains all other printable characters in the range U+0020–U+007E except ~ \ + and space. Using the optional direct characters reduces size and enhances human readability but also increases the chance of breakage by things like badly designed mail gateways and may require extra escaping when used in encoded words for header fields.
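A quick way to check this (a sketch, not part of the original answer; it reuses the byte values from the question and the ASCII text above): decode both representations with UTF7 and compare the results.

// Both representations should decode to the same text under UTF-7.
byte[] original = { 64, 55, 121, 54, 36, 72, 101, 118, 38, 40, 100, 114,
                    33, 110, 85, 94, 112, 80, 163, 36, 84, 103, 58, 126 };
byte[] reencoded = System.Text.Encoding.ASCII.GetBytes(
    "+AEA-7y6+ACQ-Hev+ACY-(dr+ACE-nU+AF4-pP+AKMAJA-Tg:+AH4-");

string a = System.Text.Encoding.UTF7.GetString(original);
string b = System.Text.Encoding.UTF7.GetString(reencoded);

System.Console.WriteLine(a);        // @7y6$Hev&(dr!nU^pP£$Tg:~
System.Console.WriteLine(a == b);   // True - same text, two different byte representations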

Jon Skeet
  • I'm not sure I understand. The text is as you say and is written into a varbinary in MSSQL by an application written in a high level language which unfortunately does not give choice of encoding. Surely if I'm specifying UTF7 in both the conversion of that to a string, and the conversion of the string to bytes I should get the same result. I'm not sure why ASCII comes in to it when I've specified UTF7. UTF7 is the only encoding that gives the right string in GetString(). – RosieC Nov 10 '14 at 13:06
  • I should add that the string you've put as the original text is correct according to the example I'm getting from the other application. UTF7 is the only encoding that gives exactly that from the bytes I'm reading. – RosieC Nov 10 '14 at 13:07
  • @RosieC: Imagine you're trying to store some text within an XML element. Both `I say "Hello"` and `I say &quot;Hello&quot;` are valid representations of the same element - but they're clearly not the same representation. – Jon Skeet Nov 10 '14 at 14:30
  • Thanks Jon, I think I understand now. The fact that UTF7.GetString gave the expected string completely threw me off course from the fact that my bytes were not correctly UTF7. I've now managed to force the other application to write them as Unicode and as expected I'm getting much better results with that. – RosieC Nov 10 '14 at 15:28
2

UTF-7 (7-bit Unicode Transformation Format) is a variable-length character encoding that was proposed for representing Unicode text using a stream of ASCII characters. (C) Wikipedia

Your byte array contains sequences that are not valid UTF-7. For example, the value 163 cannot be encoded in 7 bits.
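A quick check like the following (a sketch, not from the original answer; it uses the question's byte values) flags any byte that can never appear in valid UTF-7 output:

// Flag bytes that cannot occur in valid UTF-7 (anything above 0x7F).
byte[] bytes1 = { 64, 55, 121, 54, 36, 72, 101, 118, 38, 40, 100, 114,
                  33, 110, 85, 94, 112, 80, 163, 36, 84, 103, 58, 126 };

for (int i = 0; i < bytes1.Length; i++)
{
    if (bytes1[i] > 0x7F)
    {
        System.Console.WriteLine($"Byte {i} has value {bytes1[i]}: not valid UTF-7");
    }
}
// Prints: Byte 18 has value 163: not valid UTF-7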

Evgeniy Mironov
  • Thanks, coming to the conclusion it isn't UTF7. Doesn't help me solve my original problem but stops me chasing down dead ends at least! – RosieC Nov 10 '14 at 13:53
  • For solving problem: var sourceBytes = new byte[]{ 64, 55, 121, 54, 36, 72, 101, 118, 38, 40, 100, 114, 33, 110, 85, 94, 112, 36, 84, 103, 58, 126 }; var unicodeString = System.Text.Encoding.UTF7.GetString(sourceBytes); var bytes = System.Text.Encoding.Unicode.GetBytes(unicodeString).Where((c,i) => i%2==0).ToArray(); Not good code, but work :) – Evgeniy Mironov Nov 10 '14 at 13:58
  • Thanks Evgeniy. I will keep that in mind. I hope to find from the support of the high level language that's writing the bytes in the first place if they can tell me the encoding, but I will come back to this if I get nowhere with that. – RosieC Nov 10 '14 at 14:09
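For reference, a tidied-up version of the workaround sketched in the comment above (assumptions: the question's full 24-byte array, and that every character in the decoded string fits in a single byte, which holds for this data):

using System.Linq;

// Lenient UTF-7 decode gives the expected text even though byte 163 is not valid UTF-7.
var sourceBytes = new byte[] { 64, 55, 121, 54, 36, 72, 101, 118, 38, 40, 100, 114,
                               33, 110, 85, 94, 112, 80, 163, 36, 84, 103, 58, 126 };
var text = System.Text.Encoding.UTF7.GetString(sourceBytes);

// Encoding.Unicode is UTF-16LE, so the low byte of each character comes first;
// keeping the even-indexed bytes recovers one byte per character.
var roundTripped = System.Text.Encoding.Unicode.GetBytes(text)
    .Where((b, i) => i % 2 == 0)
    .ToArray();

System.Console.WriteLine(roundTripped.SequenceEqual(sourceBytes));   // True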
0

It wasn't UTF7, and I had made errors in coming to the conclusion that it was in the first place. Thanks to everyone who advised this.

I have spoken to someone who works for the people who write the high level language the main part of the application is programmed in (and happens to be in our building today).

He couldn't tell me what encoding it was using between the entered string and the varbinary, but he was able to tell me that there is a way to force Unicode. As this is a new option in both applications, I know that no production data has been written the old way, so I will update both sides to use Unicode encoding for this process. It all seems to be working so far.

RosieC