So here's the deal: I'm trying to open a file (as bytes), convert it to a string so I can mess with some metadata in the header, convert it back to bytes, and save it. The problem I'm running into right now is with this code. When I compare the bytes that have been converted to a string and back (but not otherwise modified) to the original byte array, they're unequal. How can I make this work?

public static byte[] StringToByteArray(string str)
{
    UTF8Encoding encoding = new UTF8Encoding();
    return encoding.GetBytes(str);
}

public string ByteArrayToString(byte[] input)
{
    UTF8Encoding enc = new UTF8Encoding();
    string str = enc.GetString(input);
    return str;
}

Here's how I'm comparing them.

byte[] fileData = GetBinaryData(filesindir[0], Convert.ToInt32(fi.Length));
string fileDataString = ByteArrayToString(fileData);
byte[] recapturedBytes = StringToByteArray(fileDataString);
Response.Write((fileData == recapturedBytes));

I'm sure it's UTF-8, using:

StreamReader sr = new StreamReader(filesindir[0]);
Response.Write(sr.CurrentEncoding);

which returns "System.Text.UTF8Encoding".

Peter Mortensen
Brian Hicks
  • Are you sure it's UTF-8 to start with? – Mitch Wheat Sep 14 '09 at 15:34
  • I'm unsure. How would I tell if it is or not? – Brian Hicks Sep 14 '09 at 15:46
  • What do you mean, it's unequal? Your strings are unequal? You don't get the same string result? – Khan Sep 14 '09 at 16:05
  • Are you basically looking for a Hex Editor? – Neil N Sep 15 '09 at 21:34
  • A few comments: (1) Instead of `new UTF8Encoding()`, use `Encoding.UTF8` so you don't have to instantiate a new object every time; (2) Instead of your `GetBinaryData` followed by `ByteArrayToString`, you can just use `File.ReadAllText()`; (3) The code you posted at the bottom (with `StreamReader`) doesn't tell you anything about the contents of the file. It will always say `UTF8Encoding` unless you specify a different encoding in the `StreamReader` constructor. – Timwi Jul 30 '10 at 22:45

4 Answers

16

Try the static properties on the Encoding class (such as Encoding.UTF8), which provide ready-made instances of the various encodings. You shouldn't need to instantiate an encoding yourself just to convert to or from a byte array. How are you comparing the strings in code?

Edit

You're comparing arrays, not strings. They're unequal because they refer to two different arrays; using the == operator will only compare their references, not their values. You'll need to inspect each element of the array in order to determine if they are equivalent.

public bool CompareByteArrays(byte[] lValue, byte[] rValue)
{
    if(lValue == rValue) return true; // referentially equal
    if(lValue == null || rValue == null) return false; // one is null, the other is not
    if(lValue.Length != rValue.Length) return false; // different lengths

    for(int i = 0; i < lValue.Length; i++)
    {
        if(lValue[i] != rValue[i]) return false;
    }

    return true;
}
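
For comparison, here is a shorter sketch of the same value comparison using LINQ's `Enumerable.SequenceEqual`. The class and method names (`ByteArrayComparison`, `AreEqual`) are made up for illustration and are not part of the original answer; it assumes a project that can reference `System.Linq`.

```csharp
using System;
using System.Linq;

public static class ByteArrayComparison
{
    // Hypothetical helper: compares two byte arrays by value rather than by reference.
    public static bool AreEqual(byte[] left, byte[] right)
    {
        if (ReferenceEquals(left, right)) return true;   // same array object, or both null
        if (left == null || right == null) return false; // exactly one is null
        return left.SequenceEqual(right);                // length and element-by-element check
    }
}
```

With this in place, the comparison in the question could be written as `Response.Write(ByteArrayComparison.AreEqual(fileData, recapturedBytes));`.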
Adam Robinson
  • I've edited the question to show how... the code doesn't show up right in the comment! – Brian Hicks Sep 14 '09 at 15:45
  • I tried this, and it reports that the arrays are not the same length. The problem must be somewhere else. – Brian Hicks Sep 14 '09 at 16:00
  • Take a look at the documentation for the UTF8 encoding. There is an option as to whether or not to specify the preamble. If you're finding that your generated byte array is longer than the original, then that is likely your issue. Again, you need to make sure that UTF8 is, in fact, the right encoding. As to how you can tell, you would have to ask whoever is supplying you with the data. – Adam Robinson Sep 14 '09 at 16:05
7

When you have raw bytes (8-bit possibly-not-printable characters) and want to manipulate them as a .NET string and turn them back into bytes, you can do so by using

Encoding.GetEncoding(1252)

instead of UTF8Encoding. That encoding works to take any 8-bit value and convert it to a .NET 16-bit char, and back again, without losing any information.
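
A minimal sketch of this kind of lossless round trip follows. Note the swap: it uses ISO-8859-1 (Latin-1) rather than code page 1252, because Latin-1 maps each byte value 0-255 to the char with the same number and is, as far as I know, available without extra setup on both .NET Framework and .NET Core (where code page 1252 typically requires registering `CodePagesEncodingProvider` first). The class and method names are hypothetical.

```csharp
using System;
using System.Linq;
using System.Text;

public static class Latin1RoundTrip
{
    // Assumption: ISO-8859-1 is used here in place of code page 1252.
    // It maps byte value n to the char U+00nn for every n in 0..255.
    private static readonly Encoding Latin1 = Encoding.GetEncoding("iso-8859-1");

    public static string BytesToString(byte[] data)
    {
        return Latin1.GetString(data);
    }

    // Chars above U+00FF cannot be represented in Latin-1, so only edit the
    // string with characters in the 0..255 range if the result must round-trip.
    public static byte[] StringToBytes(string text)
    {
        return Latin1.GetBytes(text);
    }

    // Checks that every possible byte value survives bytes -> string -> bytes.
    public static bool RoundTripsAllByteValues()
    {
        byte[] all = Enumerable.Range(0, 256).Select(i => (byte)i).ToArray();
        byte[] back = StringToBytes(BytesToString(all));
        return all.SequenceEqual(back);
    }
}
```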

In the specific case you describe above, with a binary file, you will not be able to "mess with metadata in the header" and have things work correctly unless the length of the data you mess with is unchanged. For example, if the header contains

{any}{any}ABC{any}{any}

and you want to change ABC to DEF, that should work as you'd like. But if you want to change ABC to WXYZ, you will have to write over the byte that follows "C" or you will (in essence) move everything one byte further to the right. In a typical binary file, that will mess things up greatly.

If the bytes after "ABC" are spaces or null characters, there's a better chance that writing larger replacement data will not cause trouble -- but you still cannot just replace ABC with WXYZ in the .NET string, making it longer -- you would have to replace ABC{whatever_follows_it} with WXYZ. Given that, you might find that it's easier just to leave the data as bytes and write the replacement data one byte at a time.
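
As a rough sketch of the leave-it-as-bytes approach, the following overwrites a byte sequence in place and refuses any replacement of a different length, so nothing after it shifts. `HeaderPatcher`, `IndexOf`, and `ReplaceSameLength` are hypothetical names, and the simple linear search is only meant for small headers.

```csharp
using System;

public static class HeaderPatcher
{
    // Returns the index of the first occurrence of pattern in data, or -1 if absent.
    public static int IndexOf(byte[] data, byte[] pattern)
    {
        if (pattern.Length == 0) return -1;
        for (int i = 0; i <= data.Length - pattern.Length; i++)
        {
            int j = 0;
            while (j < pattern.Length && data[i + j] == pattern[j]) j++;
            if (j == pattern.Length) return i;
        }
        return -1;
    }

    // Overwrites the first occurrence of oldValue with newValue, in place.
    // Returns false if oldValue is not found; the array length never changes.
    public static bool ReplaceSameLength(byte[] data, byte[] oldValue, byte[] newValue)
    {
        if (oldValue.Length != newValue.Length)
            throw new ArgumentException("Replacement must be the same length as the original.");

        int index = IndexOf(data, oldValue);
        if (index < 0) return false;

        Buffer.BlockCopy(newValue, 0, data, index, newValue.Length);
        return true;
    }
}
```

For the ABC to DEF case above, the call would look like `HeaderPatcher.ReplaceSameLength(fileData, Encoding.ASCII.GetBytes("ABC"), Encoding.ASCII.GetBytes("DEF"))` (with `using System.Text;`), assuming the header text is ASCII.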

Peter Mortensen
J.Merrill
  • If one has an array of bytes and wishes to replace all occurrences of a particular sequence with another sequence of a different length (e.g. replace all occurrences of {0x7D,0x5E} with {0x7E}), would converting to string, using `String.Replace`, and then converting back be a reasonable approach? Would the aforementioned encoding replace each byte value 0-255 with its corresponding same-numbered code [the fact that the encoding is lossless wouldn't by itself imply that]? – supercat Oct 17 '12 at 00:33
  • @supercat -- yes that approach (provided you use 1252 encoding) would work. But you'd still not be able to do that with most binary file formats for the reasons mentioned in my message. – J.Merrill Oct 17 '12 at 18:48
  • If one is using position-sensitive formats one would obviously have to ensure that things that aren't supposed to move, don't. Even then, there would be cases where `String.Replace` would seem useful if the "original" and "replacement" strings are the same length. – supercat Oct 17 '12 at 19:25
  • Thanks J.Merrill. It works perfectly. I was looking for exactly this. – Dips Sep 13 '13 at 02:04
  • Thanks J. Merrill -- I had this problem and your answer was exactly what I needed. – Somik Raha Sep 14 '17 at 19:37
5

Because .NET strings are Unicode (UTF-16) strings, you can no longer treat a string as a plain byte buffer the way people did in C. In most cases, you should not even attempt to go back and forth between a string and a byte array unless the contents are actually text.

I have to make this point clear: in .NET, if the byte[] data is not text, do not attempt to convert it to a string, except with Base64 encoding when binary data must travel over a text channel. Treating arbitrary bytes as text is a widely held misunderstanding among people who work in .NET.
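
To illustrate the difference, here is a small sketch (class and method names are hypothetical) that sends the same bytes through Base64 and through UTF-8. Base64 returns the original bytes; UTF-8 replaces any byte sequence that is not valid UTF-8 with the replacement character U+FFFD, so the re-encoded array can differ from the original in both content and length.

```csharp
using System;
using System.Linq;
using System.Text;

public static class BinaryAsTextDemo
{
    // Base64 round trip: defined for arbitrary bytes, so nothing is lost.
    public static byte[] ViaBase64(byte[] data)
    {
        string text = Convert.ToBase64String(data);
        return Convert.FromBase64String(text);
    }

    // UTF-8 round trip: only safe when the bytes really are UTF-8 text.
    // Invalid sequences decode to U+FFFD and the original bytes are gone.
    public static byte[] ViaUtf8(byte[] data)
    {
        string text = Encoding.UTF8.GetString(data);
        return Encoding.UTF8.GetBytes(text);
    }
}
```

For example, the bytes `{ 0xFF, 0xFE, 0x00, 0x41 }` come back unchanged from `ViaBase64` but not from `ViaUtf8`, because 0xFF and 0xFE never appear in valid UTF-8.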

Peter Mortensen
Sam Harwell
  • String<->byte[] conversions should generally be done through one of the System.Text.Encoding classes, not the BitConverter class. BitConverter.ToString converts a byte array into a hexadecimal string representation of the numbers; it does **not** convert a byte array into a string. – Adam Robinson Sep 14 '09 at 16:07
  • Heh, I should have removed that line once I knew it wasn't the point of my post. – Sam Harwell Sep 14 '09 at 17:28
3

Your problem would appear to be the way you're comparing the array of bytes:

Response.Write((fileData == recapturedBytes));

This will always return false, since == on arrays compares references (whether both variables point to the same array object), not the values the arrays contain. Compare the string data, or use a method that compares the byte arrays element by element. You could also do this instead:

Response.Write(Convert.ToBase64String(fileData) == Convert.ToBase64String(recapturedBytes));
Peter Mortensen
csharptest.net