
I have an application converted from Python 2 (where strings are essentially lists of bytes) and I'm using a string as a convenient byte buffer.

I am rewriting some of this code in the Boo language (Python-like syntax, runs on .NET) and am finding that the strings have an intrinsic encoding type, such as ASCII, UTF-8, etc. Most of the information dealing with bytes refers to arrays of bytes, which are (apparently) fixed length, making them quite awkward to work with.

I can obviously get bytes from a string, but at the risk of expanding some characters into multiple bytes, or discarding/altering bytes above 127, etc. This is fine and I fully understand the reasons for this, but what would be handy for me is either (a) an encoding that guarantees no conversion or discarding of characters so that I can use a string as a convenient byte buffer, or (b) some sort of ByteString class that gives the convenience of the string class. (Ideally the latter as it seems less of a hack.) Does either of these already exist? (Or is either trivial to implement?)

I am aware of System.IO.MemoryStream, but the prospect of creating one of those each time and then having to make a System.IO.StreamReader at the end just to get access to ReadToEnd() doesn't seem very efficient, and this is in performance-sensitive code.

(I hope nobody minds that I tagged this as C# as I felt the answers would likely apply there also, and that C# users might have a good idea of the possible solutions.)

EDIT: I've also just discovered System.Text.StringBuilder - again, is there such a thing for bytes?

Kylotan
  • You can use GetBuffer() on a MemoryStream to get it as a byte array directly, without having to read it. Is that what you were looking for? – Can Gencer Apr 21 '11 at 16:25
  • It's certainly better than the StreamReader approach! But I'd still prefer not to create MemoryStreams, plus GetBuffer isn't so great if there are unfilled bytes in the buffer, etc. – Kylotan Apr 21 '11 at 16:37
  • In .NET, strings are built up from **char** values, which are actually two bytes each. They don't have an intrinsic encoding type; they're always UTF-16. So it will be tricky to use them as a byte buffer. – Can Gencer Apr 21 '11 at 16:48

2 Answers


Use the Latin-1 encoding as described in this answer. It maps values in the range 128-255 unchanged, useful when you want to roundtrip bytes to chars.
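As a minimal sketch of that round trip (in C#, since the thread is tagged C#; the class and helper names `Latin1RoundTrip`, `ToByteString`, and `FromByteString` are made up for illustration), every byte value from 0 to 255 survives the conversion to a string and back:

```csharp
using System;
using System.Linq;
using System.Text;

public static class Latin1RoundTrip
{
    // ISO-8859-1 (Latin-1) maps each byte value 0-255 to the char with the same code point.
    private static readonly Encoding Latin1 = Encoding.GetEncoding("ISO-8859-1");

    // Hypothetical helpers, not part of the framework.
    public static string ToByteString(byte[] bytes)
    {
        return Latin1.GetString(bytes);
    }

    public static byte[] FromByteString(string s)
    {
        return Latin1.GetBytes(s);
    }

    public static void Main()
    {
        // Every possible byte value, 0 through 255.
        byte[] original = Enumerable.Range(0, 256).Select(i => (byte)i).ToArray();

        string asString = ToByteString(original);
        byte[] roundTripped = FromByteString(asString);

        Console.WriteLine(asString.Length);                       // 256: one char per byte
        Console.WriteLine(original.SequenceEqual(roundTripped));  // True
    }
}
```

Note that the string still stores two bytes per char in memory, so this trades space for the convenience of the string API.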

UPDATE

Or if you want to manipulate bytes directly, use List<byte>:

List<byte> result = ...
...
// Add a byte at the end
result.Add(b);
// Add a collection of bytes at the end
byte[] bytesToAppend = ...
result.AddRange(bytesToAppend);
// Insert a collection of bytes at any position
byte[] bytesToInsert = ...
int insertIndex = ...
result.InsertRange(insertIndex, bytesToInsert);
// Remove a range of bytes
result.RemoveRange(index, count);
... etc ...

"I've also just discovered System.Text.StringBuilder - again, is there such a thing for bytes?"

The StringBuilder class is needed because regular strings are immutable, and a List<byte> gives you everything you might expect from a "StringBuilder for bytes".
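A rough sketch of using a List<byte> as a growable buffer, including consuming data from the front, might look like this. `TakeFromFront` is a hypothetical helper, not a framework method, and removing from the front shifts the remaining elements, so it costs O(n) per call:

```csharp
using System;
using System.Collections.Generic;

public static class ByteListDemo
{
    // Hypothetical helper: removes and returns the first `count` bytes of the buffer.
    public static byte[] TakeFromFront(List<byte> buffer, int count)
    {
        byte[] head = buffer.GetRange(0, count).ToArray();
        buffer.RemoveRange(0, count); // shifts the remaining bytes down
        return head;
    }

    public static void Main()
    {
        var buffer = new List<byte>();
        buffer.AddRange(new byte[] { 1, 2, 3 });
        buffer.AddRange(new byte[] { 4, 5 });

        byte[] head = TakeFromFront(buffer, 2);

        Console.WriteLine(string.Join(",", head));   // 1,2
        Console.WriteLine(string.Join(",", buffer)); // 3,4,5
    }
}
```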

Joe
  • Yeah, it occurred to me that if I just pick an arbitrary 8-bit encoding and use it consistently, that it should work. I guess it'll still waste 1 byte per character though, so I'll see if there are any other alternatives. – Kylotan Apr 21 '11 at 17:13
  • @Kylotan - You can't pick an *arbitrary* 8-bit encoding if you want to roundtrip byte values unchanged. If you want to use a string, you have no option but to lose 1 byte per character. If you want to manipulate bytes directly, I'd suggest a List<byte> would meet the case. – Joe Apr 21 '11 at 17:59
  • Why would I lose any characters? If I encode a byte value between 0 and 255 into (for example) ISO-8859-1 and then decode back to bytes later surely I will get in exactly what I put out. But generally I don't *want* to use a string - I just want to get the convenience of the string class when working with bytes: simple appending and extraction for the most part. Or, to find something almost as simple with better performance. – Kylotan Apr 21 '11 at 18:23
  • @Kylotan - "surely I will get in exactly what I put out" - yes that's true of Latin-1 aka ISO-8859-1, but not generally true of "an arbitrary 8-bit encoding". – Joe Apr 22 '11 at 06:03
  • By arbitrary 8-bit encoding, I mean any encoding that only represents 256 characters. They shouldn't mutate any values through a encode/decode round-trip, right? – Kylotan Apr 22 '11 at 12:05
  • @Kylotan "They shouldn't mutate any values through a encode/decode round-trip, right" - no, Encoding.ASCII is a simple counterexample. – Joe Apr 28 '11 at 16:52
  • I consider that a 7 bit encoding! But I see what you're getting at. – Kylotan May 03 '11 at 15:42
  • @Kylotan - "I consider that a 7 bit encoding" - that was just an example. Encoding.Default is an 8-bit encoding that will mutate values through an encode/decode roundtrip too. – Joe May 05 '11 at 18:38
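The point made in these comments can be checked with a small sketch. `RoundTrips` is a hypothetical helper that decodes bytes to a string and re-encodes them; Encoding.ASCII replaces bytes above 127 during decoding, so those values do not come back unchanged, while ISO-8859-1 preserves them:

```csharp
using System;
using System.Linq;
using System.Text;

public static class RoundTripCheck
{
    // Hypothetical helper: true if decoding then re-encoding yields the original bytes.
    public static bool RoundTrips(Encoding encoding, byte[] bytes)
    {
        string decoded = encoding.GetString(bytes);
        return encoding.GetBytes(decoded).SequenceEqual(bytes);
    }

    public static void Main()
    {
        byte[] data = { 0x41, 0x7F, 0x80, 0xFF };

        Console.WriteLine(RoundTrips(Encoding.ASCII, data));                     // False
        Console.WriteLine(RoundTrips(Encoding.GetEncoding("ISO-8859-1"), data)); // True
    }
}
```

Encoding.Default is left out of the sketch because its behavior depends on the machine's configured code page.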

I would suggest that you use a MemoryStream, combined with the GetBuffer() method to retrieve the end result (only the first Length bytes of that buffer are valid; ToArray() returns a trimmed copy instead). Strings are fixed length and immutable, so adding or replacing a single element requires copying the whole thing into a new string, which is quite slow. To avoid this you would need to use a StringBuilder, which allocates memory and doubles the capacity when needed, but then you might just as well use a MemoryStream instead, which does a similar thing for bytes.
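A short sketch of that approach follows; the `Build` helper and the sample data are invented for illustration. It shows appending to a MemoryStream and the two ways of getting the bytes out: ToArray() copies exactly the bytes written, while GetBuffer() exposes the underlying array, which may be longer than the stream's Length:

```csharp
using System;
using System.IO;

public static class MemoryStreamDemo
{
    // Hypothetical helper: appends two chunks and returns exactly the bytes written.
    public static byte[] Build(byte[] first, byte[] second)
    {
        using (var stream = new MemoryStream())
        {
            stream.Write(first, 0, first.Length);
            stream.Write(second, 0, second.Length);
            return stream.ToArray(); // trimmed copy, no unused capacity
        }
    }

    public static void Main()
    {
        byte[] result = Build(new byte[] { 1, 2, 3 }, new byte[] { 4, 5 });
        Console.WriteLine(string.Join(",", result)); // 1,2,3,4,5

        var raw = new MemoryStream();
        raw.WriteByte(42);
        byte[] buffer = raw.GetBuffer(); // no copy, but may include unused bytes
        Console.WriteLine(raw.Length);                    // 1
        Console.WriteLine(buffer.Length >= raw.Length);   // True
    }
}
```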

Each element in the string is a char, which is actually two bytes because .NET strings are always UTF-16 in memory. That means you will also be wasting memory if you decide to store only one byte in each element.

Can Gencer
  • Since I'm using it as a variable-sized buffer, copying is going to happen no matter what I do. I'm also unsure whether a MemoryStream is suitable on its own as a buffer because it looks like it might keep growing in capacity even if I consume data from the front of it. – Kylotan Apr 21 '11 at 17:11
  • Yes, it will grow, unless you create a new one once in a while and copy the contents that are still relevant to it. I'm not sure how you are building up your buffer, whether it's byte by byte or in chunks. What type of convenience is there in the string class that you are missing? – Can Gencer Apr 21 '11 at 17:14
  • @R. Bemrose, yes, there are actually about 100k Unicode characters in total, so a single 16-bit UTF-16 code unit just isn't enough. – Can Gencer Apr 21 '11 at 17:15
  • +1: You can't stress enough that chars are UTF-16 code units, and are thus 2 bytes per char. For that matter, strings can have multiple chars per actual displayed character if the character lies outside the Basic Multilingual Plane and needs a surrogate pair. – Powerlord Apr 21 '11 at 17:15
  • @CanGencer: Whoops, I decided I didn't like the wording on my first comment, reworded it, and reposted it. I guess I didn't think anyone would respond to it before I did. :P – Powerlord Apr 21 '11 at 17:16
  • I'm not particularly concerned about chars being UTF-16, although the space wasted is not ideal. I also know all about variable width encoding etc. That's not the issue here, because I don't care about character encoding. I'm only concerned about getting the convenience of strings for byte data. – Kylotan Apr 21 '11 at 17:26
  • @Kylotan I wonder what convenience you are talking about. Is it being able to add them together, or to write x += ? There are methods such as System.Buffer.BlockCopy() which should help you with those. – Can Gencer Apr 21 '11 at 17:32
  • To add them together, to easily remove data from the beginning, etc. I'm hoping for one-liners, ideally performant ones. Having to manage the array size manually and copy things over each time is a hassle by comparison. – Kylotan Apr 21 '11 at 17:46
  • Nowadays, using the LINQ extension methods, you can use things like Skip() or Concat() even on an array of bytes. I'm not sure how efficient these are though, as they are meant to work on any IEnumerable generically. You can also easily write your own extension methods that use Buffer.BlockCopy and such behind the scenes. – Can Gencer Apr 21 '11 at 17:56
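As an illustration of the last suggestion, here is a minimal sketch of extension methods built on Buffer.BlockCopy. The names `AppendBytes` and `DropFront` are invented for this example; each call allocates a new array, so this gives one-liner convenience rather than avoiding copies:

```csharp
using System;

public static class ByteArrayExtensions
{
    // Hypothetical extension: returns a new array holding `first` followed by `second`.
    public static byte[] AppendBytes(this byte[] first, byte[] second)
    {
        var result = new byte[first.Length + second.Length];
        Buffer.BlockCopy(first, 0, result, 0, first.Length);
        Buffer.BlockCopy(second, 0, result, first.Length, second.Length);
        return result;
    }

    // Hypothetical extension: returns a new array without the first `count` bytes.
    public static byte[] DropFront(this byte[] source, int count)
    {
        var result = new byte[source.Length - count];
        Buffer.BlockCopy(source, count, result, 0, result.Length);
        return result;
    }
}

public static class Program
{
    public static void Main()
    {
        byte[] buffer = new byte[] { 1, 2 }.AppendBytes(new byte[] { 3, 4, 5 });
        buffer = buffer.DropFront(2);
        Console.WriteLine(string.Join(",", buffer)); // 3,4,5
    }
}
```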