1

I'm experimenting with some ideas in which algorithms have to work on bits as their smallest unit of information. This is a modular application where the user may rearrange parts of the "pipeline" like a unix shell pipeline. These algorithms do various things like framing, compression, decompression, error checking and correcting; introducing, detecting and removing noise, etc.

Since they work on the bit level, the algorithms may, for example, take 5 bits of input and produce 19 bits of output. The input and output are rarely multiple of bytes.

Working with bit streams in memory and between threads is fine with the help of std::vector<bool>, but I have to retrieve and store this stream of bits from/to somewhere, and preferably it should be possible to do actual command-line pipelines like:

prog1 < bitsource.dat | prog2 -opts | prog3 -opts > bitsink.dat

Or even:

prog1 | prog2 | ssh user@host /bin/sh -c "prog3 | prog4 > /dev/dsp"

The problem is how to serialize these bits efficiently, since the standard streams (stdin and stdout) are byte-oriented. I have to handle situations where the number of bits in the input and output are not multiple of a byte.

Currently, I have a working proof-of-concept that does it by expanding each bit to a byte that is either 0x30 or 0x31 ("0" or "1"). Clearly, this increases the size of data by a factor of eight, consuming 8× more space and bandwidth than necessary. I would like to have these bits packed in a more efficient manner.

One alternative that I'm considering is a protocol that buffers the bits in the output and produces blocks consisting of a Length header followed by ceiling(Length/8) bytes of data, flushing the output whenever appropriate.

But instead of creating a made-up-protocol, I would like to know if someone already had these requirements, what are your experiences, and if there is already some standard protocol for this (serialization of an arbitrary number of bits) that I could use. Perhaps someone already had this problem and is already using some form of encoding that could also be used in this application, to avoid the proliferation of incompatible formats.

Juliano
  • 39,173
  • 13
  • 67
  • 73
  • "Of course, this is very inefficient"? Really? Measuring time, it may actually be pretty good considering how slow some I/O systems are. Have you measured this? Or are are you talking about the 8x1 expansion in size? What kind of efficiency are you looking for? – S.Lott Jan 03 '11 at 17:18
  • @S.Lott: Yes, I'm talking about the eight-fold increase in size. Space (storage), bandwidth (transmission over a network) and processor (conversion during input/output) efficiency are the concerns, specially space and bandwidth. And eight-fold increase is rarely ignorable. If everything goes well, this should be used for a real-time application. I would like to do it well now than have a bottleneck that would mandate a rewrite of everything later. – Juliano Jan 03 '11 at 17:32
  • Please **update** the question to make it perfectly clear. It's your question; you can improve it. A long string of comments is difficult to integrate into a single, clear question. – S.Lott Jan 03 '11 at 17:38
  • @S.Lott: Done. Clarified a little. – Juliano Jan 03 '11 at 17:48

1 Answers1

1

protocol that buffers the bits in the output and produces blocks consisting of a Length header followed by ceiling(Length/8) bytes of data, flushing the output whenever appropriate.

This is typical. There aren't really any alternatives that would be suitably simple.

Serialization of bits -- as bits -- is rare. Bitmap indexes are about the only example that comes to mind.

The Pascal programming language encoded all strings with a length followed by the bytes of the string. You're doing a similar thing except it's bits, not bytes.

This is similar to "run-length encoding" where runs of identical values are replaced by a header and the bytes. The PackBits algorithm, for example, is a simple RLE that provides header plus data. It works at the byte level (not the bit level) but it's essentially the same design pattern.

S.Lott
  • 384,516
  • 81
  • 508
  • 779
  • Although not the answer I was looking for (something standard or that someone already invented before), I went for this path and it worked really well in the end. Thanks for your contribution. – Juliano Feb 13 '11 at 18:27