-1

Given that a byte is 8 bits and a character is 1 byte, is there any way to manipulate an array of characters (a string) so that each character is represented in fewer bits (say, 5 bits)?

Weather Vane
Leeho Lim
  • Yes, it's possible. Look into the bitwise operators `|`, `&`, `<<`, and `>>` for this purpose. – user4815162342 Feb 23 '15 at 19:57
  • you can subtract 64 or mask the 6th bit away. – mch Feb 23 '15 at 19:58
  • Yes there is. You basically pseudocoded the algorithm for yourself already. Google around though. Chances are that what you're looking for already exists. – Paul Sasik Feb 23 '15 at 19:58
  • A "character is 1 byte" except when encoded in a form that uses more than 1 byte, e.g. a flavour of Unicode. If you are talking about a subset such as A-Z, you need only represent the range 0..25, which can be done with fewer bits; ditto for ASCII (7 bits) if you ignore values above 127. – Alex K. Feb 23 '15 at 19:59
  • Look into the information-theoretic literature. You should read up on entropy in that context. Depending on the entropy of your message, you may or may not be able to compress it further. – Morten Jensen Feb 23 '15 at 20:05
  • Having represented your values with 5 bits, you still need to *compress* them to gain any saving, because they still sit in an 8-bit byte (subject to `CHAR_BIT`). One way to do that is to pack them so that, say, `output[0]` holds 5 bits of `yourstring[0]` and 3 bits of `yourstring[1]`. The next output byte `output[1]` holds 2 bits of `yourstring[1]`, 5 bits of `yourstring[2]` and 1 bit of `yourstring[3]`. And so on. – Weather Vane Feb 23 '15 at 20:52
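
The packing scheme described in the last comment can be sketched as follows. This is only an illustration, not code from the thread: the names `pack5` and `unpack5` are hypothetical, the input is assumed to already be 5-bit codes (0..31), and bits are assumed to be written most significant bit first, so the first output byte holds all 5 bits of the first code and the top 3 bits of the second.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: pack n 5-bit codes (each 0..31) into out, MSB first.
   out must have room for (n * 5 + 7) / 8 bytes. Returns the bytes written. */
size_t pack5(const uint8_t *codes, size_t n, uint8_t *out)
{
    uint32_t acc = 0;  /* bit accumulator */
    int nbits = 0;     /* number of valid bits currently in acc */
    size_t o = 0;

    for (size_t i = 0; i < n; i++) {
        acc = (acc << 5) | (codes[i] & 0x1Fu);  /* append 5 new bits */
        nbits += 5;
        while (nbits >= 8) {                    /* emit every full byte */
            nbits -= 8;
            out[o++] = (uint8_t)(acc >> nbits);
            acc &= (1u << nbits) - 1u;          /* keep only the leftover bits */
        }
    }
    if (nbits > 0)  /* flush the last partial byte, zero-padded on the right */
        out[o++] = (uint8_t)(acc << (8 - nbits));
    return o;
}

/* Hypothetical helper: recover n 5-bit codes from bytes produced by pack5. */
void unpack5(const uint8_t *in, size_t n, uint8_t *codes)
{
    uint32_t acc = 0;
    int nbits = 0;
    size_t i = 0;

    for (size_t k = 0; k < n; k++) {
        while (nbits < 5) {                     /* refill from the input */
            acc = (acc << 8) | in[i++];
            nbits += 8;
        }
        nbits -= 5;
        codes[k] = (uint8_t)((acc >> nbits) & 0x1Fu);
        acc &= (1u << nbits) - 1u;
    }
}
```

With the codes 1, 2, 3, 4 as input, `pack5` produces the three bytes `0x08 0x86 0x40`, so four characters fit in three bytes instead of four. The receiver needs to know the code count `n` separately, because the padding bits in the last byte are indistinguishable from data.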

1 Answer

3

Sure, just map each character to a new encoding. However, as you reduce the number of bits, you support fewer possible characters in your 'alphabet'. For example, 5 bits can only support 32 possible characters.

Huffman encoding uses variable-length codes: frequent characters get short codes and rare characters get long ones, so when the code is designed from the actual character frequencies, the average code length is shorter than with a fixed-length encoding.

A third option is to keep the ASCII encoding, but use some sort of compression to reduce the number of bytes.

There are quite a few actual implementations to do each of these. For example, if you know you only have the 26 upper case letters 'A'-'Z', spaces, and no digits, you could use a 5-bit value, because you only need 27 values. A simple method would be to convert each character like this:

out_char = (in_char == ' ') ? 31 : (in_char - 'A');
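
As a minimal sketch, the mapping above can be wrapped in a pair of functions together with its inverse. The names `encode5` and `decode5`, the `-1` error value, and the `'?'` placeholder are assumptions made for illustration; the code also assumes an ASCII-compatible character set in which 'A'..'Z' are contiguous, which the C standard does not guarantee.

```c
/* Hypothetical helper: map 'A'..'Z' to 0..25 and ' ' to 31.
   Returns -1 for any character outside this 27-symbol alphabet. */
int encode5(char c)
{
    if (c == ' ')
        return 31;
    if (c >= 'A' && c <= 'Z')
        return c - 'A';
    return -1;
}

/* Hypothetical helper: inverse mapping.
   Returns '?' for codes that have no assigned character (26..30). */
char decode5(int code)
{
    if (code == 31)
        return ' ';
    if (code >= 0 && code <= 25)
        return (char)('A' + code);
    return '?';
}
```

Each result still occupies a full `int` or `char`; the saving only appears once the 5-bit values are packed together.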

If you need upper and lower case, you would need 52 letters (53 values with the space), which no longer fits in the 32 values of a 5-bit code, so you need 6 bits (64 values).
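
The arithmetic behind these bit counts is the smallest `b` with 2^b >= alphabet size. A small sketch, where the function name `bits_needed` is hypothetical and the argument is assumed to be between 1 and 2^31:

```c
/* Hypothetical helper: smallest number of bits b such that 2^b >= n_symbols.
   Assumes 1 <= n_symbols <= 2^31 so that the shift below stays in range. */
unsigned bits_needed(unsigned n_symbols)
{
    unsigned bits = 0;
    while ((1u << bits) < n_symbols)
        bits++;
    return bits;
}
```

For example, `bits_needed(27)` gives 5 and `bits_needed(52)` gives 6, matching the two cases in this answer.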

A Huffman implementation requires knowing the statistics of the input, that is, how often each character occurs, because the code lengths are derived from those frequencies.
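
To make the role of the statistics concrete, here is a rough sketch that counts character frequencies and derives Huffman code lengths from them. It is not from the original answer: `count_freq`, `huffman_lengths`, and `NSYM` are hypothetical names, the tree is built with a simple quadratic search for the two lightest nodes instead of a priority queue, and only the code lengths are computed, not the bit patterns.

```c
#define NSYM 256  /* one possible symbol per byte value (assumes 8-bit bytes) */

/* Count how often each byte value occurs in a NUL-terminated string. */
void count_freq(const char *text, unsigned long freq[NSYM])
{
    for (int i = 0; i < NSYM; i++)
        freq[i] = 0;
    for (; *text; text++)
        freq[(unsigned char)*text]++;
}

/* Compute Huffman code lengths: len[s] is the number of bits for symbol s,
   or 0 if s never occurs. Nodes 0..NSYM-1 are leaves, the rest are internal.
   Degenerate case: with a single distinct symbol its length comes out as 0. */
void huffman_lengths(const unsigned long freq[NSYM], unsigned len[NSYM])
{
    unsigned long weight[2 * NSYM];
    int parent[2 * NSYM];
    int active[2 * NSYM];
    int next = NSYM;    /* index of the next internal node to create */
    int remaining = 0;  /* number of active (unmerged) nodes */

    for (int i = 0; i < 2 * NSYM; i++) {
        weight[i] = (i < NSYM) ? freq[i] : 0;
        parent[i] = -1;
        active[i] = (i < NSYM && freq[i] > 0);
        remaining += active[i];
    }

    while (remaining > 1) {
        int a = -1, b = -1;  /* a: lightest node, b: second lightest */
        for (int i = 0; i < next; i++) {
            if (!active[i])
                continue;
            if (a < 0 || weight[i] < weight[a]) {
                b = a;
                a = i;
            } else if (b < 0 || weight[i] < weight[b]) {
                b = i;
            }
        }
        /* Merge the two lightest nodes under a new internal node. */
        weight[next] = weight[a] + weight[b];
        parent[a] = parent[b] = next;
        active[a] = active[b] = 0;
        active[next] = 1;
        next++;
        remaining--;
    }

    /* A symbol's code length is its depth in the tree. */
    for (int s = 0; s < NSYM; s++) {
        len[s] = 0;
        if (freq[s] == 0)
            continue;
        for (int p = parent[s]; p >= 0; p = parent[p])
            len[s]++;
    }
}
```

For the string `"AAAABBC"` this gives a 1-bit code for `A` and 2-bit codes for `B` and `C`, for a total of 10 bits, compared with 14 bits for a fixed 2-bit code over the same three symbols.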

caveman
  • I've been thinking about a way to implement this, but is there any way to store the value of a bit that gets shifted out of a byte by a bit-shifting operator? For example, if I were to right-shift 00000001 with >>, is there any way to store the 1 that I just shifted off? – Leeho Lim Feb 23 '15 at 20:19
  • The value that is pushed out is just (val & 0x01) before you actually shift. – caveman Feb 23 '15 at 20:22
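
The point in the last comment can be shown in a few lines: read the low bit with a mask first, then shift. The function name `shift_out_lsb` is hypothetical and only serves as an illustration.

```c
#include <stdint.h>

/* Hypothetical helper: shift *val right by one bit and return the bit
   that was shifted out (0 or 1). */
unsigned shift_out_lsb(uint8_t *val)
{
    unsigned bit = *val & 0x01u;  /* save the low bit before it is lost */
    *val >>= 1;
    return bit;
}
```

Starting from `0x01`, the call returns 1 and leaves the byte at 0. The saved bit can then be appended to another byte with something like `saved = (saved << 1) | bit`.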