2

I'm trying to implement a Huffman tree.

Content of the simple .txt file that I want to use for a simple test:

aaaaabbbbccd

Frequencies of characters: a:5, b:4, c:2, d:1

Code Table: (Data type of 1s and 0s: string)

a:0
d:100
c:101
b:11         

Result that I want to write as binary: (22 bits)

0000011111111101101100          

How can I write bit-by-bit each character of this result as a binary to ".dat" file? (not as string)

genpfault
Murat

3 Answers

6

Answer: You can't.

The minimum amount you can write to a file (or read from it) is a char or unsigned char. For all practical purposes, a char has exactly eight bits.

You are going to need a one-char buffer and a count of the number of bits it holds. When that count reaches 8, you need to write the buffer out and reset the count to 0. You will also need a way to flush the buffer at the end. (Note that you cannot write 22 bits to a file; you can only write 16 or 24. You will need some way to mark which bits at the end are unused.)

Something like:

#include <cstdio>

struct BitBuffer {
    FILE* file; // Initialization skipped.
    unsigned char buffer = 0; // Bits collected so far.
    unsigned count = 0;       // Number of bits currently held in buffer.

    void outputBit(unsigned char bit) {
        buffer <<= 1;         // Make room for next bit.
        if (bit) buffer |= 1; // Set if necessary.
        count++;              // Remember we have added a bit.
        if (count == 8) {
            fwrite(&buffer, sizeof(buffer), 1, file); // Error handling elided.
            buffer = 0;
            count = 0;
        }
    }
};
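
The answer mentions flushing and marking the unused bits but does not show them. Below is a minimal sketch of one way to do it, not the answerer's code. It assumes a `std::ostream` instead of `FILE*` (so it works with both `std::ofstream` and an in-memory stream), and the names `BitWriter`, `flush` and `packBits` are hypothetical. `flush` left-aligns the partial byte, pads it with zero bits and returns how many padding bits it added, so the caller can store that number somewhere (for example in a file header).

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical variant of the BitBuffer idea, with a flush step added.
struct BitWriter {
    std::ostream& out;         // Assumed to be opened in binary mode by the caller.
    unsigned char buffer = 0;  // Bits collected so far, oldest bit highest.
    unsigned count = 0;        // Number of bits currently held in buffer (0..7).

    explicit BitWriter(std::ostream& o) : out(o) {}

    void outputBit(bool bit) {
        buffer = static_cast<unsigned char>((buffer << 1) | (bit ? 1u : 0u));
        if (++count == 8) {
            out.put(static_cast<char>(buffer));  // Error handling elided.
            buffer = 0;
            count = 0;
        }
    }

    // Writes a pending partial byte, padded on the right with zero bits.
    // Returns the number of padding bits that were added (0..7).
    unsigned flush() {
        if (count == 0) return 0;
        const unsigned padding = 8 - count;
        out.put(static_cast<char>(static_cast<unsigned char>(buffer << padding)));
        buffer = 0;
        count = 0;
        return padding;
    }
};

// Packs a string of '0'/'1' characters into bytes held in a std::string.
// The number of padding bits in the last byte is returned through `padding`.
std::string packBits(const std::string& bits, unsigned& padding) {
    std::ostringstream out;
    BitWriter writer(out);
    for (char c : bits) {
        assert(c == '0' || c == '1');
        writer.outputBit(c == '1');
    }
    padding = writer.flush();
    return out.str();
}
```

To write a real file, the same `BitWriter` could be constructed on a `std::ofstream` opened with `std::ios::binary`. For the 22-bit string from the question this should produce 3 bytes with 2 padding bits.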
  • @Murat: Sorry about that! I have used `sizeof(buffer)` because it is `buffer` you are writing out. The fact that `bit` is the same size is irrelevant. Alternatively, you could write this as 1, because `sizeof(unsigned char)` is *defined* to be 1 (but using the constant means that you couldn't optimize by changing buffer to `uint64_t`). – Martin Bonner supports Monica Dec 21 '17 at 17:19
1

The OP asked:

How can I write bit-by-bit each character of this result as a binary to ".dat" file? (not as string)

You cannot, and here is why:


Memory model

Defines the semantics of a computer memory storage for the purpose of C++ abstract machine.

The memory available to a C++ program is one or more contiguous sequences of bytes. Each byte in memory has a unique address.

Byte

A byte is the smallest addressable unit of memory. It is defined as a contiguous sequence of bits, large enough to hold the value of any UTF-8 code unit (256 distinct values) and of (since C++14) any member of the basic execution character set (the 96 characters that are required to be single-byte). Similar to C, C++ supports bytes of sizes 8 bits and greater.

The types char, unsigned char, and signed char use one byte for both storage and value representation. The number of bits in a byte is accessible as CHAR_BIT or std::numeric_limits<unsigned char>::digits.

Compliments of cppreference.com

You can find this page here: cppreference:memory model
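
As a small illustration of the quoted text, the sketch below checks the byte size at compile time and computes how many bytes a given number of bits occupies. The helper name `bytesNeeded` is made up for this example.

```cpp
#include <cassert>
#include <climits>
#include <limits>

// A byte has at least 8 bits, and sizeof(char) is 1 by definition.
static_assert(CHAR_BIT >= 8, "a byte has at least 8 bits");
static_assert(sizeof(char) == 1, "char occupies exactly one byte");
static_assert(std::numeric_limits<unsigned char>::digits == CHAR_BIT,
              "unsigned char uses every bit of its byte");

// Hypothetical helper: number of whole bytes needed to store bitCount bits,
// rounding up because a partial byte cannot be written.
constexpr unsigned long bytesNeeded(unsigned long bitCount) {
    return (bitCount + CHAR_BIT - 1) / CHAR_BIT;
}
```

On a platform where `CHAR_BIT` is 8, `bytesNeeded(22)` is 3, which matches the 24 bits mentioned in the other answer.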


This comes from the working draft of the standard dated 2017-03-21:

©ISO/IEC N4659

4.4 The C++ memory model [intro.memory]

  1. The fundamental storage unit in the C++ memory model is the byte. A byte is at least large enough to contain any member of the basic execution character set (5.3) and the eight-bit code units of the Unicode UTF-8 encoding form and is composed of a contiguous sequence of bits (see footnote 4), the number of which is implementation-defined. The least significant bit is called the low-order bit; the most significant bit is called the high-order bit. The memory available to a C++ program consists of one or more sequences of contiguous bytes. Every byte has a unique address.
  2. [ Note: The representation of types is described in 6.9. —end note ]
  3. A memory location is either an object of scalar type or a maximal sequence of adjacent bit-fields all having nonzero width. [ Note: Various features of the language, such as references and virtual functions, might involve additional memory locations that are not accessible to programs but are managed by the implementation. —end note ] Two or more threads of execution (4.7) can access separate memory locations without interfering with each other.
  4. [ Note: Thus a bit-field and an adjacent non-bit-field are in separate memory locations, and therefore can be concurrently updated by two threads of execution without interference. The same applies to two bit-fields, if one is declared inside a nested struct declaration and the other is not, or if the two are separated by a zero-length bit-field declaration, or if they are separated by a non-bit-field declaration. It is not safe to concurrently update two bit-fields in the same struct if all fields between them are also bit-fields of nonzero width. —end note ]
  5. [ Example: A structure declared as

    struct {
        char a;
        int b:5,
        c:11,
        :0,
        d:8;
        struct {int ee:8;} e;
    }
    

    contains four separate memory locations: The field a and bit-fields d and e.ee are each separate memory locations, and can be modified concurrently without interfering with each other. The bit-fields b and c together constitute the fourth memory location. The bit-fields b and c cannot be concurrently modified, but b and a, for example, can be. —end example ]


    4) The number of bits in a byte is reported by the macro CHAR_BIT in the header <climits>.

This version of the standard can be found here: www.open-std.org, § 4.4 on pages 8 and 9.


The smallest unit of memory that a program can write is one byte, which is 8 or more contiguous bits. Even with bit-fields, the 1-byte requirement still holds. You can set, clear, and toggle individual bits within a byte, but you cannot write an individual bit on its own.
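
Manipulating individual bits inside a byte could look like the sketch below. The function names are hypothetical, and it assumes 8-bit bytes with index 0 as the least significant bit.

```cpp
#include <cassert>
#include <cstdint>

// Assumption for this sketch: index 0 is the least significant bit, 7 the most.

inline std::uint8_t setBit(std::uint8_t byte, unsigned index) {
    assert(index < 8);
    return static_cast<std::uint8_t>(byte | (1u << index));
}

inline std::uint8_t clearBit(std::uint8_t byte, unsigned index) {
    assert(index < 8);
    return static_cast<std::uint8_t>(byte & ~(1u << index));
}

inline std::uint8_t toggleBit(std::uint8_t byte, unsigned index) {
    assert(index < 8);
    return static_cast<std::uint8_t>(byte ^ (1u << index));
}

inline bool testBit(std::uint8_t byte, unsigned index) {
    assert(index < 8);
    return ((byte >> index) & 1u) != 0;
}
```

Each function still reads and produces a whole byte; only the value of one bit inside it changes.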

What you can do is keep a byte buffer together with a count of the bits written. Once all your required bits are written, the remaining unused bits in the last byte need to be marked as padding.
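
One possible way to keep track of the padding is to store the number of meaningful bits next to the bytes. The sketch below assumes that approach; `PackedBits` and `unpackBits` are hypothetical names, and bits are assumed to be stored most significant bit first within each byte.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical container: the packed bytes plus the number of meaningful bits.
struct PackedBits {
    std::vector<std::uint8_t> bytes;
    std::size_t bitCount = 0;  // Bits beyond this count are padding.

    std::size_t paddingBits() const { return bytes.size() * 8 - bitCount; }
};

// Recovers the meaningful bits as a string of '0'/'1', skipping the padding.
std::string unpackBits(const PackedBits& packed) {
    assert(packed.bitCount <= packed.bytes.size() * 8);
    std::string bits;
    bits.reserve(packed.bitCount);
    for (std::size_t i = 0; i < packed.bitCount; ++i) {
        const std::uint8_t byte = packed.bytes[i / 8];
        const unsigned shift = 7u - static_cast<unsigned>(i % 8);
        bits.push_back(((byte >> shift) & 1u) ? '1' : '0');
    }
    return bits;
}
```

With the bytes `0x07 0xFD 0xB0` and a bit count of 22, this should give back the 22-bit string from the question and report 2 padding bits.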

Edit

[Note:] When using bit-fields or unions, one thing you must take into consideration is the endianness of the specific architecture.

Francis Cugler
0

Answer: You can, in a way.

From my experience, there is a simple way to do this. You need to define an array of characters (it can be as small as 1 byte, or bigger). After that, you define functions to access a specific bit of any element. For example, the following C++ function returns the value of a given bit of a char; call it with position 3 to get the 3rd bit.

/* position is in [1, ..., n], counted from the least significant bit
   (position 1); bytes in the array are indexed from 0 */
int bit_at(int position, unsigned char byte)
{
  return (byte >> (position - 1)) & 1;  /* yields 0 or 1 */
}

Now you can picture the array of bytes as [b1,...,bn].

What we actually have in memory is 8 * n bits, which we can visualize like so (NOTE: the array is zeroed!):

|0000 0000|0000 0000|...|0000 0000|
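
Addressing a single bit in such an array could be sketched as follows. This is only an illustration: the names `getBitAt` and `setBitAt` are hypothetical, and it assumes bit index 0 is the leftmost (most significant) bit of the first byte, so the layout reads left to right like the picture above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Bit i lives in byte i / 8; within that byte, i % 8 counts from the left.

bool getBitAt(const std::vector<std::uint8_t>& bytes, std::size_t i) {
    assert(i / 8 < bytes.size());
    return ((bytes[i / 8] >> (7 - i % 8)) & 1u) != 0;
}

void setBitAt(std::vector<std::uint8_t>& bytes, std::size_t i, bool value) {
    assert(i / 8 < bytes.size());
    const std::uint8_t mask = static_cast<std::uint8_t>(1u << (7 - i % 8));
    if (value) {
        bytes[i / 8] |= mask;                              // Turn the bit on.
    } else {
        bytes[i / 8] &= static_cast<std::uint8_t>(~mask);  // Turn the bit off.
    }
}
```

Starting from a zeroed array means only the 1 bits ever need to be set.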

From this, you can work out how to manipulate the array to get at a specific bit. Of course, there will be some sort of conversion involved, but that is not much of a problem. Finally, take the encoding you provided, that is: a:0 d:100 c:101 b:11

We can encode the message "abcd", and make an array that holds the bits of the message, using the elements of the array as arrays for bits, like so:

|0111 0110|0000 0000|

You can write this to memory and you will have an excess of at most 7 bits. This is a simple example, but it can be extended into much more. I hope this gave some answers to your question.
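
Putting the pieces together, encoding a message with the code table might look like the sketch below. It is an illustration under stated assumptions rather than a finished encoder: the function name `encode` and the use of `std::map` for the code table are choices made here, and bits are filled from the leftmost bit of each byte, as in the picture above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Encodes `message` with the given code table into zero-padded bytes.
// The number of meaningful bits is returned through `bitCount`.
// codes.at() throws std::out_of_range for a symbol missing from the table.
std::vector<std::uint8_t> encode(const std::string& message,
                                 const std::map<char, std::string>& codes,
                                 std::size_t& bitCount) {
    std::vector<std::uint8_t> bytes;
    bitCount = 0;
    for (char symbol : message) {
        const std::string& code = codes.at(symbol);
        for (char c : code) {
            assert(c == '0' || c == '1');
            if (bitCount % 8 == 0) bytes.push_back(0);  // Start a new zeroed byte.
            if (c == '1') {
                bytes.back() |= static_cast<std::uint8_t>(1u << (7 - bitCount % 8));
            }
            ++bitCount;
        }
    }
    return bytes;
}
```

For "abcd" with the table from the question, this should produce 9 meaningful bits in two bytes, `0x76 0x00`, which matches |0111 0110|0000 0000|. The resulting bytes can then be written to a file opened in binary mode.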

Goshaka_