
I have to calculate the symbol frequencies for a Huffman tree from a "binary file" given as the sole argument. My doubt is whether binary files are files that contain only "0" and "1".

Frequency here means the number of times each symbol occurs (e.g., in abbacdd the frequencies are a=2, b=2, c=1, d=2). My structure must look like this:

struct Node
{
    unsigned char symbol;      /* the symbol (one byte of input) */
    int freq;                  /* its frequency */
    struct Node *left, *right; /* left and right children */
};

But I do not understand at all how I can get the symbols from a ".bin" file (which consists of only "0" and "1").

When I try to view the contents of the file, I get:

hp@ubuntu:~/Desktop/Internship_Xav/Huf_pointer$ xxd -b out.bin 
0000000: 00000000 00000000 00000000 00000000 00000000 00000000  ......
0000006: 00000000 00000000 00000000 00000000 00000000 00000000  ......
000000c: 00000000 00000000 00000000 00000000 00000000 00000000  ......
0000012: 00000000 00000000 00000000 00000000 00000000 00000000  ......
0000018: 00000000 00000000 00000000 00000000 00000000 00000000  ......
000001e: 00000000 00000000 00000000 00000000 00000000 00000000  ......
0000024: 00000000 00000000 00000000 00000000 00000000 00000000  ......
000002a: 00000000 00000000 00000000 00000000 00000000 00000000  ......
0000030: 00000000 00000000 00000000 00000000 00000000 00000000  ......
.........//Here also there is similar kind of data    ................
00008ca: 00010011 00010011 00010011 00010011 00010011 00010011  ......
00008d0: 00010011 00010011 00010011 00010011 00010011 00010011  ......
00008d6: 00010011 00010011 00010011 00010011 00010011 00010011  ..... 

So I do not understand at all where the frequencies and the symbols are, how to store the symbols, or how to calculate the frequencies. Once I have the frequencies and symbols, I will build the Huffman tree from them.

Rasool
Sss
  • "I have a doubt that binary files are the files which contains "0" and "1" only" Oh, trust me, that's what they contain. – ElGavilan Feb 26 '14 at 15:09
  • The symbols are probably the data that you see. You will most likely have to calculate the frequencies yourself. – jliv902 Feb 26 '14 at 15:09
  • @jliv902 Sorry, could you please explain in detail? Suppose I take "0000030: 00000000 00000000 00000000 00000000 00000000 00000000 ....": what is the symbol there, and how do I calculate the frequency from it? I have to create a Huffman tree from it. Any idea how to do that using this binary file? – Sss Feb 26 '14 at 15:12
  • @jliv902 Sorry, could you please explain in detail? Suppose I take "00008d0: 00010011 00010011 00010011 00010011 00010011 00010011 ......": what is the symbol there, and how do I calculate the frequency from it? I have to create a Huffman tree from it. Any idea how to do that using this binary file? – Sss Feb 26 '14 at 15:20
  • @user234839 00010011 would be one symbol. You would store it in a data structure together with a count of how many times it shows up in the file. – jliv902 Feb 26 '14 at 15:35

2 Answers


First, you need to create some sort of frequency table.
You could use a std::map.
You would do something like this:

#include <algorithm>
#include <fstream>
#include <map>
#include <string>

std::map <unsigned char, int> CreateFrequencyTable (const std::string &strFile)
{
    std::map <unsigned char, int> char_freqs ; // byte value -> number of occurrences

    // Open in binary mode so that no byte is translated or dropped.
    std::ifstream file (strFile.c_str (), std::ios::binary) ;

    int next = 0 ;
    while ((next = file.get ()) != std::ifstream::traits_type::eof ()) {
        unsigned char uc = static_cast <unsigned char> (next) ;

        // operator[] inserts a zero count the first time a byte is seen.
        char_freqs [uc] += 1 ;
    }

    return char_freqs ;
}

Then you could use this function like this:

std::map <unsigned char, int> char_freqs = CreateFrequencyTable ("file") ;

You can obtain the element with the highest frequency like this:

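std::map <unsigned char, int>::iterator iter = std::max_element (
    char_freqs.begin (), 
    char_freqs.end (), 
    [] (const std::pair <const unsigned char, int> &a, const std::pair <const unsigned char, int> &b) {
        return a.second < b.second ; // compare the counts, not the keys
    }
) ;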

Then you would need to build your Huffman tree.
Remember that the characters are all leaf nodes, so you need a way to differentiate the leaf nodes from the non-leaf nodes.
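
As a rough sketch of that step (not necessarily how your assignment expects it), the usual approach repeatedly merges the two lowest-frequency nodes using a priority queue. The Node layout is the one from the question; BuildHuffmanTree, IsLeaf, FreeTree and HigherFreq are hypothetical names introduced here, and a leaf is assumed to be any node with no children:

```cpp
#include <map>
#include <queue>
#include <vector>

// Node layout from the question; a node with no children is treated as a leaf.
struct Node {
    unsigned char symbol;
    int freq;
    Node *left, *right;
};

// Comparator that puts the node with the smallest frequency on top of the queue.
struct HigherFreq {
    bool operator() (const Node *a, const Node *b) const { return a->freq > b->freq; }
};

// Hypothetical helper: builds the tree and returns its root (nullptr for an empty table).
// The caller owns the nodes and should release them with FreeTree.
Node *BuildHuffmanTree (const std::map <unsigned char, int> &char_freqs)
{
    std::priority_queue <Node *, std::vector <Node *>, HigherFreq> pq;

    // One leaf per symbol that occurs in the input.
    for (const auto &entry : char_freqs)
        pq.push (new Node{entry.first, entry.second, nullptr, nullptr});

    // Merge the two rarest nodes under a new internal node until one root remains.
    while (pq.size () > 1) {
        Node *a = pq.top (); pq.pop ();
        Node *b = pq.top (); pq.pop ();
        pq.push (new Node{0, a->freq + b->freq, a, b}); // symbol is unused in internal nodes
    }
    return pq.empty () ? nullptr : pq.top ();
}

// Leaves carry symbols; internal nodes only carry the summed frequency.
bool IsLeaf (const Node *n) { return n->left == nullptr && n->right == nullptr; }

void FreeTree (Node *n)
{
    if (n == nullptr) return;
    FreeTree (n->left);
    FreeTree (n->right);
    delete n;
}
```

Calling BuildHuffmanTree (char_freqs) on the table from above gives a root whose freq equals the total number of bytes read.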

Update

If reading one character at a time from the file is too slow, you could always load all of the contents into a vector like this:

// Make sure to #include <iterator> and <vector>
std::ifstream file ("test.txt", std::ios::binary) ;
// istreambuf_iterator reads raw bytes; istream_iterator would skip whitespace bytes.
std::vector <unsigned char> vecBuffer ((std::istreambuf_iterator <char> (file)), std::istreambuf_iterator <char> ()) ;

You would still need to create a frequency table.

jliv902
  • OK thanks, I will try it and mark it solved as soon as it works. But what is the complexity of this frequency calculation if there are n symbols? – Sss Feb 26 '14 at 17:01
  • @user234839 I believe creating the table would be `O(n log k)` for n bytes and k distinct symbols (k is at most 256), and getting the element with the highest frequency with `std::max_element` would be `O(k)`. A plain 256-entry array or vector would be `O(n)` and may also perform better because of cache spatial locality. Either way, your bottleneck will be the IO (getting one character at a time will be slow), not the container. If, after profiling the program, you find it too slow, you could always read the entire file into memory and then create this frequency table. – jliv902 Feb 26 '14 at 17:10
  • Actually, I am obliged to do it using read() (the built-in function) only. – Sss Feb 26 '14 at 17:28

A symbol in a Huffman tree could be anything,
but since you have to use an unsigned char per symbol,
you should probably take one byte as a symbol.
So no, not only 0 or 1, but eight 0s or 1s together.

Like 00010011 somewhere in your xxd output:
xxd -b just prints eight 0/1 digits per byte.
You could write each byte as a number between 0 and 255 as well,
or as two hexadecimal digits from 0123456789abcdef.
There are lots of possibilities for showing a byte on the screen,
but that does not matter at all.
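
To make that concrete, here is a small sketch that renders one byte in the notations mentioned above. AsBinary and AsHex are hypothetical helper names introduced only for this illustration:

```cpp
#include <bitset>
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: renders one byte as eight binary digits, like `xxd -b` does.
std::string AsBinary (unsigned char byte)
{
    return std::bitset <8> (byte).to_string ();
}

// Hypothetical helper: renders the same byte as two hexadecimal digits.
std::string AsHex (unsigned char byte)
{
    std::ostringstream out;
    out << std::hex << std::setw (2) << std::setfill ('0') << static_cast <int> (byte);
    return out.str ();
}
```

The byte shown as 00010011 by xxd -b is the decimal value 19, so AsBinary (19) gives "00010011" and AsHex (19) gives "13". All three notations describe the same symbol.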

If you know how to read the contents of a file in C/C++,
just read unsigned chars until the file ends
and count how often each value occurs. That's all.

Since you're probably writing decimal numbers in your program code,
there are 256 different values (0, 1, 2, ..., 255).
So you will need 256 integers (in an array, or in your Node structs...)
to count how often each value appears.
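
A minimal sketch of that counting loop with a 256-entry array might look like the following. It assumes a POSIX system (as in the Ubuntu prompt from the question) and uses the read() system call; ByteCounts, CountBytes and CountFile are hypothetical names, and the 4096-byte chunk size is an arbitrary choice:

```cpp
#include <array>
#include <cstddef>
#include <fcntl.h>    // open (POSIX)
#include <unistd.h>   // read, close (POSIX)

typedef std::array <long, 256> ByteCounts; // counts[v] = how often byte value v occurs

// Adds every byte in buf[0..n) to the table.
void CountBytes (const unsigned char *buf, std::size_t n, ByteCounts &counts)
{
    for (std::size_t i = 0; i < n; ++i)
        ++counts[buf[i]];
}

// Hypothetical helper: counts all bytes of the file at `path`, reading it in chunks.
// Returns false if the file cannot be opened or a read fails.
bool CountFile (const char *path, ByteCounts &counts)
{
    counts.fill (0);
    int fd = open (path, O_RDONLY);
    if (fd < 0)
        return false;

    unsigned char buf[4096];
    ssize_t got;
    while ((got = read (fd, buf, sizeof buf)) > 0)
        CountBytes (buf, static_cast <std::size_t> (got), counts);

    close (fd);
    return got == 0; // 0 means end of file, -1 means a read error
}
```

After CountFile ("out.bin", counts) succeeds, every index with a non-zero count is a symbol that occurs in the file, and the count is its frequency.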

deviantfan
  • Thanks. I tried for (counter = 1; counter <= 10; counter++) { fread(&tree, sizeof(struct Node), 1, ptr_myfile); cout << "tree->freq: " << tree->freq << endl; } with the same out.bin file as the sole argument, but it prints tree->freq: 1707388 ten times (tree is of type Node *; see my struct for that). Why does it repeat the same value? I also do not understand how to store the symbols; this is a big file. Could you share a piece of code as a reference, please? – Sss Feb 26 '14 at 15:53