A faster and more accurate algorithm for estimating compressibility
- Gives a 2 to 4 times faster and more accurate answer than judging by Shannon's entropy. It is based on the Huffman coding approach.
- The time taken to compute the answer does not depend on the numerical values of the symbol frequencies; it depends only on the number of unique symbols. Shannon's entropy calculates log(frequency), hence the larger the frequency values, the more time it consumes to compute. In the current approach, mathematical operations on the frequency values are avoided. (A sketch of this baseline follows this list.)
- For similar reasons, precision is also higher, since the dependency on floating-point operations is avoided: we rely only on sum and multiplication operations and on how the actual Huffman codes contribute to the total compressed size.
- The same algorithm can be enhanced to generate the actual Huffman codes in less time, without involving complex data structures like trees, heaps, or priority queues. For these different requirements we just reuse the same frequency array of symbols.
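For reference, the Shannon-entropy baseline these points compare against could look like the following sketch (the function name and setup are illustrative, not from the original):

#include <math.h>

/* Hypothetical baseline: estimate the compressed size from Shannon's entropy.
 * For every occurring symbol it computes freq * -log2(freq / file_len), i.e.
 * one floating-point logarithm per distinct symbol -- exactly the kind of
 * work on frequency values that the proposed algorithm avoids.
 */
double shannon_estimate_bits(const long map[256], long file_len) {
    double bits = 0.0;
    int s;
    for (s = 0; s < 256; s++)
        if (map[s] != 0)
            bits += map[s] * -log2((double)map[s] / file_len);
    return bits;
}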
The following algorithm specifies how to calculate the compressibility of a file whose symbol frequency values are stored in the map array.

[Chart: time comparison]
long compressed_file_size_in_bits = 0;
int i, j, n = 256;
/* We sort the map array in increasing order.
 * We will be simulating the Huffman coding algorithm.
 * Insertion sort is used as it's a small array of 256 symbols.
 */
insertionSort(map, 256);
/* Skip symbols whose frequency is zero; they get no code. */
for (j = 0; j < n; j++)
    if (map[j] != 0)
        break;
for (i = j; i + 1 < n; i++) {
    j = i + 1;
    /* The following is an important step: as we keep building more
     * and more codes bottom-up, their contribution to the compressed size
     * is governed by the formula below. A pen-and-paper simulation is
     * recommended.
     */
    compressed_file_size_in_bits = compressed_file_size_in_bits + map[i] + map[j];
    /* The two least elements of the map get summed up and form a new
     * frequency value, which gets placed at index i+1.
     */
    map[i + 1] = map[i] + map[j];
    /* map[i+2 ...] is already sorted; just fix the first element. */
    Adjust_first_element(map + i + 1, n - i - 1);
}
printf("Entropy per byte %f ", compressed_file_size_in_bits * (1.0) / file_len);
void insertionSort(long arr[], long n) {
    long i, key, j;
    for (i = 1; i < n; i++) {
        key = arr[i];
        j = i - 1;
        /* Move elements of arr[0..i-1] that are
         * greater than key one position ahead
         * of their current position. */
        while (j >= 0 && arr[j] > key) {
            arr[j + 1] = arr[j];
            j = j - 1;
        }
        arr[j + 1] = key;
    }
}
/* Assumes arr[1...] is already sorted; just the first
 * element needs to be placed at its appropriate position. */
void Adjust_first_element(long arr[], long n) {
    long key, j = 1;
    key = arr[0];
    while (j < n && arr[j] < key) {
        arr[j - 1] = arr[j];
        j = j + 1;
    }
    arr[j - 1] = key;
}
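For completeness, here is a minimal hypothetical driver showing how the map array and file_len used above might be produced; the file handling is an illustrative assumption, not part of the original:

#include <stdio.h>

long map[256];   /* frequency of each byte value 0..255 */

int main(int argc, char *argv[]) {
    if (argc < 2)
        return 1;
    FILE *fp = fopen(argv[1], "rb");
    if (fp == NULL)
        return 1;
    long file_len = 0;
    int c;
    while ((c = fgetc(fp)) != EOF) {  /* count byte frequencies */
        map[c]++;
        file_len++;
    }
    fclose(fp);
    /* ...the estimation loop shown above runs here on map[] and file_len... */
    return 0;
}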
Construction of codes using the above algorithm
Constructing the codes with the above algorithm is a string manipulation problem. We start with no code for each symbol, then follow the same algorithm as for the compressed-size/compressibility calculation; additionally, we keep maintaining the history of how the codes evolve. After the iteration through the frequency array finishes, the final string, which contains the evolution of the different Huffman codes for each symbol, is stored at the top index of the codes array. For example, if for illustration each symbol's initial string is seeded with its identifier, then for three symbols a, b, c with frequencies 1, 2, 3 the final string is ((a0,b1)0,c1), which encodes a = 00, b = 01, c = 1.
At this point, a string parsing algorithm can parse this evolution and generate the individual code for each symbol (a sketch appears after the listings below). The complete process involves no trees, heaps, or priority queues: a single iteration through the frequency array (size 256 in most cases) generates the evolution of the codes as well as the final compressed-size value.
#include <string.h>

/* Code strings evolve in this global array, one entry per map slot. The
 * buffers are sized generously: each of the (up to 255) merges adds five
 * characters to the evolving string.
 */
char codes[256][4096];

/* Generate codes for the frequencies stored in map[l..r] (inclusive).
 * The final code gets generated at codes[r], which can be provided as
 * input to a string parsing algorithm to generate the code for each
 * individual symbol.
 */
void generate_code(long map[], int l, int r) {
    int i, j;
    long compressed_file_size_in_bits = 0;
    insertionSort(map + l, r - l + 1);
    for (i = l; i + 1 <= r; i++) {
        j = i + 1;
        compressed_file_size_in_bits = compressed_file_size_in_bits + map[i] + map[j];
        char code[4096] = "(";
        /* According to the algorithm, two different codes from two different
         * nodes get combined in a way that lets a string parsing algorithm
         * separate them again. The left node's code, codes[i], gets appended
         * with 0 and the right node's code, codes[j], gets appended with 1.
         * These two codes are separated by a comma.
         */
        strcat(code, codes[i]);
        strcat(code, "0");
        strcat(code, ",");
        strcat(code, codes[j]);
        strcat(code, "1");
        strcat(code, ")");
        map[i + 1] = map[i] + map[j];
        strcpy(codes[i + 1], code);
        /* Adjust_first_element now takes an additional 3rd argument.
         * This argument helps in adjusting codes according to how the
         * map elements get adjusted.
         */
        Adjust_first_element(map + i + 1, r - i, i + 1);
    }
}
// insertionSort is identical to the listing in the previous section.
/* Assumes arr[1...] is already sorted; just the first element needs to be
 * placed at its appropriate position. start is the absolute index of arr[0]
 * within the global codes array.
 */
void Adjust_first_element(long arr[], long n, int start) {
    long key, j = 1;
    char temp_arr[4096];
    key = arr[0];
    /* However the map elements change position, the codes[] elements follow
     * the same path.
     */
    strcpy(temp_arr, codes[start]);
    while (j < n && arr[j] < key) {
        arr[j - 1] = arr[j];
        /* codes should also move according to map values */
        strcpy(codes[j - 1 + start], codes[j + start]);
        j = j + 1;
    }
    arr[j - 1] = key;
    strcpy(codes[j - 1 + start], temp_arr);
}
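The string parsing step itself is not listed above. The following is a minimal sketch of one way it could work, under an illustrative assumption: each codes[] entry is seeded with a one-character symbol identifier before the merge loop (the text above starts from empty strings, so some identifier is needed to map the final codes back to symbols; with seeding, the initial sort would also have to permute codes[] alongside map[]). The grammar produced by generate_code is: node = leaf | '(' node '0' ',' node '1' ')', with each child's bit written directly after it.

#include <stdio.h>

/* Hypothetical parser for the evolution string: walks the nested structure,
 * accumulating one prefix bit per nesting level, and prints the code of each
 * leaf symbol. Returns a pointer just past the node it parsed.
 */
const char *parse_node(const char *s, char *prefix, int depth) {
    if (*s != '(') {                 /* leaf: a one-character symbol id */
        prefix[depth] = '\0';
        printf("%c -> %s\n", *s, prefix);
        return s + 1;
    }
    s++;                             /* consume '(' */
    for (int child = 0; child < 2; child++) {
        prefix[depth] = (char)('0' + child);  /* left gets 0, right gets 1 */
        s = parse_node(s, prefix, depth + 1);
        s++;                         /* skip the '0' or '1' written after the child */
        if (child == 0)
            s++;                     /* skip the ',' between the two children */
    }
    return s + 1;                    /* skip ')' */
}

int main(void) {
    char prefix[256];
    /* Evolution string for symbols a, b, c with frequencies 1, 2, 3: */
    parse_node("((a0,b1)0,c1)", prefix, 0);  /* prints a -> 00, b -> 01, c -> 1 */
    return 0;
}

As a sanity check, these codes give a total of 1*2 + 2*2 + 3*1 = 9 bits, matching what the merge loop accumulates for the same frequencies (3 + 6).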