Finding the length of compressed text (Huffman coding)

Question

I am given a text of n characters and a binary tree generated by Huffman coding, whose leaf nodes have two attributes: a string (the character itself) and an integer (its frequency in the text). The path from the root to any leaf represents its codeword.

I would like to write a recursive function that calculates the length of the compressed text, and to find its Big-O complexity.

For instance, if I have the text

abaccab

and each character has an associated frequency and depth in the Huffman tree:

   4 
  / \ 
 a:3 5 
    / \ 
   b:2 c:2

then the overall length of the compressed text is 11 bits (3*1 + 2*2 + 2*2).

I came up with this, but it seems very crude:

def get_length(node, depth):
    #Leaf node
    if node.left_child is None and node.right_child is None: 
        return node.freq*depth

    #Node with only one child
    elif node.left_child is None and node.right_child is not None: 
        return get_length(node.right_child, depth+1)
    elif node.right_child is None and node.left_child is not None:
        return get_length(node.left_child, depth+1)

    #Node with two children
    else:
        return get_length(node.left_child, depth+1) + get_length(node.right_child, depth+1)

get_length(root,0)

Complexity: O(log 2n) where n is the number of characters.

How can I improve this? What would be the complexity in this case?

score 1 · Answer 1 · answered Aug 04 '19 at 23:33

To find the exact total length of the compressed text, I don't see any way around dealing individually with each unique character and the count of how many times it occurs in the text. That is O(n) work in total, where n is the number of unique characters in the text (which is also the number of leaf nodes in the Huffman tree). There are several different ways to represent the mapping from Huffman codes to plaintext letters. Your binary tree representation is good for finding the exact total length of the compressed text: the tree has 2*n - 1 nodes in total, and a recursive scan that visits every node once takes time proportional to 2*n - 1, which is also O(n).

def get_length(node, depth):
    #null link from node with only one child, either left or right
    #(checked first, because None has no left_child/right_child attributes):
    if node is None:
        print("not a properly constructed Huffman tree")
        return 0

    #Leaf node
    elif node.left_child is None and node.right_child is None:
        return node.freq*depth

    #Node with two children
    else:
        return get_length(node.left_child, depth+1) + get_length(node.right_child, depth+1)

get_length(root,0)
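
As a quick check, the function can be run against the three-leaf tree from the question. This is only a minimal sketch: the `Node` class and the `count_nodes` helper are assumptions made for illustration (the question does not show its node class), and only the attribute names `freq`, `left_child` and `right_child` are taken from the code above.

```python
class Node:
    """Hypothetical node type; only the attribute names come from the question."""
    def __init__(self, freq=0, char=None, left_child=None, right_child=None):
        self.freq = freq
        self.char = char
        self.left_child = left_child
        self.right_child = right_child


def get_length(node, depth):
    # A missing child means the tree is not a proper Huffman tree.
    if node is None:
        return 0
    # Leaf: its codeword has `depth` bits and is used `freq` times.
    if node.left_child is None and node.right_child is None:
        return node.freq * depth
    # Internal node: add up both subtrees, one level deeper.
    return (get_length(node.left_child, depth + 1)
            + get_length(node.right_child, depth + 1))


def count_nodes(node):
    # Number of nodes the recursion visits (hypothetical helper).
    if node is None:
        return 0
    return 1 + count_nodes(node.left_child) + count_nodes(node.right_child)


# Tree for "abaccab": a:3 at depth 1, b:2 and c:2 at depth 2.
# The freq of internal nodes is never read by get_length, so it is left at 0.
inner = Node(left_child=Node(2, "b"), right_child=Node(2, "c"))
root = Node(left_child=Node(3, "a"), right_child=inner)

print(get_length(root, 0))   # 3*1 + 2*2 + 2*2 = 11
print(count_nodes(root))     # 2*3 - 1 = 5 nodes for 3 unique characters
```

The node count illustrates why the scan is linear in the number of unique characters and does not depend on the length of the text.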

Ajax1234 · Answer 2 · 2018-05-22T19:31:08.810

While the complexity of finding the length of the compressed text should be O(n) (utilizing a simple len), the time complexity of completing the encoding should be O(nlog(n)). The algorithm is as follows:

t1 = FullTree
for each character in uncompressed input do: #O(n)
  tree_lookup(t1, character) #O(log(n))

Looping over the uncompressed input is O(n), while finding a node in a balanced binary tree is O(log(n)) (O(n) in the worst case, when the tree is not balanced). Thus, the result is n*O(log(n)) => O(nlog(n)). Also, note that O(log 2n) as the complexity of a lookup is accurate: by the rules of logarithms it can be rewritten as O(log(2)+log(n)) => O(k + log(n)), for some constant k. Since Big-O discards additive constants, O(k+log(n)) => O(log(n)).
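
The per-character tree search can also be avoided altogether. The sketch below is an alternative to the lookup loop above, not the algorithm this answer implements: it walks the tree once to build a table of codewords, after which each input character costs one dictionary lookup. The nested-tuple tree and the names `build_codes` and `encode` are assumptions made for illustration.

```python
def build_codes(tree, prefix=""):
    # A leaf is a single character; an internal node is a (left, right) tuple.
    if isinstance(tree, str):
        return {tree: prefix}
    left, right = tree
    codes = build_codes(left, prefix + "0")
    codes.update(build_codes(right, prefix + "1"))
    return codes


def encode(text, codes):
    # One dictionary lookup per input character.
    return "".join(codes[ch] for ch in text)


# Hypothetical tree for "abaccab": left branch = 0, right branch = 1.
tree = (("c", "b"), "a")
codes = build_codes(tree)
print(codes)                     # {'c': '00', 'b': '01', 'a': '1'}
print(encode("abaccab", codes))  # 10110000101
```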

You can improve your binary tree by creating a simpler lookup in your tree:

from collections import Counter

class Tree:
  def __init__(self, node1, node2):
     self.right = node1
     self.left = node2
     self.value = sum(getattr(i, 'value', i[-1]) for i in [node1, node2])
  def __contains__(self, _node):
     if self.value == _node:
       return True
     return _node in self.left or _node in self.right
  def __lt__(self, _node): #needed to apply sorted function
     return self.value < getattr(_node, 'value', _node[-1])
  def lookup(self, _t, path = []):
     if self.value == _t:
       return ''.join(map(str, path))
     if self.left and _t in self.left:
       return ''.join(map(str, path+[0])) if isinstance(self.left, tuple) else self.left.lookup(_t, path+[0])
     if self.right and _t in self.right:
       return ''.join(map(str, path+[1])) if isinstance(self.right, tuple) else self.right.lookup(_t, path+[1])
  def __getitem__(self, _node):
     return self.lookup(_node)

s = list('abaccab')
r = sorted(Counter(s).items(), key=lambda x:x[-1])
while len(r) > 1:
  a, b, *_r = r
  r = sorted(_r+[Tree(a, b)])

compressed_text = ''.join(r[0][i] for i in s)

Output:

'10110000101'
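
Since the question asks only for the length, the encoded string does not need to be built at all: the length is the sum, over the unique characters, of frequency times codeword length. A small sketch, assuming the codewords that the output above implies for this input (a -> 1, b -> 01, c -> 00); the `codes` dictionary is hard-coded here for illustration.

```python
from collections import Counter

text = "abaccab"
# Assumed codewords, read off the output '10110000101' for this input.
codes = {"a": "1", "b": "01", "c": "00"}

# Sum of frequency * codeword length over the unique characters.
length = sum(count * len(codes[ch]) for ch, count in Counter(text).items())

print(length)  # 11
print(length == len("".join(codes[ch] for ch in text)))  # True
```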
