Huffman coding (also known as Huffman encoding) is an algorithm for lossless data compression, and it forms the basic idea behind many file compression formats. This post covers fixed-length and variable-length encoding, uniquely decodable codes, the prefix rule, and Huffman tree construction.

Overview

We already know that every character is stored as a sequence of 0’s and 1’s, using 8 bits. This is known as “fixed-length encoding”, as each character uses the same fixed number of bits of storage.

Given a text, how can we reduce the amount of space required to store its characters?

The idea is to use “variable-length encoding”. We can exploit the fact that some characters occur more frequently than others in a text to design an algorithm that represents the same piece of text using fewer bits. In variable-length encoding, we assign each character a number of bits that depends on its frequency in the given text: some characters might end up taking a single bit, some two bits, some three bits, and so on. The problem with variable-length encoding lies in its decoding.

Given a sequence of bits, how can it be decoded uniquely?

Let’s consider the string aabacdab. It has 8 characters and therefore takes 8×8 = 64 bits of storage under fixed-length encoding. Note that the frequencies of the characters a, b, c, and d are 4, 2, 1, and 1, respectively. Let’s try to represent aabacdab using fewer bits, exploiting the fact that a occurs more frequently than b, and b occurs more frequently than c and d. We start by arbitrarily assigning the single-bit code 0 to a, the 2–bit code 11 to b, and the 3–bit codes 100 and 011 to c and d, respectively.

a 0
b 11
c 100
d 011

So, the string aabacdab will be encoded to 00110100011011 (0|0|11|0|100|011|0|11) using the above codes. But the real problem lies in decoding: the string 00110100011011 is ambiguous, as it can be decoded to,

0|011|0|100|011|0|11    adacdab
0|0|11|0|100|0|11|011   aabacabd
0|011|0|100|0|11|0|11   adacabab

and so on
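The ambiguity can be demonstrated mechanically. Here is a small Python sketch (the code table and bit string are taken from the example above) that enumerates every possible decoding of the bit string:

```python
# Codes from the example above -- note that 0 is a prefix of 011.
codes = {'a': '0', 'b': '11', 'c': '100', 'd': '011'}

def all_decodings(bits):
    """Recursively enumerate every string that encodes to the given bits."""
    if not bits:
        return ['']
    results = []
    for ch, code in codes.items():
        if bits.startswith(code):
            # This code matches the front of the bit string; decode the rest.
            results += [ch + rest for rest in all_decodings(bits[len(code):])]
    return results

decodings = all_decodings('00110100011011')
print(len(decodings) > 1)   # True -- more than one decoding, so the encoding is ambiguous
```

Both adacdab and the original aabacdab appear among the results, confirming that the decoder cannot tell which string was intended.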

To prevent ambiguities in decoding, we will ensure that our encoding satisfies the “prefix rule”, which will result in “uniquely decodable codes”. The prefix rule states that no code is a prefix of another code. By code, we mean the bits used for a particular character. In the above example, 0 is the prefix of 011, which violates the prefix rule. If our codes satisfy the prefix rule, the decoding will be unambiguous (and vice versa).

Let’s consider the above example again. This time we assign codes that satisfy the prefix rule to characters 'a', 'b', 'c', and 'd'.

a   0
b   10
c   110
d   111

Using the above codes, the string aabacdab will be encoded to 00100110111010 (0|0|10|0|110|111|0|10). Now we can uniquely decode 00100110111010 back to our original string aabacdab.
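Because no code is a prefix of another, a single left-to-right scan decodes the bit string greedily and unambiguously. A quick Python sketch using the prefix-free table above:

```python
# Prefix-free codes from the example above.
codes = {'a': '0', 'b': '10', 'c': '110', 'd': '111'}
decode_table = {code: ch for ch, code in codes.items()}

def decode(bits):
    out, buffer = [], ''
    for bit in bits:
        buffer += bit
        if buffer in decode_table:       # prefix rule: the first match is the only match
            out.append(decode_table[buffer])
            buffer = ''
    return ''.join(out)

print(decode('00100110111010'))  # aabacdab
```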

Now that we are clear on variable-length encoding and prefix rule, let’s talk about Huffman coding.

Huffman Coding

The technique works by building a binary tree of nodes. A node can be either a leaf node or an internal node. Initially, all nodes are leaf nodes; each contains a character and its weight (frequency of occurrence). An internal node contains a weight and links to its two child nodes. By convention, bit 0 represents following the left child and bit 1 represents following the right child. A finished tree for n characters has n leaf nodes and n−1 internal nodes. The tree should only include characters that actually appear in the text; including unused characters would lengthen the codes unnecessarily.

 
We will use a priority queue for building Huffman Tree, where the node with the lowest frequency has the highest priority. Following are the complete steps:

1. Create a leaf node for each character and add them to the priority queue.

2. While there is more than one node in the queue:

  • Remove the two nodes of the highest priority (the lowest frequency) from the queue.
  • Create a new internal node with these two nodes as children and a frequency equal to the sum of both nodes’ frequencies.
  • Add the new node to the priority queue.

3. The remaining node is the root node and the tree is complete.
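The steps above map directly onto Python’s heapq module, used here as the priority queue. This is a minimal sketch: ties between equal frequencies are broken by insertion order, so different implementations can build different (but equally optimal) trees.

```python
import heapq
import itertools

def build_tree(freqs):
    """Step 1: one leaf per character; steps 2-3: merge the two lowest until one node remains."""
    counter = itertools.count()              # tie-breaker so equal frequencies never compare payloads
    heap = [(f, next(counter), ch) for ch, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        node1 = heapq.heappop(heap)          # lowest frequency (highest priority)
        node2 = heapq.heappop(heap)          # second lowest
        merged = (node1[0] + node2[0], next(counter), (node1, node2))
        heapq.heappush(heap, merged)         # internal node: children, summed frequency
    return heap[0]                           # the remaining node is the root

root = build_tree({'A': 15, 'B': 7, 'C': 6, 'D': 6, 'E': 5})
print(root[0])  # 39 -- the root's weight is the total frequency of all characters
```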

Consider some text consisting of only 'A', 'B', 'C', 'D', and 'E' characters, and their frequencies are 15, 7, 6, 6, 5, respectively. The following figures illustrate the steps followed by the algorithm:
 

Huffman Coding: Step 1


Huffman Coding: Step 2


Huffman Coding: Step 3


Huffman Coding: Step 4


Huffman Coding: Step 5

 
The path from the root to any leaf node stores the optimal prefix code (also called Huffman code) corresponding to the character associated with that leaf node.

Huffman Tree
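Reading the codes off the finished tree is a depth-first walk, appending 0 on each left branch and 1 on each right branch. A self-contained Python sketch for the example frequencies (the exact bit patterns depend on how ties are broken, but the code lengths here are forced by the frequencies: A gets 1 bit, the rest get 3):

```python
import heapq
import itertools

def huffman_codes(freqs):
    """Return a {char: code} map for the given frequency table (ties broken arbitrarily)."""
    counter = itertools.count()
    heap = [(f, next(counter), ch) for ch, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)
        hi = heapq.heappop(heap)
        heapq.heappush(heap, (lo[0] + hi[0], next(counter), (lo, hi)))
    codes = {}
    def walk(node, path):
        payload = node[2]
        if isinstance(payload, tuple):    # internal node: (left_child, right_child)
            walk(payload[0], path + '0')  # bit 0 = follow the left child
            walk(payload[1], path + '1')  # bit 1 = follow the right child
        else:                             # leaf node: a character
            codes[payload] = path or '0'  # lone-character edge case
    walk(heap[0], '')
    return codes

codes = huffman_codes({'A': 15, 'B': 7, 'C': 6, 'D': 6, 'E': 5})
print({ch: len(code) for ch, code in sorted(codes.items())})
# the most frequent character, A, gets the shortest code
```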

Implementation

Following is the C++, Java, and Python implementation of the Huffman coding compression algorithm:

C++



Output:

Huffman Codes are:
 
c 11111
h 111100
f 11101
r 11100
t 11011
p 110101
i 1100
g 0011
l 00101
a 010
o 000
d 10011
H 00100
. 111101
s 0110
m 0111
e 110100
  101
n 1000
u 10010
 
The original string is:
Huffman coding is a data compression algorithm.
 
The encoded string is:
00100100101110111101011101010001011111100010011110010000011101110001101010101011001101011011010101111110000111110101111001101000110011011000001000101010001010011000111001100110111111000111111101
 
The decoded string is:
Huffman coding is a data compression algorithm.

Java



Output:

Huffman Codes are: { =100, a=010, c=0011, d=11001, e=110000, f=0000, g=0001, H=110001, h=110100, i=1111, l=101010, m=0110, n=0111, .=10100, o=1110, p=110101, r=0010, s=1011, t=11011, u=101011}
 
The original string is: Huffman coding is a data compression algorithm.
 
The encoded string is: 11000110101100000000011001001111000011111011001111101110001100111110111000101001100101011011010100001111100110110101001011000010111011111111100111100010101010000111100010111111011110100011010100
 
The decoded string is: Huffman coding is a data compression algorithm.

Python



Output:

Huffman Codes are: {'l': '00000', 'p': '00001', 't': '0001', 'h': '00100', 'e': '00101', 'g': '0011', 'a': '010', 'm': '0110', '.': '01110', 'r': '01111', ' ': '100', 'n': '1010', 's': '1011', 'c': '11000', 'f': '11001', 'i': '1101', 'o': '1110', 'd': '11110', 'u': '111110', 'H': '111111'}
 
The original string is: Huffman coding is a data compression algorithm.
 
The encoded string is: 11111111111011001110010110010101010011000111011110110110100011100110110111000101001111001000010101001100011100110000010111100101101110111101111010101000100000000111110011111101000100100011001110
 
The decoded string is: Huffman coding is a data compression algorithm.
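As a rough self-contained Python sketch of the whole pipeline (frequency counting, tree building, encoding, decoding): the exact codes it produces differ from the outputs shown above, because tie-breaking among equal frequencies varies between implementations, but every Huffman tree for a given frequency table yields the same minimal total length, so the encoded size and the round trip match.

```python
import heapq
import itertools
from collections import Counter

def huffman_codes(text):
    """Build a Huffman tree from character frequencies and read off the codes."""
    counter = itertools.count()
    heap = [(f, next(counter), ch) for ch, f in Counter(text).items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        lo, hi = heapq.heappop(heap), heapq.heappop(heap)
        heapq.heappush(heap, (lo[0] + hi[0], next(counter), (lo, hi)))
    codes = {}
    def walk(node, path):
        if isinstance(node[2], tuple):        # internal node
            walk(node[2][0], path + '0')
            walk(node[2][1], path + '1')
        else:                                 # leaf node
            codes[node[2]] = path or '0'
    walk(heap[0], '')
    return codes

def encode(text, codes):
    return ''.join(codes[ch] for ch in text)

def decode(bits, codes):
    table = {v: k for k, v in codes.items()}
    out, buf = [], ''
    for b in bits:
        buf += b
        if buf in table:                      # prefix rule: unique match
            out.append(table[buf])
            buf = ''
    return ''.join(out)

text = 'Huffman coding is a data compression algorithm.'
codes = huffman_codes(text)
bits = encode(text, codes)
print(len(bits), decode(bits, codes) == text)  # 194 True
```

Note that the encoded length is 194 bits regardless of tie-breaking, matching the three outputs above.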

Note that storing the input string takes 47×8 = 376 bits, but our encoded string takes only 194 bits, i.e., a saving of about 48%. To keep the program readable, we have used the string class to store the encoded string rather than packing it into actual bits.

Since an efficient priority queue requires O(log(n)) time per insertion, and a full binary tree with n leaves has 2n−1 nodes, building the Huffman tree takes O(n.log(n)) time, where n is the number of distinct characters.

 