Huffman Coding

Optimal prefix coding algorithm for lossless data compression. Fundamental technique in digital communications and information theory.

01 Introduction

Huffman coding is a greedy algorithm developed by David A. Huffman in 1952 while he was a Ph.D. student at MIT. It provides an optimal method for constructing variable-length prefix codes, minimizing the expected codeword length for a given probability distribution of symbols.

📊

Optimal Prefix Code

Produces the minimum redundancy code for given symbol probabilities. No other prefix code can achieve better compression.

🌳

Binary Tree Structure

Constructs a full binary tree where leaves represent symbols and paths represent codes (0=left, 1=right).

Greedy Approach

At each step, combines the two least probable symbols/sets, ensuring optimal substructure.

Key Characteristics

  • Lossless Compression: Original data can be perfectly reconstructed
  • Variable-Length Codes: Frequent symbols get shorter codes, rare symbols get longer codes
  • Prefix-Free: No code is a prefix of another, enabling unambiguous decoding
  • Statistical Coding: Requires knowledge of symbol frequencies/probabilities

02 Theoretical Foundation

Information Theory Basics

Huffman coding is rooted in Claude Shannon's information theory. The fundamental concept is that information content is inversely proportional to probability.

Self-Information of symbol x with probability p(x):

I(x) = -log₂(p(x)) bits

Entropy (Average Information):

H(X) = -Σ p(x) log₂(p(x))

Huffman code achieves average length L satisfying: H(X) ≤ L < H(X) + 1
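These definitions are easy to check numerically; a short Python sketch (the three-symbol distribution is purely illustrative):

```python
import math

def self_information(p):
    """I(x) = -log2(p(x)), in bits."""
    return -math.log2(p)

def entropy(probs):
    """H(X) = -sum of p(x) log2 p(x), in bits/symbol."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Illustrative skewed three-symbol source
probs = [0.5, 0.25, 0.25]
print(self_information(0.5))  # 1.0 -- a 50% symbol carries exactly 1 bit
print(entropy(probs))         # 1.5 bits/symbol
```

Note that a symbol with probability 0.5 gets exactly 1 bit of self-information, which is why Huffman assigns very frequent symbols very short codes.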

Kraft-McMillan Inequality

For any prefix code with codeword lengths l₁, l₂, ..., lₙ:

Σ 2^(-lᵢ) ≤ 1

A Huffman code satisfies this inequality with equality (a complete code): every leaf of the full binary tree carries a symbol, so no codeword space is wasted. Completeness alone does not prove optimality; that follows from the exchange argument outlined below.
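The inequality can be verified with exact rational arithmetic; a minimal sketch (the length multiset 1, 3, 3, 3, 4, 4 is the one produced in the worked example of Section 04):

```python
from fractions import Fraction

def kraft_sum(lengths):
    """Compute the Kraft sum of 2^(-l_i) exactly, with no float rounding."""
    return sum(Fraction(1, 2 ** l) for l in lengths)

print(kraft_sum([1, 3, 3, 3, 4, 4]))  # 1   -> complete code, equality holds
print(kraft_sum([1, 2, 3]))           # 7/8 -> valid prefix code, not complete
```

A sum greater than 1 (e.g. lengths 1, 1, 2) means no prefix code with those lengths can exist at all.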

Optimality Proof

Why Huffman is Optimal

  1. Property 1: In an optimal code, if p(a) < p(b), then l(a) ≥ l(b)
  2. Property 2: In an optimal code, there exist two longest codewords differing only in the last bit
  3. Property 3: The two least probable symbols have codewords of equal maximum length

Huffman's algorithm satisfies all three properties by construction, guaranteeing optimality.

03 Huffman Algorithm

Algorithm Steps

  1. Frequency Analysis: Calculate occurrence probability (or frequency) for each symbol in the source data.
  2. Initialize Priority Queue: Create a leaf node for each symbol and add it to a min-heap/priority queue ordered by frequency.
  3. Tree Construction: While more than one node in the queue:
    • Remove the two nodes with lowest frequencies
    • Create new internal node with frequency = sum of two nodes
    • Make the two nodes left and right children (0=left, 1=right)
    • Insert new node into queue
  4. Code Assignment: The remaining node is the root. Traverse from root to leaves, appending 0 for left branches and 1 for right branches.
  5. Encoding: Replace each source symbol with its corresponding Huffman code.

Pseudocode

function HuffmanCoding(S):
    // S: set of symbols with frequencies
    n ← |S|
    Q ← priority queue ordered by frequency
    
    for each symbol s in S:
        insert Node(s, s.freq) into Q
    
    for i ← 1 to n-1:
        z ← new Node()
        z.left ← x ← Extract-Min(Q)
        z.right ← y ← Extract-Min(Q)
        z.freq ← x.freq + y.freq
        Insert(Q, z)
    
    return Extract-Min(Q)  // Root of Huffman tree

function AssignCodes(node, code):
    if node is leaf:
        node.code ← code
    else:
        AssignCodes(node.left, code + "0")
        AssignCodes(node.right, code + "1")
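The pseudocode above maps almost directly onto Python's standard-library heapq; a runnable sketch (names like `huffman_tree` and `assign_codes` are illustrative, and the exact codewords depend on how frequency ties are broken):

```python
import heapq
from itertools import count

class Node:
    def __init__(self, freq, symbol=None, left=None, right=None):
        self.freq, self.symbol = freq, symbol
        self.left, self.right = left, right

def huffman_tree(freqs):
    """Build a Huffman tree from {symbol: frequency}; returns the root."""
    tie = count()  # tie-breaker so heapq never has to compare Node objects
    q = [(f, next(tie), Node(f, s)) for s, f in freqs.items()]
    heapq.heapify(q)
    while len(q) > 1:
        fx, _, x = heapq.heappop(q)  # two least frequent nodes
        fy, _, y = heapq.heappop(q)
        heapq.heappush(q, (fx + fy, next(tie), Node(fx + fy, left=x, right=y)))
    return q[0][2]

def assign_codes(node, code="", table=None):
    """Traverse the tree: append 0 for left branches, 1 for right branches."""
    if table is None:
        table = {}
    if node.symbol is not None:           # leaf
        table[node.symbol] = code or "0"  # single-symbol edge case
    else:
        assign_codes(node.left, code + "0", table)
        assign_codes(node.right, code + "1", table)
    return table

codes = assign_codes(huffman_tree({"A": 0.45, "B": 0.13, "C": 0.12,
                                   "D": 0.16, "E": 0.09, "F": 0.05}))
print(codes)  # code lengths are 1, 3, 3, 3, 4, 4; exact bits depend on ties
```

Different tie-breaking can yield different (but equally optimal) code tables; only the multiset of code lengths is guaranteed.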

Complexity Analysis

Operation                         Complexity   Explanation
Building priority queue           O(n)         Single heapify pass over the symbols
Tree construction (n-1 rounds)    O(n log n)   Each extract-min and insert is O(log n)
Code assignment                   O(n)         One traversal of the tree
Total time                        O(n log n)   Dominated by the heap operations
Space                             O(n)         Storage for the tree and code table

04 Worked Example

Problem Statement

Given a source with symbols {A, B, C, D, E, F} and probabilities:

Symbol       A      B      C      D      E      F
Probability  0.45   0.13   0.12   0.16   0.09   0.05

Step-by-Step Construction

Step 1: Initial Priority Queue

Queue: [F(0.05), E(0.09), C(0.12), B(0.13), D(0.16), A(0.45)]
Sorted by increasing frequency

Step 2: Iteration 1 - Combine F and E

Remove: F(0.05), E(0.09)
Create: Node(0.14) with children F(0), E(1)
New Queue: [C(0.12), B(0.13), Node(0.14), D(0.16), A(0.45)]

Step 3: Iteration 2 - Combine C and B

Remove: C(0.12), B(0.13)
Create: Node(0.25) with children C(0), B(1)
New Queue: [Node(0.14), D(0.16), Node(0.25), A(0.45)]

Step 4: Iteration 3 - Combine Node(0.14) and D

Remove: Node(0.14), D(0.16)
Create: Node(0.30) with children Node(0.14)(0), D(1)
New Queue: [Node(0.25), Node(0.30), A(0.45)]

Step 5: Iteration 4 - Combine Node(0.25) and Node(0.30)

Remove: Node(0.25), Node(0.30)
Create: Node(0.55) with children Node(0.25)(0), Node(0.30)(1)
New Queue: [A(0.45), Node(0.55)]

Step 6: Final Combination

Remove: A(0.45), Node(0.55)
Create: Root(1.00) with children A(0), Node(0.55)(1)
Final Tree Complete!

Final Huffman Tree Visualization

Root (1.00)
├─ 0 → A (0.45)
└─ 1 → Node (0.55)
    ├─ 0 → Node (0.25)
    │   ├─ 0 → C (0.12)
    │   └─ 1 → B (0.13)
    └─ 1 → Node (0.30)
        ├─ 0 → Node (0.14)
        │   ├─ 0 → F (0.05)
        │   └─ 1 → E (0.09)
        └─ 1 → D (0.16)

Final Code Table

Symbol  Probability  Huffman Code  Length  Contribution (p × l)
A       0.45         0             1       0.45
B       0.13         101           3       0.39
C       0.12         100           3       0.36
D       0.16         111           3       0.48
E       0.09         1101          4       0.36
F       0.05         1100         4       0.20

Average code length L = 2.24 bits/symbol
Entropy H(X) ≈ 2.22 bits/symbol
Efficiency H/L ≈ 99.1%
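The totals can be reproduced in a few lines of Python (probabilities and code lengths copied from the rows above):

```python
import math

probs   = {"A": 0.45, "B": 0.13, "C": 0.12, "D": 0.16, "E": 0.09, "F": 0.05}
lengths = {"A": 1, "B": 3, "C": 3, "D": 3, "E": 4, "F": 4}

avg_len = sum(probs[s] * lengths[s] for s in probs)       # L = sum of p_i * l_i
entropy = -sum(p * math.log2(p) for p in probs.values())  # H(X)

print(f"L = {avg_len:.2f} bits/symbol")         # 2.24
print(f"H = {entropy:.2f} bits/symbol")         # 2.22
print(f"efficiency = {entropy / avg_len:.1%}")  # 99.1%
```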

Analysis Results

Encoding Example

Encoding the string "BAD":

B → 101
A → 0
D → 111

BAD → 101 0 111 = 1010111 (7 bits)

Fixed-length (3 bits each, e.g. A=000, B=001, D=011): 001 000 011 = 001000011 (9 bits)
Savings: (9 − 7)/9 ≈ 22.2%
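Because no code is a prefix of another, decoding is an unambiguous left-to-right scan of the bitstream; a minimal sketch using the code table from the worked example (the `decode` helper is illustrative):

```python
def decode(bits, codebook):
    """Decode a bitstring using a prefix-free codebook {symbol: code}."""
    inverse = {code: sym for sym, code in codebook.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inverse:  # a prefix-free code matches at most one symbol
            out.append(inverse[buf])
            buf = ""
    assert buf == "", "bitstream ended in the middle of a codeword"
    return "".join(out)

codes = {"A": "0", "B": "101", "C": "100", "D": "111", "E": "1101", "F": "1100"}
print(decode("1010111", codes))  # BAD
```

No separators are needed between codewords: the moment the buffer matches a code, that symbol is emitted and the scan restarts.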

05 Applications in Communication

🗜️

File Compression

GZIP, ZIP, Brotli (combined with LZ77)

🖼️

Image Coding

JPEG (DC coefficients), PNG (DEFLATE)

🎵

Audio Compression

MP3 (Huffman coding of quantized values)

📹

Video Coding

H.264/AVC, H.265/HEVC, AV1

📡

Wireless Comms

Efficient data transmission protocols

🔒

Cryptography

Randomness extraction, entropy coding

Variants and Extensions

Adaptive Huffman Coding

Updates tree dynamically as data arrives. Used when symbol statistics are unknown beforehand (e.g., Vitter's algorithm).

n-ary Huffman Coding

Extension to D-ary trees (D > 2). Useful when channel supports multiple symbols or for byte-aligned coding.

Length-Limited Huffman

Constrains maximum codeword length. Important for hardware implementations with limited buffer sizes.

Canonical Huffman

Codes are lexicographically ordered by length. Enables faster decoding with lookup tables.
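A sketch of canonical code assignment from the code lengths alone, which is essentially how DEFLATE (RFC 1951) reconstructs its Huffman tables on the decoder side; the function name is illustrative:

```python
def canonical_codes(lengths):
    """Assign canonical Huffman codes from {symbol: code_length}."""
    code, prev_len, table = 0, 0, {}
    # Canonical order: shorter codes first, ties broken alphabetically
    for sym, length in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        code <<= (length - prev_len)  # left-shift when moving to longer codes
        table[sym] = format(code, f"0{length}b")
        code += 1
        prev_len = length
    return table

# Lengths from the Section 04 example; the codewords differ from that tree,
# but the lengths (and hence the compression) are identical
print(canonical_codes({"A": 1, "B": 3, "C": 3, "D": 3, "E": 4, "F": 4}))
# {'A': '0', 'B': '100', 'C': '101', 'D': '110', 'E': '1110', 'F': '1111'}
```

Since the whole table is recoverable from the length list, an encoder only needs to transmit one small integer per symbol rather than the tree itself.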

Comparison with Other Codes

Method        Type             Optimality                   Speed      Use Case
Huffman       Variable-length  Optimal for integer lengths  Fast       General compression
Arithmetic    Fractional       Near entropy limit           Slower     High compression
Shannon-Fano  Variable-length  Suboptimal                   Fast       Historical/educational
LZ77/LZ78     Dictionary       Universal                    Very fast  Repetitive data
Run-Length    Simple           Context-dependent            Fastest    Bitmaps, fax

06 Practice Problems

Problem 1: Basic Encoding

Given symbols {X, Y, Z} with probabilities {0.6, 0.3, 0.1}, construct the Huffman code and calculate the average code length.

Step 1: Combine Z(0.1) and Y(0.3) → Node(0.4)
Step 2: Combine Node(0.4) and X(0.6) → Root(1.0)

Codes:
X: 1 (length 1)
Y: 01 (length 2)
Z: 00 (length 2)

Average Length = 0.6×1 + 0.3×2 + 0.1×2 = 1.4 bits/symbol
Entropy = 1.295 bits/symbol
Efficiency = 92.5%

Problem 2: Efficiency Calculation

For a source with 8 equiprobable symbols, what is the Huffman coding efficiency compared to fixed-length coding?

For 8 equiprobable symbols: p = 1/8 = 0.125 each

Huffman will create a balanced tree:
All codes have length 3 (since 2^3 = 8)

Fixed-length: 3 bits/symbol (⌈log₂8⌉ = 3)
Huffman: 3 bits/symbol

Efficiency = 100% (no improvement for uniform distribution)

Entropy H(X) = -8 × (1/8) × log₂(1/8) = 3 bits/symbol
This demonstrates Huffman coding works best with skewed distributions.

Problem 3: Communication System Design

A sensor transmits 4 types of packets: Data (70%), Control (20%), Ack (8%), and Error (2%). Design a Huffman code for this protocol and calculate the bandwidth savings if the channel capacity is 1000 bits/second.

Probabilities: D:0.70, C:0.20, A:0.08, E:0.02

Construction:
1. Combine E(0.02) + A(0.08) = Node(0.10)
2. Combine Node(0.10) + C(0.20) = Node(0.30)
3. Combine Node(0.30) + D(0.70) = Root(1.00)

Codes:
D: 1 (1 bit)
C: 01 (2 bits)
A: 001 (3 bits)
E: 000 (3 bits)

Average length = 0.7×1 + 0.2×2 + 0.08×3 + 0.02×3 = 1.4 bits/packet
Fixed-length would require 2 bits/packet (⌈log₂4⌉ = 2)

Bandwidth savings: (2 - 1.4)/2 × 100% = 30%
Effective throughput on a 1000 bit/s link: 1000/1.4 ≈ 714 packets/sec (Huffman) vs 1000/2 = 500 packets/sec (fixed-length)

07 Summary & Key Takeaways

Essential Formulas

Average Code Length: L = Σ pᵢ × lᵢ

Efficiency: η = H(X)/L × 100%

Redundancy: R = L - H(X) bits/symbol

Variance: σ² = Σ pᵢ(lᵢ - L)²