Optimal prefix coding algorithm for lossless data compression. Fundamental technique in digital communications and information theory.
Huffman coding is a greedy algorithm developed by David A. Huffman in 1952 while he was a Ph.D. student at MIT. It provides an optimal method for constructing variable-length prefix codes, minimizing the expected codeword length for a given probability distribution of symbols.
- Optimality: produces the minimum-redundancy code for the given symbol probabilities; no other prefix code can achieve a shorter expected codeword length.
- Prefix structure: constructs a full binary tree where leaves represent symbols and root-to-leaf paths represent codes (0 = left, 1 = right).
- Greedy construction: at each step, combines the two least probable symbols/subtrees, which preserves optimal substructure.
Huffman coding is rooted in Claude Shannon's information theory. The fundamental concept is that information content is inversely proportional to probability.
Self-information of a symbol x with probability p(x): I(x) = −log₂ p(x)
Entropy (average information): H(X) = −Σₓ p(x) log₂ p(x)
Huffman code achieves average length L satisfying: H(X) ≤ L < H(X) + 1
For any prefix code with codeword lengths l₁, l₂, ..., lₙ, the Kraft inequality must hold: Σᵢ 2^(−lᵢ) ≤ 1.
A Huffman code satisfies this inequality with equality (a complete code): every internal node of the tree has two children, so no codeword can be shortened without violating the prefix property. Completeness is a necessary condition for optimality, since a code with Kraft sum strictly below 1 could be shortened.
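These quantities are easy to check numerically. A minimal Python sketch, using the symbol probabilities and codeword lengths from the six-symbol worked example later in this document:

```python
import math

# Probabilities from the six-symbol worked example (A..F); any distribution works.
p = {"A": 0.45, "B": 0.13, "C": 0.12, "D": 0.16, "E": 0.09, "F": 0.05}

# Self-information I(x) = -log2 p(x): rare symbols carry more information.
info = {s: -math.log2(q) for s, q in p.items()}

# Entropy H(X) = sum of p(x) * I(x), the average information per symbol.
H = sum(q * info[s] for s, q in p.items())

# Kraft sum for the Huffman codeword lengths of this source (1, 3, 3, 3, 4, 4).
lengths = {"A": 1, "B": 3, "C": 3, "D": 3, "E": 4, "F": 4}
kraft = sum(2.0 ** -l for l in lengths.values())

print(f"H(X) = {H:.3f} bits/symbol, Kraft sum = {kraft}")
```

The Kraft sum comes out to exactly 1.0, confirming the code is complete.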
An optimal prefix code must satisfy three properties: (1) a more probable symbol never has a longer codeword than a less probable one; (2) the two least probable symbols have codewords of equal, maximal length; (3) those two longest codewords differ only in their final bit. Huffman's algorithm satisfies all three by construction, guaranteeing optimality.
```
function HuffmanCoding(S):
    // S: set of symbols with frequencies
    n ← |S|
    Q ← priority queue ordered by frequency
    for each symbol s in S:
        insert Node(s, s.freq) into Q
    for i ← 1 to n-1:
        z ← new Node()
        z.left ← x ← Extract-Min(Q)
        z.right ← y ← Extract-Min(Q)
        z.freq ← x.freq + y.freq
        Insert(Q, z)
    return Extract-Min(Q)   // root of the Huffman tree

function AssignCodes(node, code):
    if node is leaf:
        node.code ← code
    else:
        AssignCodes(node.left, code + "0")
        AssignCodes(node.right, code + "1")
```
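The pseudocode maps almost line-for-line onto Python's `heapq` module. The sketch below adds a tie-breaking counter so the heap never has to compare subtrees directly; it is a minimal illustration, not production code:

```python
import heapq

def huffman_codes(freqs):
    """Build a Huffman code table from a {symbol: frequency} mapping."""
    # Heap entries: (frequency, tiebreak, tree). A tree is either a bare
    # symbol (leaf) or a (left, right) tuple (internal node).
    heap = [(f, i, s) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                       # degenerate case: one symbol
        return {heap[0][2]: "0"}
    tiebreak = len(heap)
    while len(heap) > 1:                     # n-1 merge steps
        f1, _, left = heapq.heappop(heap)    # two least frequent subtrees
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, tiebreak, (left, right)))
        tiebreak += 1
    codes = {}
    def assign(tree, code):                  # 0 = left edge, 1 = right edge
        if isinstance(tree, tuple):
            assign(tree[0], code + "0")
            assign(tree[1], code + "1")
        else:
            codes[tree] = code
    assign(heap[0][2], "")
    return codes
```

Running it on the six-symbol source from the worked example below yields codeword lengths (1, 3, 3, 3, 4, 4) and an average length of 2.24 bits/symbol.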
| Operation | Complexity | Explanation |
|---|---|---|
| Building Priority Queue | O(n) | Using heapify operation |
| Tree Construction (n-1 iterations) | O(n log n) | Each extract-min and insert is O(log n) |
| Code Assignment | O(n) | Single traversal of tree |
| Total Time | O(n log n) | Dominated by heap operations |
| Space Complexity | O(n) | Storage for tree and codes |
Given a source with symbols {A, B, C, D, E, F} and probabilities:
| Symbol | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| Probability | 0.45 | 0.13 | 0.12 | 0.16 | 0.09 | 0.05 |
Initial queue (sorted by increasing frequency):
[F(0.05), E(0.09), C(0.12), B(0.13), D(0.16), A(0.45)]

Step 1: Remove F(0.05) and E(0.09). Create Node(0.14) with children F(0), E(1).
New queue: [C(0.12), B(0.13), Node(0.14), D(0.16), A(0.45)]

Step 2: Remove C(0.12) and B(0.13). Create Node(0.25) with children C(0), B(1).
New queue: [Node(0.14), D(0.16), Node(0.25), A(0.45)]

Step 3: Remove Node(0.14) and D(0.16). Create Node(0.30) with children Node(0.14)(0), D(1).
New queue: [Node(0.25), Node(0.30), A(0.45)]

Step 4: Remove Node(0.25) and Node(0.30). Create Node(0.55) with children Node(0.25)(0), Node(0.30)(1).
New queue: [A(0.45), Node(0.55)]

Step 5: Remove A(0.45) and Node(0.55). Create Root(1.00) with children A(0), Node(0.55)(1). The tree is complete.
| Symbol | Probability | Huffman Code | Length | Contribution (p × l) |
|---|---|---|---|---|
| A | 0.45 | 0 | 1 | 0.45 |
| B | 0.13 | 101 | 3 | 0.39 |
| C | 0.12 | 100 | 3 | 0.36 |
| D | 0.16 | 111 | 3 | 0.48 |
| E | 0.09 | 1101 | 4 | 0.36 |
| F | 0.05 | 1100 | 4 | 0.20 |

Average code length: L = 2.24 bits/symbol
Entropy: H(X) ≈ 2.22 bits/symbol
Efficiency: H/L ≈ 99.1%
Encoding the string "BAD":
B → 101, A → 0, D → 111
BAD → 101 0 111 = 1010111 (7 bits)
Fixed-length (3 bits each): 011 000 011 = 011000011 (9 bits)
Savings: (9 − 7)/9 ≈ 22.2%
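Decoding exploits the prefix property: read bits into a buffer until it matches a codeword, emit that symbol, and reset. A short sketch using the code table from the worked example above:

```python
# Code table from the worked example above.
codes = {"A": "0", "B": "101", "C": "100", "D": "111", "E": "1101", "F": "1100"}
decode_map = {v: k for k, v in codes.items()}

def encode(text):
    return "".join(codes[ch] for ch in text)

def decode(bits):
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in decode_map:      # prefix property: the first match is unambiguous
            out.append(decode_map[buf])
            buf = ""
    return "".join(out)

print(encode("BAD"))   # -> 1010111
```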
- File compression: GZIP, ZIP, Brotli (combined with LZ77)
- Image formats: JPEG (DC coefficients), PNG (DEFLATE)
- Audio: MP3 (Huffman coding of quantized values)
- Video: H.264/AVC, H.265/HEVC, AV1
- Networking: efficient data transmission protocols
- Information theory and cryptography: randomness extraction, entropy coding
- **Adaptive Huffman coding:** updates the tree dynamically as data arrives. Used when symbol statistics are unknown beforehand (e.g., Vitter's algorithm).
- **D-ary Huffman coding:** extension to D-ary trees (D > 2). Useful when the channel supports multiple symbols or for byte-aligned coding.
- **Length-limited Huffman coding:** constrains the maximum codeword length. Important for hardware implementations with limited buffer sizes.
- **Canonical Huffman coding:** codes are lexicographically ordered by length. Enables faster decoding with lookup tables.
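A canonical table can be derived from codeword lengths alone: sort symbols by (length, symbol), assign consecutive integers, and left-shift whenever the length increases. A sketch (the lengths used in the test are those of the worked example; only lengths, not the original bit patterns, are required):

```python
def canonical_codes(lengths):
    """Assign canonical Huffman codes from a {symbol: length} mapping."""
    codes, code, prev_len = {}, 0, 0
    for sym, l in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        code <<= l - prev_len            # append zeros when the length grows
        codes[sym] = format(code, f"0{l}b")
        code += 1
        prev_len = l
    return codes
```

With lengths {A: 1, B: 3, C: 3, D: 3, E: 4, F: 4} this yields 0, 100, 101, 110, 1110, 1111: the same lengths as before, but lexicographically ordered, so a decoder only needs the length histogram and the symbol order (this is the scheme DEFLATE uses).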
| Method | Type | Optimality | Speed | Use Case |
|---|---|---|---|---|
| Huffman | Variable-length | Optimal for integer lengths | Fast | General compression |
| Arithmetic | Fractional | Near entropy limit | Slower | High compression |
| Shannon-Fano | Variable-length | Suboptimal | Fast | Historical/educational |
| LZ77/LZ78 | Dictionary | Universal | Very fast | Repetitive data |
| Run-Length | Simple | Context-dependent | Fastest | Bitmaps, fax |
Given symbols {X, Y, Z} with probabilities {0.6, 0.3, 0.1}, construct the Huffman code and calculate the average code length.
Step 1: Combine Z(0.1) and Y(0.3) → Node(0.4)
Step 2: Combine Node(0.4) and X(0.6) → Root(1.0)

Codes: X: 1 (length 1), Y: 01 (length 2), Z: 00 (length 2)

Average length = 0.6×1 + 0.3×2 + 0.1×2 = 1.4 bits/symbol
Entropy H(X) ≈ 1.295 bits/symbol
Efficiency ≈ 92.5%
For a source with 8 equiprobable symbols, what is the Huffman coding efficiency compared to fixed-length coding?
For 8 equiprobable symbols, p = 1/8 = 0.125 each.
Huffman builds a balanced tree: all codes have length 3 (since 2³ = 8).
Fixed-length coding also needs ⌈log₂ 8⌉ = 3 bits/symbol.
Entropy: H(X) = −8 × (1/8) × log₂(1/8) = 3 bits/symbol.
Efficiency = 100%, yet there is no improvement over fixed-length coding: Huffman coding pays off only for skewed distributions.
A sensor transmits 4 types of packets: Data (70%), Control (20%), Ack (8%), and Error (2%). Design a Huffman code for this protocol and calculate the bandwidth savings if the channel capacity is 1000 bits/second.
Probabilities: D: 0.70, C: 0.20, A: 0.08, E: 0.02

Construction:
1. Combine E(0.02) + A(0.08) = Node(0.10)
2. Combine Node(0.10) + C(0.20) = Node(0.30)
3. Combine Node(0.30) + D(0.70) = Root(1.00)

Codes: D: 1 (1 bit), C: 01 (2 bits), A: 001 (3 bits), E: 000 (3 bits)

Average length = 0.7×1 + 0.2×2 + 0.08×3 + 0.02×3 = 1.4 bits/packet
Fixed-length would require ⌈log₂ 4⌉ = 2 bits/packet
Bandwidth savings: (2 − 1.4)/2 × 100% = 30%
Effective throughput: 1000/1.4 ≈ 714 packets/sec versus 1000/2 = 500 packets/sec fixed
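The arithmetic in this solution can be checked in a few lines of Python:

```python
# Packet probabilities and the Huffman codeword lengths derived above.
p = {"D": 0.70, "C": 0.20, "A": 0.08, "E": 0.02}
l = {"D": 1, "C": 2, "A": 3, "E": 3}

avg = sum(p[s] * l[s] for s in p)       # average bits per packet
savings = (2 - avg) / 2                 # relative to 2-bit fixed-length codes
throughput = 1000 / avg                 # packets/sec on a 1000 bit/s channel
print(round(avg, 2), round(savings, 2), round(throughput))
```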
Average Code Length: L = Σ pᵢ × lᵢ
Efficiency: η = H(X)/L × 100%
Redundancy: R = L - H(X) bits/symbol
Variance: σ² = Σ pᵢ(lᵢ - L)²
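All four metrics can be evaluated together; the sketch below uses the probabilities and codeword lengths from the worked example:

```python
import math

# Worked example: probabilities and Huffman codeword lengths.
p = {"A": 0.45, "B": 0.13, "C": 0.12, "D": 0.16, "E": 0.09, "F": 0.05}
l = {"A": 1, "B": 3, "C": 3, "D": 3, "E": 4, "F": 4}

L = sum(p[s] * l[s] for s in p)                  # average code length
H = -sum(q * math.log2(q) for q in p.values())   # entropy H(X)
eta = H / L                                      # efficiency
R = L - H                                        # redundancy (bits/symbol)
var = sum(p[s] * (l[s] - L) ** 2 for s in p)     # variance of code length

print(f"L={L:.2f}  H={H:.2f}  eta={eta:.1%}  R={R:.3f}  var={var:.3f}")
```

Low variance matters in practice: it keeps encoder output rates steady, which simplifies buffering.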