A comprehensive study guide for undergraduate communication engineering students. Explore the mathematical foundations of data transmission, compression, and communication limits.
Information Theory, founded by Claude Shannon in 1948, is the mathematical study of quantification, storage, and communication of information. It provides the fundamental limits on signal processing operations such as compressing data and reliably storing and communicating data.
The theory addresses two primary questions: what is the ultimate limit of lossless data compression, and what is the ultimate rate of reliable communication over a noisy channel? This guide covers three topic areas:
- Information measures: self-information, entropy, joint & conditional entropy
- Source coding: lossless compression, Kraft inequality, Huffman coding
- Channel coding: noisy channels, channel capacity, Shannon's limit
The self-information (or surprisal) of an event measures the amount of information gained when that event occurs. Less probable events carry more information.
$$I(x) = -\log_2 P(x) = \log_2 \frac{1}{P(x)} \text{ bits}$$
Fair coin flip: \(P(H) = 0.5\)
\(I(H) = -\log_2(0.5) = 1\) bit
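The definition above is a one-liner in code. The following is a minimal Python sketch (the function name `self_information` is my own choice, not from the text):

```python
import math

def self_information(p):
    """Self-information I(x) = -log2 P(x) in bits, for an event of probability p."""
    return -math.log2(p)

# Fair coin: P(H) = 0.5 -> 1 bit
print(self_information(0.5))    # 1.0
# A rarer event carries more information: P(x) = 1/8 -> 3 bits
print(self_information(0.125))  # 3.0
```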
Entropy \(H(X)\) is the expected value of self-information. It represents the average uncertainty or information content of a random variable.
$$H(X) = -\sum_{i=1}^{n} P(x_i) \log_2 P(x_i) \text{ bits}$$
Uniform distribution maximizes entropy:
\(H_{max} = \log_2 n\) for \(n\) symbols
Deterministic source (one symbol with \(P=1\)):
\(H_{min} = 0\) bits
Binary entropy function: for \(P(0) = p\), \(P(1) = 1-p\):
\(H_b(p) = -p\log_2 p - (1-p)\log_2(1-p)\)
Physical Interpretation: Entropy represents the minimum average number of bits needed to encode symbols from the source without loss of information (Source Coding Theorem).
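The entropy formula and the binary entropy function can be sketched directly from the definitions above (function names are illustrative):

```python
import math

def entropy(probs):
    """H(X) = -sum p_i log2 p_i in bits; zero-probability terms contribute nothing."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

def binary_entropy(p):
    """H_b(p) for a binary source with P(0) = p, P(1) = 1 - p."""
    return entropy([p, 1 - p])

# Uniform distribution over n = 4 symbols maximizes entropy: H = log2(4) = 2 bits
print(entropy([0.25] * 4))   # 2.0
# Deterministic source (one symbol with P = 1): H = 0 bits
print(entropy([1.0]))        # 0.0
# Binary entropy peaks at p = 0.5
print(binary_entropy(0.5))   # 1.0
```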
$$H(X,Y) = -\sum_{x,y} P(x,y) \log_2 P(x,y)$$
Measures the total uncertainty in the pair of random variables \((X,Y)\).
$$H(Y|X) = \sum_{x} P(x) H(Y|X=x)$$
Measures the remaining uncertainty in \(Y\) given knowledge of \(X\).
$$H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)$$
Mutual Information \(I(X;Y)\) measures the amount of information shared between two random variables. It quantifies the reduction in uncertainty of one variable given the other.
$$I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X,Y)$$
Venn diagram relationships (an informal mnemonic): \(I(X;Y)\) corresponds to the overlap of the \(H(X)\) and \(H(Y)\) regions, while \(H(X,Y)\) corresponds to their union.
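The chain rule and the mutual-information identities can be checked numerically on a small joint distribution. The table values below are a hypothetical example, not from the text:

```python
import math

def H(probs):
    """Shannon entropy in bits of a collection of probabilities."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

# Hypothetical joint distribution P(x, y) over two binary variables
P = {('0', '0'): 0.4, ('0', '1'): 0.1,
     ('1', '0'): 0.1, ('1', '1'): 0.4}

H_XY = H(P.values())          # joint entropy H(X,Y)
H_X = H([0.5, 0.5])           # marginal of X from the table above
H_Y = H([0.5, 0.5])           # marginal of Y

# Mutual information via I(X;Y) = H(X) + H(Y) - H(X,Y)
I_XY = H_X + H_Y - H_XY
# Conditional entropy via the chain rule: H(Y|X) = H(X,Y) - H(X)
H_Y_given_X = H_XY - H_X

print(round(I_XY, 4), round(H_Y_given_X, 4))   # 0.2781 0.7219
```

Note that \(I(X;Y) > 0\) here because the table makes matching values of \(X\) and \(Y\) more likely, so observing one reduces uncertainty about the other.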
For a discrete memoryless source (DMS) with entropy \(H(X)\), the average code length \(L\) satisfies:
$$H(X) \leq L < H(X) + 1$$
This establishes that entropy is the fundamental limit of lossless compression. No uniquely decodable code can achieve average length less than the entropy.
A necessary and sufficient condition for the existence of a uniquely decodable code with codeword lengths \(l_1, l_2, ..., l_n\) over an alphabet of size \(D\):
$$\sum_{i=1}^{n} D^{-l_i} \leq 1$$
Prefix (instantaneous) codes: no codeword is a prefix of another, so decoding is instantaneous. Optimal complete prefix codes, such as Huffman codes, satisfy the Kraft inequality with equality.
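Checking the Kraft inequality for a proposed set of codeword lengths is a one-line computation; a minimal sketch (example length sets are my own):

```python
def kraft_sum(lengths, D=2):
    """Kraft sum over codeword lengths: sum of D^{-l_i}; <= 1 is required for a uniquely decodable code."""
    return sum(D ** -l for l in lengths)

# Lengths {1, 2, 3, 3} fit a binary prefix code (e.g. 0, 10, 110, 111)
print(kraft_sum([1, 2, 3, 3]))   # 1.0 -> satisfied with equality (a complete code)
# Lengths {1, 1, 2} cannot form a uniquely decodable binary code
print(kraft_sum([1, 1, 2]))      # 1.25 -> inequality violated
```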
Code efficiency: $$\eta = \frac{H(X)}{L} \times 100\%$$ Redundancy: \(1 - \eta\)
Huffman coding is an optimal prefix coding algorithm that minimizes the expected code length. It is a greedy algorithm that constructs the code tree bottom-up.
Optimality: Huffman coding is optimal among all prefix codes for a given set of probabilities. However, it assumes known fixed probabilities and integer code lengths.
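The greedy bottom-up construction can be sketched compactly in Python: repeatedly merge the two least probable subtrees, prefixing '0' to one side and '1' to the other. This is one possible implementation, not a canonical one; the example probabilities are dyadic, so the average length exactly meets the entropy:

```python
import heapq
from itertools import count

def huffman_code(probs):
    """Build a binary Huffman code. probs maps symbol -> probability.

    Returns a dict symbol -> codeword string. A tie-breaking counter keeps
    heap entries comparable when probabilities are equal.
    """
    tiebreak = count()
    # Each heap entry: (probability, tiebreak, {symbol: partial codeword})
    heap = [(p, next(tiebreak), {s: ''}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)   # two least probable subtrees
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: '0' + w for s, w in c1.items()}
        merged.update({s: '1' + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, next(tiebreak), merged))
    return heap[0][2]

# Hypothetical source with dyadic probabilities
probs = {'a': 0.5, 'b': 0.25, 'c': 0.125, 'd': 0.125}
code = huffman_code(probs)
L = sum(p * len(code[s]) for s, p in probs.items())
print(code)
print(L)   # 1.75 bits/symbol, equal to H(X) for these probabilities
```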
Coding blocks of \(N\) symbols together can approach entropy more closely:
$$H(X) \leq \frac{L_N}{N} < H(X) + \frac{1}{N}$$
As \(N \to \infty\), the average length per symbol approaches the entropy. This is the basis for arithmetic coding and other modern compression methods.
A DMC is characterized by its input alphabet, output alphabet, and transition probabilities \(P(y|x)\). Two standard examples:
Binary Symmetric Channel (BSC): crossover probability \(p\). Channel capacity:
\(C = 1 - H_b(p)\) bits/channel use
Binary Erasure Channel (BEC): erasure probability \(\alpha\). Channel capacity:
\(C = 1 - \alpha\) bits/channel use
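The capacity formulas for the two standard binary channels (crossover probability \(p\), erasure probability \(\alpha\)) translate directly to code; a minimal Python sketch:

```python
import math

def bsc_capacity(p):
    """Binary symmetric channel: C = 1 - H_b(p) bits/channel use."""
    if p in (0.0, 1.0):
        return 1.0  # H_b(0) = H_b(1) = 0
    return 1 + p * math.log2(p) + (1 - p) * math.log2(1 - p)

def bec_capacity(alpha):
    """Binary erasure channel: C = 1 - alpha bits/channel use."""
    return 1 - alpha

print(bsc_capacity(0.5))    # 0.0 -> a fully noisy BSC carries no information
print(bsc_capacity(0.11))   # ~0.5, since H_b(0.11) is close to 0.5
print(bec_capacity(0.25))   # 0.75
```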
The channel capacity \(C\) is the maximum mutual information between input and output, maximized over all input distributions:
$$C = \max_{P(x)} I(X;Y) \text{ bits/channel use}$$
For a DMC with capacity \(C\), any rate \(R < C\) is achievable with arbitrarily low error probability. Conversely, if \(R > C\), reliable communication is impossible.
For any \(\epsilon > 0\) and \(R < C\), there exists a code of rate \(R\) and sufficiently large block length \(n\) whose probability of decoding error is less than \(\epsilon\).
Any sequence of codes with rate \(R > C\) has error probability bounded away from zero:
\(P_e \geq 1 - \frac{C + 1/n}{R}\)
The theorem is existential (using random coding arguments) but does not provide constructive methods. Modern codes (Turbo codes, LDPC codes) approach this limit.
For the Additive White Gaussian Noise (AWGN) channel with power constraint \(P\) and noise variance \(N\) per dimension (\(N = N_0/2\)):
$$C = \frac{1}{2} \log_2\left(1 + \frac{P}{N}\right) = \frac{1}{2} \log_2(1 + \text{SNR}) \text{ bits/dimension}$$
This is the famous Shannon-Hartley theorem. For bandwidth \(B\), the capacity is:
\(C = B \log_2(1 + \text{SNR})\) bits/second
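The Shannon-Hartley formula is easy to evaluate; note that SNR must be linear, not in dB. A sketch using the classic telephone-line numbers (3 kHz bandwidth, 30 dB SNR):

```python
import math

def shannon_capacity(bandwidth_hz, snr_linear):
    """Shannon-Hartley: C = B log2(1 + SNR) bits/second (SNR as a linear ratio)."""
    return bandwidth_hz * math.log2(1 + snr_linear)

# Telephone-line example: B = 3 kHz, SNR = 30 dB = 10^(30/10) = 1000 (linear)
snr = 10 ** (30 / 10)
print(shannon_capacity(3000, snr))   # ~29.9 kbit/s
```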
Key takeaways:
- Entropy \(H(X)\): average information content and fundamental limit of lossless compression
- Mutual information \(I(X;Y)\): information shared between variables; reduction in uncertainty
- Channel capacity \(C\) (bits/channel use): maximum reliable transmission rate over a noisy channel
- Channel coding theorem: reliable communication is possible if and only if the rate is below capacity
This study guide covers fundamental concepts of Information Theory. For further study, refer to:
Cover & Thomas, "Elements of Information Theory" (Wiley, 2006)