Introduction to Information Theory

A comprehensive study guide for undergraduate communication engineering students. Explore the mathematical foundations of data transmission, compression, and communication limits.


1. Overview of Information Theory

Information Theory, founded by Claude Shannon in 1948, is the mathematical study of quantification, storage, and communication of information. It provides the fundamental limits on signal processing operations such as compressing data and reliably storing and communicating data.

The theory addresses two primary questions:

  • Source Coding: What is the ultimate data compression limit? (Answer: Entropy)
  • Channel Coding: What is the ultimate transmission rate? (Answer: Channel Capacity)

Key Historical Milestones

  • 1924: Nyquist's work on telegraph transmission speed
  • 1928: Hartley defines information measure
  • 1948: Shannon's "A Mathematical Theory of Communication"
  • 1960s: Development of error-correcting codes

Fundamental Concepts Map

  • Measure of Information: Self-information, Entropy, Joint & Conditional Entropy
  • Source Coding: Lossless compression, Kraft inequality, Huffman coding
  • Channel Coding: Noisy channels, Channel capacity, Shannon's limit

2. Entropy and Information Measures

2.1 Self-Information

The self-information (or surprisal) of an event measures the amount of information gained when that event occurs. Less probable events carry more information.

$$I(x) = -\log_2 P(x) = \log_2 \frac{1}{P(x)} \text{ bits}$$

Properties:
  • \(I(x) \geq 0\) (non-negative)
  • \(I(x) = 0\) if \(P(x) = 1\) (certain events)
  • Additive for independent events

Example:

Fair coin flip: \(P(H) = 0.5\)
\(I(H) = -\log_2(0.5) = 1\) bit
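The coin-flip example above can be checked with a short sketch (function name is our own, not from the text):

```python
import math

def self_information(p: float) -> float:
    """Self-information I(x) = -log2 P(x) in bits."""
    if not 0 < p <= 1:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(p)

# Fair coin flip: exactly 1 bit per outcome.
print(self_information(0.5))    # → 1.0
# A rarer event carries more information.
print(self_information(0.125))  # → 3.0
```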

2.2 Shannon Entropy

Entropy \(H(X)\) is the expected value of self-information. It represents the average uncertainty or information content of a random variable.

$$H(X) = -\sum_{i=1}^{n} P(x_i) \log_2 P(x_i) \text{ bits}$$

Maximum Entropy

Uniform distribution maximizes entropy:
\(H_{max} = \log_2 n\) for \(n\) symbols

Minimum Entropy

Deterministic source (one symbol with \(P=1\)):
\(H_{min} = 0\) bits

Binary Entropy

For \(P(0) = p\), \(P(1) = 1-p\):
\(H_b(p) = -p\log_2 p - (1-p)\log_2(1-p)\)

Physical Interpretation: Entropy represents the minimum average number of bits needed to encode symbols from the source without loss of information (Source Coding Theorem).
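A minimal entropy routine illustrates the maximum, minimum, and binary-entropy cases above (the convention \(0 \log 0 = 0\) is applied by skipping zero-probability terms):

```python
import math

def entropy(probs):
    """Shannon entropy H(X) in bits; terms with p = 0 contribute 0."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

# Uniform distribution over 4 symbols: maximum entropy log2(4) = 2 bits.
print(entropy([0.25] * 4))   # → 2.0
# Deterministic source: minimum entropy, 0 bits.
print(entropy([1.0]))        # → 0.0
# Binary entropy at p = 0.5 is maximal: 1 bit.
print(entropy([0.5, 0.5]))   # → 1.0
```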

2.3 Joint and Conditional Entropy

Joint Entropy \(H(X,Y)\)

$$H(X,Y) = -\sum_{x,y} P(x,y) \log_2 P(x,y)$$

Measures the total uncertainty in the pair of random variables \((X,Y)\).

Conditional Entropy \(H(Y|X)\)

$$H(Y|X) = \sum_{x} P(x) H(Y|X=x)$$

Measures the remaining uncertainty in \(Y\) given knowledge of \(X\).

Chain Rule

$$H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)$$
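The chain rule can be verified numerically on a small joint distribution (the probability values below are a hypothetical example, not from the text):

```python
import math

def H(probs):
    """Entropy in bits of a collection of probabilities."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

# A joint distribution P(x, y) — example values.
P = {('x0', 'y0'): 0.5, ('x0', 'y1'): 0.25,
     ('x1', 'y0'): 0.125, ('x1', 'y1'): 0.125}

# Marginals P(x) and P(y)
Px, Py = {}, {}
for (x, y), p in P.items():
    Px[x] = Px.get(x, 0.0) + p
    Py[y] = Py.get(y, 0.0) + p

# H(Y|X) directly from the definition: sum_x P(x) * H(Y | X = x)
H_Y_given_X = sum(px * H([P[(x, y)] / px for y in Py]) for x, px in Px.items())

# Chain rule: H(X,Y) = H(X) + H(Y|X) — both sides come out equal (1.75 bits here).
print(H(P.values()), H(Px.values()) + H_Y_given_X)
```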

2.4 Mutual Information

Mutual Information \(I(X;Y)\) measures the amount of information shared between two random variables. It quantifies the reduction in uncertainty of one variable given the other.

$$I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X,Y)$$

Properties

  • \(I(X;Y) \geq 0\) (non-negative)
  • \(I(X;Y) = I(Y;X)\) (symmetric)
  • \(I(X;X) = H(X)\) (self-information)
  • \(I(X;Y) = 0\) if and only if \(X\) and \(Y\) are independent
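As a worked example, the identity \(I(X;Y) = H(X) + H(Y) - H(X,Y)\) can be evaluated for a binary symmetric channel with uniform input and crossover probability 0.1 (a choice of ours; the result matches the BSC capacity formula given later):

```python
import math

def H(probs):
    """Entropy in bits."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

# Joint distribution of a BSC with uniform input and crossover p = 0.1.
p = 0.1
P_joint = [0.5 * (1 - p), 0.5 * p, 0.5 * p, 0.5 * (1 - p)]
Px = [0.5, 0.5]
Py = [0.5, 0.5]   # symmetric channel + uniform input → uniform output

# I(X;Y) = H(X) + H(Y) - H(X,Y)
I_xy = H(Px) + H(Py) - H(P_joint)
print(round(I_xy, 4))  # ≈ 0.531 bits, equal to 1 - H_b(0.1)
```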

Relationship Diagram

In the Venn-diagram picture, the two entropy circles \(H(X)\) and \(H(Y)\) overlap in \(I(X;Y)\), and their union is \(H(X,Y)\):

I(X;Y) = H(X) ∩ H(Y)
H(X,Y) = H(X) ∪ H(Y)

3. Source Coding Theory

3.1 Source Coding Theorem (Shannon's First Theorem)

For a discrete memoryless source (DMS) with entropy \(H(X)\), the average code length \(L\) satisfies:

$$H(X) \leq L < H(X) + 1$$

This establishes that entropy is the fundamental limit of lossless compression. No uniquely decodable code can achieve average length less than the entropy.

3.2 Kraft-McMillan Inequality

A necessary and sufficient condition for the existence of a uniquely decodable code with codeword lengths \(l_1, l_2, ..., l_n\) over an alphabet of size \(D\):

$$\sum_{i=1}^{n} D^{-l_i} \leq 1$$

Prefix Codes

No codeword is a prefix of another, so prefix codes can be decoded instantaneously, symbol by symbol. Every prefix code satisfies the Kraft inequality; a complete (optimal) prefix code meets it with equality.

Code Efficiency

$$\eta = \frac{H(X)}{L} \times 100\%$$ Redundancy: \(1 - \eta\)
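A quick sketch ties the Kraft sum and the efficiency formula together, using codeword lengths matched to dyadic probabilities (an example of ours where the efficiency comes out to exactly 100%):

```python
import math

def kraft_sum(lengths, D=2):
    """Kraft-McMillan sum; a uniquely decodable code requires this <= 1."""
    return sum(D ** (-l) for l in lengths)

# Codeword lengths of a 4-symbol prefix code (example values).
lengths = [1, 2, 3, 3]
print(kraft_sum(lengths))  # → 1.0: the inequality holds with equality (complete code)

# Efficiency for probabilities matched to those lengths: l_i = -log2 P(x_i).
probs = [0.5, 0.25, 0.125, 0.125]
H = sum(-p * math.log2(p) for p in probs)      # entropy, 1.75 bits
L = sum(p * l for p, l in zip(probs, lengths)) # average length, 1.75 bits
print(H / L)  # → 1.0, i.e. 100% efficiency, zero redundancy
```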

3.3 Huffman Coding Algorithm

Huffman coding is an optimal prefix coding algorithm that minimizes the expected code length. It is a greedy algorithm that constructs the code tree bottom-up.

Algorithm Steps:

  1. Sort symbols by probability (descending)
  2. Merge the two lowest-probability nodes into a new node whose probability is their sum
  3. Assign 0 and 1 to the two branches
  4. Repeat steps 2-3 until a single root node remains
  5. Read each codeword by tracing the path from root to leaf

Optimality: Huffman coding is optimal among all prefix codes for a given set of probabilities. However, it assumes known fixed probabilities and integer code lengths.
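The steps above can be sketched with a heap-based implementation (the symbol probabilities are example values of ours; ties are broken by insertion order, so other valid Huffman codes exist with the same lengths):

```python
import heapq
from itertools import count

def huffman_code(probs):
    """Build a Huffman code. probs: symbol -> probability.
    Returns symbol -> binary codeword string."""
    tiebreak = count()  # keeps heap comparisons away from unorderable trees
    heap = [(p, next(tiebreak), sym) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)    # two least probable nodes...
        p2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next(tiebreak), (left, right)))  # ...merged
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):          # internal node: descend both branches
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix or "0"      # single-symbol edge case
    walk(heap[0][2], "")
    return codes

probs = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}
codes = huffman_code(probs)
L = sum(probs[s] * len(codes[s]) for s in probs)
print(codes, L)  # average length 1.9 bits, vs. entropy ≈ 1.846 bits
```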

3.4 Extended Sources & Block Coding

Coding blocks of \(N\) symbols together can approach entropy more closely:

$$H(X) \leq \frac{L_N}{N} < H(X) + \frac{1}{N}$$

As \(N \to \infty\), the average length per symbol approaches the entropy. This is the basis for arithmetic coding and other modern compression methods.
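The gain from block coding can be demonstrated for a skewed binary source (example of ours, \(P = [0.9, 0.1]\)); the sketch uses the standard trick that the expected Huffman length equals the sum of all merge weights:

```python
import heapq
import math
from itertools import product

def huffman_avg_length(probs):
    """Expected length of an optimal (Huffman) code.
    Each merge adds one bit to every leaf beneath it, so the
    expected length is the sum of the merged-node weights."""
    heap = list(probs)
    heapq.heapify(heap)
    total = 0.0
    while len(heap) > 1:
        a, b = heapq.heappop(heap), heapq.heappop(heap)
        total += a + b
        heapq.heappush(heap, a + b)
    return total

p = [0.9, 0.1]
H = sum(-q * math.log2(q) for q in p)           # entropy ≈ 0.469 bits/symbol
L1 = huffman_avg_length(p)                      # 1.0 bit/symbol, coding singly
pairs = [a * b for a, b in product(p, repeat=2)]
L2 = huffman_avg_length(pairs) / 2              # 0.645 bits/symbol, coding pairs
print(H, L1, L2)  # pair coding already moves the rate well toward the entropy
```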

4. Channel Capacity and Coding

4.1 Discrete Memoryless Channels (DMC)

A DMC is characterized by:

  • Input alphabet \(\mathcal{X}\) and output alphabet \(\mathcal{Y}\)
  • Transition probabilities \(P(y|x)\) (channel matrix)
  • Memoryless property: Output depends only on current input

Binary Symmetric Channel (BSC)

Crossover probability \(p\). Channel capacity:

\(C = 1 - H_b(p)\) bits/channel use

Binary Erasure Channel (BEC)

Erasure probability \(\alpha\). Channel capacity:

\(C = 1 - \alpha\) bits/channel use
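The two capacity formulas above are direct to evaluate (function names are our own):

```python
import math

def Hb(p):
    """Binary entropy function H_b(p) in bits."""
    return sum(-q * math.log2(q) for q in (p, 1 - p) if q > 0)

def bsc_capacity(p):
    """Capacity of a binary symmetric channel with crossover probability p."""
    return 1 - Hb(p)

def bec_capacity(alpha):
    """Capacity of a binary erasure channel with erasure probability alpha."""
    return 1 - alpha

print(bsc_capacity(0.0))            # → 1.0 (noiseless channel)
print(bsc_capacity(0.5))            # → 0.0 (output independent of input)
print(round(bsc_capacity(0.1), 3))  # → 0.531
print(bec_capacity(0.25))           # → 0.75
```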

4.2 Channel Capacity

The channel capacity \(C\) is the maximum mutual information between input and output, maximized over all input distributions:

$$C = \max_{P(x)} I(X;Y) \text{ bits/channel use}$$

Properties:

  • \(C \geq 0\) (non-negative)
  • \(C \leq \min(\log_2|\mathcal{X}|, \log_2|\mathcal{Y}|)\)
  • \(I(X;Y)\) is concave in \(P(x)\) for fixed \(P(y|x)\)

4.3 Channel Coding Theorem (Shannon's Second Theorem)

For a DMC with capacity \(C\), any rate \(R < C\) is achievable with arbitrarily low error probability. Conversely, if \(R > C\), reliable communication is impossible.

Achievable Rates

For any \(\epsilon > 0\) and \(R < C\), there exists a code with:

  • Rate \(\geq R\)
  • Block length \(n\)
  • Error probability \(< \epsilon\)

Converse

Any sequence of codes with rate \(R > C\) has error probability bounded away from zero:

\(P_e \geq 1 - \frac{C + 1/n}{R}\)

Practical Implications

The theorem is existential (using random coding arguments) but does not provide constructive methods. Modern codes (Turbo codes, LDPC codes) approach this limit.

4.4 AWGN Channel Capacity

For the Additive White Gaussian Noise (AWGN) channel with power constraint \(P\) and noise variance \(N\):

$$C = \frac{1}{2} \log_2\left(1 + \frac{P}{N}\right) = \frac{1}{2} \log_2(1 + \text{SNR}) \text{ bits/dimension}$$

This is the famous Shannon-Hartley theorem. For bandwidth \(B\), the capacity is:

\(C = B \log_2(1 + \text{SNR})\) bits/second
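As a worked Shannon-Hartley example (the channel parameters are illustrative values of ours, roughly telephone-grade):

```python
import math

def shannon_hartley(bandwidth_hz, snr_linear):
    """Shannon-Hartley capacity C = B log2(1 + SNR) in bits/second."""
    return bandwidth_hz * math.log2(1 + snr_linear)

# Example: B = 3000 Hz, SNR = 30 dB.
snr_db = 30
snr = 10 ** (snr_db / 10)        # 30 dB → linear factor of 1000
C = shannon_hartley(3000, snr)
print(round(C))                  # ≈ 29902 bits/second
```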


Key Takeaways

  • \(H(X)\) — Entropy: Average information content and fundamental limit of lossless compression
  • \(I(X;Y)\) — Mutual Information: Information shared between variables; reduction in uncertainty
  • \(C\) — Channel Capacity: Maximum reliable transmission rate over a noisy channel
  • \(R < C\) — Shannon Limit: Reliable communication possible if and only if rate is below capacity

This study guide covers fundamental concepts of Information Theory. For further study, refer to:
Cover & Thomas, "Elements of Information Theory" (Wiley, 2006)