When we estimate parameters by Maximum Likelihood Estimation (MLE),
we’re actually minimizing the Kullback–Leibler (KL) divergence between the true data distribution and our model’s distribution.
Setup
Let the true data distribution be $p_\text{data}(x)$ (unknown),
and our model distribution be $p_\theta(x)$ with parameters $\theta$.
Given i.i.d. samples $x_1, \dots, x_N \sim p_\text{data}$, MLE seeks
\[\theta^* = \arg\max_\theta \; \frac{1}{N} \sum_{i=1}^N \log p_\theta(x_i),\]
and as $N \to \infty$ this sample average converges to the expected log-likelihood, so in the population limit MLE solves
\[\theta^* = \arg\max_\theta \; \mathbb{E}_{x \sim p_\text{data}} [ \log p_\theta(x) ].\]
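As a quick numerical sketch (the unit-variance Gaussian family, the sample size, and the parameter grid are illustrative assumptions, not from the text), maximizing the sample average of $\log p_\theta(x)$ recovers the familiar MLE:

```python
import numpy as np

# Hypothetical example: fit the mean of a unit-variance Gaussian by MLE.
rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=10_000)  # samples from p_data

def avg_log_likelihood(mu, x):
    # Sample average of log p_theta(x) for the model N(mu, 1).
    return np.mean(-0.5 * np.log(2 * np.pi) - 0.5 * (x - mu) ** 2)

# Grid search over the parameter: the maximizer lands near the true mean.
grid = np.linspace(0.0, 4.0, 401)
mu_hat = grid[np.argmax([avg_log_likelihood(mu, data) for mu in grid])]
print(mu_hat)  # ~2.0; the closed-form MLE is simply data.mean()
```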
Link to KL Divergence
The KL divergence from $p_\text{data}$ to $p_\theta$ is:
\[D_{\mathrm{KL}}(p_\text{data} \Vert p_\theta) = \mathbb{E}_{x \sim p_\text{data}} \left[ \log \frac{p_\text{data}(x)}{p_\theta(x)} \right].\]
We can expand this as:
\[D_{\mathrm{KL}}(p_\text{data} \Vert p_\theta) = \mathbb{E}_{p_\text{data}}[\log p_\text{data}(x)] - \mathbb{E}_{p_\text{data}}[\log p_\theta(x)].\]
The first term is the negative entropy of $p_\text{data}$ and doesn’t depend on $\theta$,
so minimizing $D_{\mathrm{KL}}$ is equivalent to maximizing the expected log-likelihood!
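Here is a minimal numerical check of that decomposition, with made-up discrete distributions: the KL divergence and the negative expected log-likelihood differ only by $H(p_\text{data})$, the entropy of the data distribution, which is constant in $\theta$.

```python
import numpy as np

p_data = np.array([0.5, 0.3, 0.2])   # "true" distribution (illustrative)
p_theta = np.array([0.4, 0.4, 0.2])  # one candidate model distribution

kl = np.sum(p_data * np.log(p_data / p_theta))   # D_KL(p_data || p_theta)
neg_ell = -np.sum(p_data * np.log(p_theta))      # -E_{p_data}[log p_theta(x)]
entropy = -np.sum(p_data * np.log(p_data))       # H(p_data), independent of theta

print(np.isclose(kl, neg_ell - entropy))  # True: KL = -E[log p_theta] - H(p_data)
```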
Interpretation
- MLE: Find the model that makes the observed data most probable.
- KL minimization: Find the model closest to the true distribution, where closeness is measured by KL divergence (an information-theoretic measure, though not a true metric since it is asymmetric).
They are the same optimization, viewed through two lenses:
MLE ≡ minimize $D_{\mathrm{KL}}(p_\text{data} \Vert p_\theta)$
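As a toy sanity check (the one-parameter Bernoulli family below is an assumed example, not from the text), sweeping a grid of $\theta$ values shows the maximizer of the expected log-likelihood and the minimizer of the KL divergence coincide:

```python
import numpy as np

# Toy check: Bernoulli p_data vs. Bernoulli model p_theta (illustrative choice).
p = 0.7                               # true success probability
thetas = np.linspace(0.01, 0.99, 99)  # candidate model parameters

expected_ll = p * np.log(thetas) + (1 - p) * np.log(1 - thetas)
kl = p * np.log(p / thetas) + (1 - p) * np.log((1 - p) / (1 - thetas))

# Same optimizer, viewed through two lenses.
print(thetas[np.argmax(expected_ll)], thetas[np.argmin(kl)])  # both ~0.7
```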