Fisher Information
Fisher information (named after Ronald Fisher, who came up with ANOVA and MLE) measures the amount of information that an observed random variable $X$ carries about an unknown parameter $\theta$.
Let $p(X\mid \theta)$ be the likelihood. Then the log-likelihood is $$\Mr{LL}(\theta) = \log p(X\mid \theta)$$ We define the score as $$\Mr{score}(\theta) = \fracp{}{\theta} \log p(X\mid \theta)$$ Under regularity conditions (which allow exchanging differentiation and integration), the first moment of the score is 0: $$\begin{aligned} \exx{X}{\Mr{score}(\theta)\mid\theta} &= \int \nail{\fracp{}{\theta} \log p(x\mid \theta)} p(x\mid\theta)\,dx \\ &= \int \nail{\frac{1}{p(x\mid \theta)} \fracp{}{\theta}p(x\mid \theta)} p(x\mid\theta)\,dx \\ &= \int \fracp{}{\theta}p(x\mid \theta)\,dx \\ &= \fracp{}{\theta} \int p(x\mid\theta)\, dx = \fracp{}{\theta} 1 = 0 \end{aligned}$$
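As a quick sanity check (a standard textbook example, added here for illustration), let $X$ be Bernoulli$(p)$. Then $$\log p(x\mid p) = x\log p + (1-x)\log(1-p), \qquad \Mr{score}(p) = \frac{x}{p} - \frac{1-x}{1-p},$$ and indeed $$\exx{X}{\Mr{score}(p)\mid p} = p\cdot\frac{1}{p} - (1-p)\cdot\frac{1}{1-p} = 0$$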
The Fisher information is the second moment of the score: $$I(\theta) = \exx{X}{\nail{\fracp{}{\theta}\log p(X\mid \theta)}^2\midd\theta}$$ Under regularity conditions ($p(X\mid \theta)$ is twice differentiable in $\theta$), differentiating the zero-mean identity above once more under the integral sign gives $$I(\theta) = -\exx{X}{\fracp{^2}{\theta^2}\log p(X\mid \theta)\midd \theta}$$
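For the same Bernoulli example, $$\exx{X}{\Mr{score}(p)^2\mid p} = p\cdot\frac{1}{p^2} + (1-p)\cdot\frac{1}{(1-p)^2} = \frac{1}{p(1-p)} = I(p)$$ Below is a minimal numerical sketch (Python with NumPy; the parameter value, seed, and sample size are arbitrary choices for illustration) that checks both expressions for $I(p)$ against this closed form:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.3                                    # true parameter (arbitrary choice)
x = rng.binomial(1, p, size=1_000_000)     # samples of X ~ Bernoulli(p)

# score(p) = d/dp log p(X | p) for the Bernoulli likelihood
score = x / p - (1 - x) / (1 - p)

# second derivative of the Bernoulli log-likelihood in p
d2 = -x / p**2 - (1 - x) / (1 - p)**2

print(score.mean())        # ~ 0, the first moment of the score
print((score**2).mean())   # ~ 1/(p(1-p)), Fisher information as second moment
print(-d2.mean())          # ~ 1/(p(1-p)), Fisher information as curvature
print(1 / (p * (1 - p)))   # exact value, ~ 4.76 for p = 0.3
```

Both Monte Carlo estimates agree with the exact value up to sampling error.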
Intuition
Fisher information measures the curvature of the log-likelihood. If we plot the log-likelihood, high curvature = sharply peaked log-likelihood (equivalently, a deep, narrow valley of the negative log-likelihood) = easy to pin down the optimal $\theta$ = we got a lot of information about $\theta$ from $X$ = high Fisher information.
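A standard closed-form example (added for illustration): for $X\sim\mathcal{N}(\theta, \sigma^2)$ with known $\sigma^2$, $$\log p(x\mid\theta) = -\frac{(x-\theta)^2}{2\sigma^2} + \Mr{const}, \qquad \fracp{^2}{\theta^2}\log p(x\mid\theta) = -\frac{1}{\sigma^2}, \qquad I(\theta) = \frac{1}{\sigma^2}$$ A smaller $\sigma^2$ means a more sharply curved log-likelihood around its peak, and hence more information about $\theta$ per observation.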
Properties
- If $X$ and $Y$ are independent, then $I_{X,Y}(\theta) = I_X(\theta) + I_Y(\theta)$
- If $T(X)$ is a sufficient statistic for $\theta$ (i.e., $$p(X=x\mid T(X)=t,\theta) = p(X=x\mid T(X)=t)$$ is independent of $\theta$), then $I_T(\theta) = I_X(\theta)$
- For any other statistic $T(X)$, we get $I_T(\theta) \leq I_X(\theta)$
- Theorem (Cramér-Rao Bound) For any unbiased estimator $\hat\theta$ of $\theta$, $$\var{\hat\theta} \geq \frac{1}{I(\theta)}$$ This makes sense: less information makes it harder to pinpoint $\theta$, so the variance of any unbiased estimator increases. (A worked Bernoulli example follows this list.)
- Definition (Jeffreys Prior) The Jeffreys prior is defined as $$p_\theta(\theta) \propto \sqrt{I(\theta)}$$ The Jeffreys prior is an uninformative prior that is invariant to parameterization; i.e., for any reparameterization $\phi = h(\theta)$, transforming the prior by the change-of-variables formula, $$p_\phi(\phi) = p_\theta(\theta)\abs{\fracd{\theta}{\phi}},$$ gives exactly the Jeffreys prior computed directly from $I(\phi)$.
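Continuing the Bernoulli example from above, where $I(p) = \frac{1}{p(1-p)}$ (a standard example, added for illustration): for $n$ i.i.d. observations, additivity gives $I_n(p) = \frac{n}{p(1-p)}$, and the sample mean $\hat p = \frac{1}{n}\sum_{i} X_i$ satisfies $$\var{\hat p} = \frac{p(1-p)}{n} = \frac{1}{I_n(p)},$$ so it attains the Cramér-Rao bound. The Jeffreys prior is $$p_p(p) \propto \sqrt{I(p)} = p^{-1/2}(1-p)^{-1/2},$$ which is the $\Mr{Beta}\nail{\tfrac{1}{2}, \tfrac{1}{2}}$ distribution.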
Fisher Information Matrix
Suppose we have $N$ parameters $\theta = (\theta_1, \dots, \theta_N)$. The Fisher information becomes an $N\times N$ matrix $I(\theta)$ with entries $$I(\theta)_{ij} = \exx{X}{\nail{\fracp{}{\theta_i} \log p(X\mid \theta)}\nail{\fracp{}{\theta_j} \log p(X\mid \theta)}\midd \theta}$$ Note that $I(\theta) \succeq 0$.
Again, under regularity conditions, we also get $$I(\theta)_{ij} = -\exx{X}{\frac{\partial^2}{\partial \theta_i \partial \theta_j} \log p(X\mid \theta)\midd \theta}$$
If $I(\theta)_{ij} = 0$, we say that $\theta_i$ and $\theta_j$ are orthogonal parameters, and their maximum likelihood estimates are asymptotically independent.
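As a numerical sketch (Python with NumPy again; parameter values, seed, and sample size are arbitrary choices), the Monte Carlo estimate of the Fisher information matrix for $X\sim\mathcal{N}(\mu, \sigma^2)$ with both the mean and the variance unknown comes out approximately diagonal, so $\mu$ and $\sigma^2$ are orthogonal parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, var = 1.0, 2.0                           # true parameters (arbitrary choice)
x = rng.normal(mu, np.sqrt(var), size=1_000_000)

# gradient of log N(x | mu, var) with respect to (mu, var)
d_mu = (x - mu) / var
d_var = -0.5 / var + (x - mu) ** 2 / (2 * var**2)
scores = np.stack([d_mu, d_var], axis=1)     # shape (n, 2)

# Monte Carlo estimate of E[score score^T], i.e., the Fisher information matrix
I_hat = scores.T @ scores / len(x)
print(I_hat)

# exact matrix: diag(1/var, 1/(2 var^2)); zero off-diagonal = orthogonal parameters
print(np.array([[1 / var, 0.0], [0.0, 1 / (2 * var**2)]]))
```

The off-diagonal entries of the estimate are close to zero, matching the exact matrix.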