Fisher Information

Fisher information (named after Ronald Fisher, who came up with ANOVA and MLE) measures the amount of information that an observable random variable $X$ carries about an unknown parameter $\theta$ of its distribution.

Let $p(X\mid\theta)$ be the likelihood. Then the log-likelihood is
$$\mathrm{LL}(\theta) = \log p(X\mid\theta).$$
We define the score as
$$\mathrm{score}(\theta) = \frac{\partial}{\partial\theta} \log p(X\mid\theta).$$
Under regularity conditions (which allow swapping the derivative and the integral), the first moment of the score is 0:
$$\begin{aligned} \mathbb{E}_X\!\left[\mathrm{score}(\theta)\mid\theta\right] &= \int \left(\frac{\partial}{\partial\theta} \log p(x\mid\theta)\right) p(x\mid\theta)\,dx \\ &= \int \left(\frac{1}{p(x\mid\theta)} \frac{\partial}{\partial\theta} p(x\mid\theta)\right) p(x\mid\theta)\,dx \\ &= \frac{\partial}{\partial\theta} \int p(x\mid\theta)\,dx = \frac{\partial}{\partial\theta} 1 = 0. \end{aligned}$$
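As a quick concrete check (a standard example added here, not part of the original notes), take $X \sim \mathrm{Bernoulli}(\theta)$, so $p(x\mid\theta) = \theta^x (1-\theta)^{1-x}$ for $x \in \{0, 1\}$:
$$\log p(x\mid\theta) = x\log\theta + (1-x)\log(1-\theta) \qquad\Longrightarrow\qquad \mathrm{score}(\theta) = \frac{x}{\theta} - \frac{1-x}{1-\theta}.$$
Taking the expectation with $\mathbb{E}[X] = \theta$ gives $\mathbb{E}[\mathrm{score}(\theta)\mid\theta] = \frac{\theta}{\theta} - \frac{1-\theta}{1-\theta} = 0$, as claimed.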

The Fisher information is the second moment of the score:
$$I(\theta) = \mathbb{E}_X\!\left[\left(\frac{\partial}{\partial\theta} \log p(X\mid\theta)\right)^{\!2} \;\middle|\; \theta\right].$$
Under regularity conditions ($p(X\mid\theta)$ is twice differentiable in $\theta$), differentiating the identity $\mathbb{E}_X[\mathrm{score}(\theta)\mid\theta] = 0$ once more with respect to $\theta$ gives the equivalent form
$$I(\theta) = -\mathbb{E}_X\!\left[\frac{\partial^2}{\partial\theta^2} \log p(X\mid\theta) \;\middle|\; \theta\right].$$
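Continuing the Bernoulli example above (again my addition), both forms give the same answer:
$$I(\theta) = \mathbb{E}\!\left[\left(\frac{X}{\theta} - \frac{1-X}{1-\theta}\right)^{\!2}\right] = \frac{\theta}{\theta^2} + \frac{1-\theta}{(1-\theta)^2} = \frac{1}{\theta(1-\theta)},$$
$$-\mathbb{E}\!\left[\frac{\partial^2}{\partial\theta^2} \log p(X\mid\theta)\right] = \mathbb{E}\!\left[\frac{X}{\theta^2} + \frac{1-X}{(1-\theta)^2}\right] = \frac{1}{\theta} + \frac{1}{1-\theta} = \frac{1}{\theta(1-\theta)}.$$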

Intuition

Fisher information measures the curvature of the log-likelihood. If we plot the log-likelihood as a function of $\theta$: high curvature = a sharp peak (a deep valley in the negative log-likelihood) = easy to pin down the optimal $\theta$ = we get a lot of information about $\theta$ from $X$ = high Fisher information.
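A small numerical illustration of this (my own sketch, not from the original notes): for $X \sim \mathcal{N}(\theta, \sigma^2)$ the Fisher information is $1/\sigma^2$ per observation, so a smaller $\sigma$ gives a more sharply curved average log-likelihood around the true $\theta$.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 2.0

def avg_log_lik(theta, x, sigma):
    """Average Gaussian log-likelihood of the sample x at parameter theta."""
    return np.mean(-0.5 * np.log(2 * np.pi * sigma**2)
                   - (x - theta) ** 2 / (2 * sigma**2))

for sigma in (0.5, 2.0):
    x = rng.normal(theta_true, sigma, size=100_000)
    h = 1e-3  # step size for a finite-difference second derivative
    curvature = (avg_log_lik(theta_true + h, x, sigma)
                 - 2 * avg_log_lik(theta_true, x, sigma)
                 + avg_log_lik(theta_true - h, x, sigma)) / h**2
    print(f"sigma={sigma}: -curvature = {-curvature:.3f}, 1/sigma^2 = {1 / sigma**2:.3f}")
```

The negative curvature matches $1/\sigma^2$ (here exactly, up to floating-point error, because the Gaussian log-likelihood is quadratic in $\theta$).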

Properties

  • If $X$ and $Y$ are independent, then $I_{X,Y}(\theta) = I_X(\theta) + I_Y(\theta)$.
  • If $T(X)$ is a sufficient statistic for $\theta$ (i.e., $p(X=x\mid T(X)=t,\theta) = p(X=x\mid T(X)=t)$ is independent of $\theta$), then $I_T(\theta) = I_X(\theta)$.
  • For any other statistic $T(X)$, we get $I_T(\theta) \leq I_X(\theta)$.
  • Theorem (Cramér-Rao Bound) For any unbiased estimator $\hat\theta$ of $\theta$, $$\operatorname{Var}[\hat\theta] \geq \frac{1}{I(\theta)}.$$ This makes sense: less information means it is harder to pinpoint $\theta$, so the variance of any unbiased estimator goes up. (See the worked example after this list.)
  • Definition (Jeffreys Prior) The Jeffreys prior is defined as $$p_\theta(\theta) \propto \sqrt{I(\theta)}.$$ The Jeffreys prior is an uninformative prior that is invariant to reparameterization: for any reparameterization $\phi = h(\theta)$, the transformed density $p_\phi(\phi) = p_\theta(\theta)\left|\frac{d\theta}{d\phi}\right|$ is again proportional to $\sqrt{I(\phi)}$, i.e., it is still the Jeffreys prior in the new parameterization.
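Worked example (a standard textbook case, added by me and not in the original notes): for $X \sim \mathrm{Bernoulli}(\theta)$ we found $I(\theta) = \frac{1}{\theta(1-\theta)}$, so
$$\operatorname{Var}[\hat\theta] \geq \theta(1-\theta) \qquad\text{and}\qquad p_\theta(\theta) \propto \theta^{-1/2}(1-\theta)^{-1/2}.$$
The bound is attained by the unbiased estimator $\hat\theta = X$, whose variance is exactly $\theta(1-\theta)$, and the Jeffreys prior for a Bernoulli parameter is the $\mathrm{Beta}(\tfrac12, \tfrac12)$ distribution.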

Fisher Information Matrix

Suppose we have $N$ parameters $\theta = (\theta_1, \dots, \theta_N)$. The Fisher information becomes an $N\times N$ matrix $I(\theta)$ with entries
$$I(\theta)_{ij} = \mathbb{E}_X\!\left[\left(\frac{\partial}{\partial\theta_i} \log p(X\mid\theta)\right)\left(\frac{\partial}{\partial\theta_j} \log p(X\mid\theta)\right) \;\middle|\; \theta\right].$$
Note that $I(\theta) \succeq 0$: it is the covariance matrix of the score, which has mean zero.

Again, under regularity conditions, we also get
$$I(\theta)_{ij} = -\mathbb{E}_X\!\left[\frac{\partial^2}{\partial\theta_i\,\partial\theta_j} \log p(X\mid\theta) \;\middle|\; \theta\right].$$
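A minimal Monte Carlo sketch of these definitions (my own, with the Gaussian scores written out by hand; not from the original notes): estimate $I(\mu, \sigma)$ for $X \sim \mathcal{N}(\mu, \sigma^2)$ as the expected outer product of the score and compare with the known closed form $\mathrm{diag}(1/\sigma^2,\, 2/\sigma^2)$.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.0, 2.0
x = rng.normal(mu, sigma, size=1_000_000)

# Score of one observation: gradient of log N(x | mu, sigma^2) in (mu, sigma).
score = np.stack([(x - mu) / sigma**2,
                  -1.0 / sigma + (x - mu) ** 2 / sigma**3], axis=1)

fim_mc = score.T @ score / len(x)                  # Monte Carlo estimate of E[s s^T]
fim_exact = np.diag([1 / sigma**2, 2 / sigma**2])  # known closed form
print(np.round(fim_mc, 4))
print(fim_exact)
```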

If $I(\theta)_{ij} = 0$, we say that $\theta_i$ and $\theta_j$ are orthogonal parameters, and their MLEs are asymptotically independent.
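For instance (a standard fact, my addition): in the Gaussian sketch above the off-diagonal entries vanish,
$$I(\mu, \sigma) = \begin{pmatrix} 1/\sigma^2 & 0 \\ 0 & 2/\sigma^2 \end{pmatrix},$$
so $\mu$ and $\sigma$ are orthogonal parameters; correspondingly, the MLEs $\bar{x}$ and $\hat\sigma$ from a Gaussian sample are independent (exactly, in this case).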
