Fisher information (named after Ronald Fisher, who came up with ANOVA and maximum likelihood estimation) measures the amount of information that an observed random variable X carries about an unknown parameter θ.
Let $p(X \mid \theta)$ be the likelihood. Then the log-likelihood is
$$\mathrm{LL}(\theta) = \log p(X \mid \theta)$$
We define the score as
$$\operatorname{score}(\theta) = \frac{\partial}{\partial \theta} \log p(X \mid \theta)$$
Under regularity conditions (which allow us to exchange differentiation and integration), the first moment of the score is 0:
$$\mathbb{E}_X[\operatorname{score}(\theta) \mid \theta] = \int \left( \frac{\partial}{\partial \theta} \log p(x \mid \theta) \right) p(x \mid \theta) \, dx = \int \left( \frac{1}{p(x \mid \theta)} \frac{\partial}{\partial \theta} p(x \mid \theta) \right) p(x \mid \theta) \, dx = \frac{\partial}{\partial \theta} \int p(x \mid \theta) \, dx = \frac{\partial}{\partial \theta} 1 = 0$$
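As a quick numerical sanity check (a minimal sketch, assuming $X \sim \mathcal{N}(\theta, \sigma^2)$ with $\sigma$ known), the Monte Carlo mean of the score is indeed close to zero:

```python
import numpy as np

rng = np.random.default_rng(0)

theta, sigma = 2.0, 1.5   # true mean and (known) standard deviation
x = rng.normal(theta, sigma, size=1_000_000)

# For X ~ N(theta, sigma^2) with sigma known, the score with respect to theta
# is d/dtheta log p(X | theta) = (X - theta) / sigma^2.
score = (x - theta) / sigma**2

print(score.mean())  # close to 0, consistent with E[score | theta] = 0
```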
The Fisher information is the second moment of the score (which, since the score has mean zero, is also its variance):
$$I(\theta) = \mathbb{E}_X\!\left[ \left( \frac{\partial}{\partial \theta} \log p(X \mid \theta) \right)^2 \,\middle|\, \theta \right]$$
Under regularity conditions ($p(X \mid \theta)$ twice differentiable in θ, with differentiation and integration interchangeable), a bit of algebra (differentiating the identity $\mathbb{E}_X[\operatorname{score}(\theta) \mid \theta] = 0$ once more) gives
$$I(\theta) = -\mathbb{E}_X\!\left[ \frac{\partial^2}{\partial \theta^2} \log p(X \mid \theta) \,\middle|\, \theta \right]$$
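As a concrete worked example (the standard Bernoulli case, not derived in the notes above), take a single observation $X \sim \mathrm{Bernoulli}(\theta)$; the second-derivative formula gives

```latex
% Bernoulli(theta): log p(X | theta) = X log(theta) + (1 - X) log(1 - theta)
\begin{aligned}
\frac{\partial}{\partial\theta}\log p(X\mid\theta)
  &= \frac{X}{\theta} - \frac{1-X}{1-\theta}, \\[2pt]
\frac{\partial^{2}}{\partial\theta^{2}}\log p(X\mid\theta)
  &= -\frac{X}{\theta^{2}} - \frac{1-X}{(1-\theta)^{2}}, \\[2pt]
I(\theta) = -\mathbb{E}_X\!\left[\frac{\partial^{2}}{\partial\theta^{2}}\log p(X\mid\theta)\,\middle|\,\theta\right]
  &= \frac{\mathbb{E}[X]}{\theta^{2}} + \frac{1-\mathbb{E}[X]}{(1-\theta)^{2}}
   = \frac{1}{\theta} + \frac{1}{1-\theta}
   = \frac{1}{\theta(1-\theta)}.
\end{aligned}
```

Computing $\mathbb{E}_X[\operatorname{score}(\theta)^2 \mid \theta]$ directly gives the same $1/(\theta(1-\theta))$, as it must.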
Intuition
Fisher information measures the curvature of the log-likelihood. If we plot the log-likelihood, high curvature = a sharp, narrow peak (a deep valley of the negative log-likelihood) = easy to pin down the optimal θ = X tells us a lot about θ = high Fisher information.
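To make the curvature picture concrete, here is a small numerical sketch (assuming the Bernoulli model from the example above): the negative second derivative of the log-likelihood at the true θ grows roughly linearly with the number of observations, i.e. more data means a sharper peak and more information.

```python
import numpy as np

rng = np.random.default_rng(1)
theta_true = 0.3

def neg_curvature(x, theta):
    """Negative second derivative of the Bernoulli log-likelihood at theta
    (the observed information): sum_i [ x_i/theta^2 + (1 - x_i)/(1 - theta)^2 ]."""
    return np.sum(x / theta**2 + (1 - x) / (1 - theta) ** 2)

for n in (10, 100, 1000):
    x = rng.binomial(1, theta_true, size=n)
    print(n, neg_curvature(x, theta_true))
# The curvature scales like n / (theta (1 - theta)): more observations carve
# a sharper peak around theta, i.e. X carries more information about theta.
```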
Properties
If X and Y are independent, then $I_{X,Y}(\theta) = I_X(\theta) + I_Y(\theta)$, i.e. information from independent observations adds up (see the numerical check after this list).
If $T(X)$ is a sufficient statistic for θ (i.e., $p(X = x \mid T(X) = t, \theta)$ does not depend on θ), then $I_T(\theta) = I_X(\theta)$.
For a general statistic $T(X)$, we only get $I_T(\theta) \le I_X(\theta)$.
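A quick Monte Carlo check of the additivity property (a sketch, assuming two independent Gaussian observations with known standard deviation σ, for which $I_X(\theta) = 1/\sigma^2$):

```python
import numpy as np

rng = np.random.default_rng(2)
theta, sigma = 0.5, 2.0
m = 500_000  # Monte Carlo replications

# X and Y are independent N(theta, sigma^2) observations; the score of the
# pair is the sum of the individual scores, so I_{X,Y}(theta) should be 2/sigma^2.
xy = rng.normal(theta, sigma, size=(m, 2))
score_pair = ((xy - theta) / sigma**2).sum(axis=1)

print(np.mean(score_pair**2))  # Monte Carlo estimate of I_{X,Y}(theta)
print(2 / sigma**2)            # I_X(theta) + I_Y(theta)
```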
Theorem (Cramér-Rao Bound) For any unbiased estimator $\hat\theta$ of θ,
$$\operatorname{Var}[\hat\theta] \ge \frac{1}{I(\theta)}$$
This makes sense: less information means it is more difficult to pinpoint θ, and thus the variance of any unbiased estimator increases.
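A short simulation (a sketch, assuming the Gaussian-mean model with known σ) shows the bound being attained: the sample mean is unbiased for θ and its variance matches $1/(n\,I_1(\theta))$, where $I_1(\theta) = 1/\sigma^2$ is the information in a single observation.

```python
import numpy as np

rng = np.random.default_rng(3)
theta, sigma, n, reps = 1.0, 2.0, 50, 200_000

# The sample mean is an unbiased estimator of theta; for n i.i.d. observations
# the Cramér-Rao bound is 1 / (n * I_1(theta)) = sigma^2 / n.
samples = rng.normal(theta, sigma, size=(reps, n))
theta_hat = samples.mean(axis=1)

print(theta_hat.var())  # empirical variance of the estimator
print(sigma**2 / n)     # Cramér-Rao bound, attained by the sample mean
```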
Definition (Jeffreys Prior) The Jeffreys prior is defined as
$$p_\theta(\theta) \propto \sqrt{I(\theta)}$$
The Jeffreys prior is an uninformative prior that is not sensitive to parameterization; i.e., both the original $p_\theta(\theta)$ and the transformed density
$$p_\phi(\phi) = p_\theta(\theta) \left| \frac{d\theta}{d\phi} \right|$$
for any reparameterization $\phi = h(\theta)$ will be uninformative (indeed, $p_\phi(\phi) \propto \sqrt{I(\phi)}$ again).
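For example (reusing the Bernoulli computation above): with $I(\theta) = 1/(\theta(1-\theta))$, the Jeffreys prior for a Bernoulli parameter is the Beta(1/2, 1/2) distribution:

```latex
p_\theta(\theta) \propto \sqrt{I(\theta)}
  = \frac{1}{\sqrt{\theta(1-\theta)}}
  \propto \theta^{-1/2}(1-\theta)^{-1/2},
\qquad\text{i.e. }\ \theta \sim \mathrm{Beta}\!\left(\tfrac{1}{2},\tfrac{1}{2}\right).
```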
Fisher Information Matrix
Suppose we have N parameters $\theta = (\theta_1, \ldots, \theta_N)$. The Fisher information becomes an $N \times N$ matrix $I(\theta)$ with entries
$$I(\theta)_{ij} = \mathbb{E}_X\!\left[ \left( \frac{\partial}{\partial \theta_i} \log p(X \mid \theta) \right) \left( \frac{\partial}{\partial \theta_j} \log p(X \mid \theta) \right) \,\middle|\, \theta \right]$$
Note that $I(\theta) \succeq 0$: it is positive semidefinite, being the covariance matrix of the score vector.
Again, under regularity conditions, we also get
$$I(\theta)_{ij} = -\mathbb{E}_X\!\left[ \frac{\partial^2}{\partial \theta_i \, \partial \theta_j} \log p(X \mid \theta) \,\middle|\, \theta \right]$$
If $I(\theta)_{ij} = 0$, we say that $\theta_i$ and $\theta_j$ are orthogonal parameters; their MLEs are asymptotically independent.
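As a sketch of orthogonal parameters (assuming a Gaussian with unknown mean μ and unknown standard deviation σ), a Monte Carlo estimate of the 2×2 Fisher information matrix comes out essentially diagonal, so μ and σ are orthogonal:

```python
import numpy as np

rng = np.random.default_rng(4)
mu, sigma = 0.5, 1.5
m = 1_000_000

# For N(mu, sigma^2), the score vector has components
#   d/dmu    log p = (x - mu) / sigma^2
#   d/dsigma log p = -1/sigma + (x - mu)^2 / sigma^3
x = rng.normal(mu, sigma, size=m)
score_mu = (x - mu) / sigma**2
score_sigma = -1 / sigma + (x - mu) ** 2 / sigma**3

scores = np.stack([score_mu, score_sigma])   # shape (2, m)
print(scores @ scores.T / m)  # ~ [[1/sigma^2, 0], [0, 2/sigma^2]]
# Off-diagonal entries are ~0: mu and sigma are orthogonal parameters.
```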