Paper (Ganchev et al., 2010) Posterior Regularization for Structured Latent Variable Models
Setup: Generative Model
- $x$ = input, $y$ = target
- Labeled examples: $X_L$ = list of inputs; $Y_L$ = list of corresponding labels
- Unlabeled examples: $X$ = list of unlabeled inputs
- Assume a generative model $p_\theta(x, y)$.
- Define the parameter-regularized log-likelihood of the data:$$\begin{aligned} L(\theta) &= \log p_\theta(X_L, Y_L) &&\text{(log-likelihood of supervised data)} \\ & + \log \sum_Y p_\theta(X, Y) &&\text{(log-likelihood of unsupervised data)} \\ & + \log p(\theta) &&\text{(prior as regularization)} \end{aligned}$$
- Example POS tagging with HMM:$$p_\theta(x, y) = \prod_i p_\theta(y_i \mid y_{i-1})\,p_\theta(x_i \mid y_i)$$
- $p_\theta(X_L, Y_L) = \prod_{(x,y)} p_\theta(x, y)$ over labeled $(x, y)$ pairs.
- $\sum_Y p_\theta(X, Y) = \prod_x \sum_y p_\theta(x, y)$ over unlabeled $x$.
- The term $\sum_y p_\theta(x, y)$ can be computed with the forward algorithm (dynamic programming over tags); Viterbi would give only the single best $y$, not the sum. A sketch follows this list.
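To make this concrete, here is a minimal NumPy sketch of the forward algorithm for one sentence, assuming hypothetical initial/transition/emission tables for the HMM above (names and shapes are illustrative, not from the paper):

```python
import numpy as np

def hmm_marginal_likelihood(obs, pi, trans, emit):
    """Compute sum_y p_theta(x, y) for one sentence with the forward algorithm.

    obs   : list of word indices x_1 .. x_T
    pi    : (K,)   initial tag distribution p(y_1)
    trans : (K, K) transition probabilities, trans[j, k] = p(y_t = k | y_{t-1} = j)
    emit  : (K, V) emission probabilities,   emit[k, w]  = p(x_t = w | y_t = k)
    """
    alpha = pi * emit[:, obs[0]]               # alpha_1(k) = p(y_1 = k) p(x_1 | k)
    for t in range(1, len(obs)):
        # alpha_t(k) = sum_j alpha_{t-1}(j) p(y_t = k | y_{t-1} = j) p(x_t | y_t = k)
        alpha = (alpha @ trans) * emit[:, obs[t]]
    return alpha.sum()                         # marginalize the final tag: p(x)
```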
Posterior Constraints
- We want to control how $p_\theta(y \mid x)$ behaves on unlabeled data.
- For example, for POS tagging, we might want $y$ to contain at least one verb.
- We will enforce such a constraint in expectation.
- For example, let $\phi(x, y) =$ "negative number of verbs in $y$" (the negation turns the "at least one verb" requirement into a $\leq$ constraint). For each input $x$, we want $\ex{\phi(x, y)} \leq -1$, where the expectation is over $p_\theta(y \mid x)$ (see the sketch after this list).
- Such $\phi(x, y)$ is called a constraint feature.
- More formally, given an input $x$, define a posterior regularization set$$Q_x = \set{ q_x(y) \midd \exx{q}{\phi(x, y)} \leq b }$$ Intuitively, this is the set of "allowed" distributions over $y$.
- Now let's generalize from a single input $x$ to the entire corpus $X$. Let$$Q = \set{ q(Y) \midd \exx{q}{\phi(X, Y)} \leq \Mb{b} }$$ where $\Mb{b}$ is a vector with length $|X|$.
- We can also define multiple constraint features per input. For example, if we want at least one verb and one noun for POS, we can have $2|X|$ constraint features. The vector $\Mb{b}$ will now have length $2|X|$.
- Goal We want $p_\theta(Y \mid X)$ to lie in $Q$, or failing that, to be as close to $Q$ as possible. So we define a posterior regularizer:$$\kl{Q}{p_\theta(Y \mid X)} = \min_{q \in Q} \kl{q(Y)}{p_\theta(Y \mid X)}$$ and subtract it from the log-likelihood to form the posterior-regularized log-likelihood:$$J_Q(\theta) = L(\theta) - \kl{Q}{p_\theta(Y\mid X)}$$
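To make the constraint-feature idea concrete, here is a minimal sketch (hypothetical tag set and per-position tag posteriors, not from the paper) that evaluates the "at least one verb in expectation" constraint for a single sentence:

```python
import numpy as np

VERB = 2  # hypothetical index of the VERB tag

def expected_neg_verb_count(tag_posteriors):
    """phi(x, y) = -(number of verbs in y); return E_q[phi(x, y)].

    tag_posteriors : (T, K) array; row t holds the marginals q(y_t = k).
    (For a chain model these marginals come from forward-backward.)
    """
    # By linearity of expectation, E_q[#verbs] = sum_t q(y_t = VERB).
    return -tag_posteriors[:, VERB].sum()

def in_Q_x(tag_posteriors, b=-1.0):
    # q is "allowed" (i.e., in Q_x) iff E_q[phi(x, y)] <= b
    return expected_neg_verb_count(tag_posteriors) <= b
```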
Adding Slack
- We can allow some slack using slack variables:$$Q = \set{ q(Y) \midd \exists\Bs{\xi},\; \exx{q}{\phi(X, Y)} - \Mb{b} \leq \Bs{\xi},\; \norm{\Bs{\xi}}_\beta \leq \epsilon }$$ for some norm $\norm{\cdot}_\beta$. We will mainly use this slack-constrained formulation.
- Alternatively, instead of a fixed bound $\epsilon$ on the slack norm, we can penalize the slack norm in the objective function:$$L(\theta) - \crab{\begin{aligned} \min_{q, \Bs{\xi}} &&& \kl{q(Y)}{p_\theta(Y \mid X)} + \sigma\norm{\Bs{\xi}}_\beta \\ \text{s.t.} &&& \exx{q}{\phi(X, Y)} - \Mb{b} \leq \Bs{\xi} \end{aligned}}$$ This slack-penalized version works similarly to the slack-constrained version.
Computing the Posterior Regularizer
Let's write the posterior regularizer (with slack) as a convex optimization problem:
$$\begin{aligned} \min_{q, \Bs{\xi}} &&& \kl{q(Y)}{p_\theta(Y \mid X)} \\ \text{s.t.} &&& \exx{q}{\phi(X, Y)} - \Mb{b} \leq \Bs{\xi} \\ &&& \norm{\Bs{\xi}}_\beta \leq \epsilon \end{aligned}$$After some algebra (forming the Lagrangian and eliminating the primal variables $q$ and $\Bs{\xi}$), we get the dual problem:
$$\max_{\Bs{\lambda} \geq 0}\quad - \Mb{b}\T\Bs{\lambda} - \log Z(\Bs{\lambda}) - \epsilon \norm{\Bs{\lambda}}_{\beta^*}$$where
$$Z(\Bs{\lambda}) = \sum_Y p_\theta(Y \mid X) \exp\crab{- \Bs{\lambda}\T\phi(X, Y)}$$and $\norm{\cdot}_{\beta^*}$ is the dual norm of $\norm{\cdot}_{\beta}$.
After getting the dual solution $\Bs{\lambda}^*$, the unique primal solution $q^*$ can be computed as
$$q^*(Y) = \frac{1}{Z(\Bs{\lambda}^*)} p_\theta(Y \mid X) \exp\crab{- {\Bs{\lambda}^*}\T \phi(X, Y)}$$If $\phi$ can be written as a sum of local features, then $\log Z(\Bs{\lambda})$ and its gradient, $\nabla_{\Bs{\lambda}} \log Z(\Bs{\lambda}) = -\exx{q_{\Bs{\lambda}}}{\phi(X, Y)}$ (where $q_{\Bs{\lambda}}$ is the reweighted distribution above at the current $\Bs{\lambda}$), can be computed with the same dynamic program as ordinary inference, so $\Bs{\lambda}$ can be optimized by projected gradient ascent (a sketch follows at the end of this section).
If $\phi$ is more complex, exact inference might be intractable, and one would have to fall back on approximations such as beam search.
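A minimal sketch of the dual optimization, under two assumptions not fixed by the notes above: the slack norm is the 2-norm (so the dual norm is also the 2-norm), and a hypothetical `expected_phi_under_q` callback returns $\exx{q_{\Bs{\lambda}}}{\phi(X, Y)}$ for the current $\Bs{\lambda}$ (for decomposable $\phi$, forward-backward on the reweighted model). Projection onto $\Bs{\lambda} \geq 0$ is elementwise clipping:

```python
import numpy as np

def optimize_dual(b, epsilon, expected_phi_under_q, n_steps=100, lr=0.1):
    """Projected gradient ascent on the PR dual
        max_{lam >= 0}  -b.lam - log Z(lam) - epsilon * ||lam||_2.

    b                    : (d,) constraint bounds
    epsilon              : slack bound
    expected_phi_under_q : hypothetical callback; given lam, returns
                           E_{q_lam}[phi(X, Y)] as a (d,) array, where
                           q_lam(Y) is proportional to
                           p_theta(Y | X) * exp(-lam . phi(X, Y)).
    """
    lam = np.zeros_like(b, dtype=float)
    for _ in range(n_steps):
        # grad log Z(lam) = -E_{q_lam}[phi(X, Y)]
        grad = -b + expected_phi_under_q(lam)
        norm = np.linalg.norm(lam)
        if norm > 0:                            # subgradient of eps * ||lam||_2
            grad -= epsilon * lam / norm
        lam = np.maximum(lam + lr * grad, 0.0)  # ascent step, then project to lam >= 0
    return lam
```

The primal $q^*$ is then recovered from the returned $\Bs{\lambda}$ via the exponentiated reweighting formula above.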
Optimizing the Posterior-Regularized Log-Likelihood
- Consider $L(\theta)$. Let's ignore labeled data and prior. We have$$\begin{aligned} L(\theta) &= \log p_\theta(X) \\ & = \log \sum_Y p_\theta(X, Y) \end{aligned}$$
- To optimize $L(\theta)$ with EM, we start by lower-bounding $L(\theta)$:$$\begin{aligned} L(\theta) & = \log\crab{\sum_Y q(Y)\,\frac{p_\theta(X, Y)}{q(Y)}} &&\text{(multiply and divide by }q(Y)\text{)} \\ & \geq \sum_Y q(Y) \log\crab{\frac{p_\theta(X, Y)}{q(Y)}} &&\text{(Jensen!)} \\ & =: F(q, \theta) &&\text{(Let's name it)} \end{aligned}$$
- We maximize this bound $F(q, \theta)$ using coordinate ascent: we maintain an estimate $(q, \theta)$ of the solution and update $q$ and $\theta$ in alternation.
- E-step Update $q$:$$q_\text{new} = \argmax_q F(q, \theta_\text{old})$$ To compute this, we write$$\begin{aligned} F(q, \theta) & = \sum_Y q(Y) \log\crab{\frac{p_\theta(X, Y)}{q(Y)}} \\ & = \sum_Y q(Y) \log\crab{p_\theta(X) p_\theta(Y \mid X)} - \sum_Y q(Y) \log{q(Y)} \\ & = \log p_\theta(X) - \sum_Y q(Y) \log\crab{\frac{q(Y)}{p_\theta(Y \mid X)}} \qquad\qquad\nail{\because\sum_Y q(Y) = 1} \\ & = L(\theta) - \kl{q(Y)}{p_\theta(Y \mid X)} \end{aligned}$$ Since $L(\theta)$ does not depend on $q$, we get$$q_\text{new}= \argmin_q \kl{q(Y)}{p_\theta(Y \mid X)}$$ Right now $q$ is unconstrained, so we get $q_\text{new} = p_\theta(Y \mid X)$.
- M-step Update $\theta$:$$\theta_\text{new} = \argmax_\theta F(q_\text{old}, \theta)$$ To compute this, we write$$\begin{aligned} F(q, \theta) & = \sum_Y q(Y) \log\crab{\frac{p_\theta(X, Y)}{q(Y)}} \\ & = \sum_Y q(Y) \log p_\theta(X, Y) - \sum_Y q(Y) \log q(Y) \\ & = \exx{q}{\log p_\theta (X, Y)} - \exx{q}{\log q(Y)} \end{aligned}$$ Since $\exx{q}{\log q(Y)}$ does not depend on $\theta$, we get$$\theta_\text{new} = \argmax_\theta \exx{q_\text{old}}{\log p_\theta(X, Y)}$$ This can be computed by "soft-counting" (for simple models) or by a gradient update (for neural networks).
- Now let's add posterior regularization (essentially a constraint on $q$):$$J_Q(\theta) = \max_{q \in Q} F(q, \theta)$$ In the E-step, we can no longer jump from $\argmin_q \Mr{KL}(\dots)$ to $p_\theta (Y \mid X)$; we need the method from the Computing the Posterior Regularizer section above to compute the argmin. (A sketch of the full loop follows this list.)
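Putting the pieces together, a minimal sketch of the posterior-regularized EM loop, with hypothetical `posterior_inference`, `project_onto_Q`, and `m_step` callbacks standing in for the model-specific routines described above:

```python
def pr_em(X, theta, posterior_inference, project_onto_Q, m_step, n_iters=20):
    """EM with posterior regularization on unlabeled data X.

    posterior_inference(theta, X) -> representation of p_theta(Y | X)
    project_onto_Q(p)             -> argmin_{q in Q} KL(q || p), i.e. the dual
                                     optimization from the previous section
    m_step(q, X)                  -> argmax_theta E_q[log p_theta(X, Y)]
                                     (soft counts, or a gradient update)
    """
    for _ in range(n_iters):
        p = posterior_inference(theta, X)  # ordinary E-step posterior
        q = project_onto_Q(p)              # modified E-step: KL-project onto Q
        theta = m_step(q, X)               # M-step is unchanged
    return theta
```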
Discriminative Model
- We can also apply posterior regularization to discriminative models.
- The likelihood only has the supervised part:$$L(\theta) = \log p_\theta (Y_L \mid X_L) + \log p(\theta)$$
- The regularized likelihood is still$$J_Q(\theta) = L(\theta) - \kl{Q}{p_\theta(Y \mid X)}$$
- The EM algorithm for optimizing $J_Q(\theta)$ is similar, with the M-step changed to$$\theta_\text{new} = \argmax_\theta \exx{q_\text{old}}{\log p_\theta(Y \mid X)}$$ (The difference is having $p_\theta(Y \mid X)$ instead of $p_\theta(X, Y)$.)
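A minimal sketch of this discriminative M-step, assuming a hypothetical `grad_conditional_ll` routine that returns the gradient of $\exx{q}{\log p_\theta(Y \mid X)}$ with respect to $\theta$ (for a log-linear model this is the $q$-expected features minus the model-expected features):

```python
def discriminative_m_step(q, X, theta, grad_conditional_ll, lr=0.1, n_steps=50):
    """Maximize E_q[log p_theta(Y | X)] by gradient ascent on theta."""
    for _ in range(n_steps):
        theta = theta + lr * grad_conditional_ll(theta, X, q)
    return theta
```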