Paper (Ganchev et al., 2010) Posterior Regularization for Structured Latent Variable Models
Setup: Generative Model
- $x$ = input, $y$ = target
- Labeled examples: $X_L$ = list of inputs; $Y_L$ = list of corresponding labels
- Unlabeled examples: $X$ = list of unlabeled inputs
- Assume a generative model $p_\theta(x, y)$.
- Define the parameter-regularized log-likelihood of the data:$$\begin{aligned} L(\theta) &= \log p_\theta(X_L, Y_L) &&\text{(log-likelihood of supervised data)} \\ & + \log \sum_Y p_\theta(X, Y) &&\text{(log-likelihood of unsupervised data)} \\ & + \log p(\theta) &&\text{(prior as regularization)} \end{aligned}$$
- Example POS tagging with HMM:$$p_\theta(x, y) = \prod_i p_\theta(y_i \mid y_{i-1})\,p_\theta(x_i \mid y_i)$$
- $p_\theta(X_L, Y_L) = \prod_{(x,y)} p_\theta(x, y)$ over labeled $(x, y)$ pairs.
- $\sum_Y p_\theta(X, Y) = \prod_x \sum_y p_\theta(x, y)$ over unlabeled $x$.
- The term $\sum_y p_\theta(x, y)$ can be computed with the forward algorithm (dynamic programming over tags); Viterbi would give only the single best $y$, not the sum. A sketch follows this list.
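To make this concrete, here is a minimal NumPy sketch of the forward algorithm for one sentence, assuming hypothetical initial/transition/emission tables for the HMM above (names and shapes are illustrative, not from the paper):

```python
import numpy as np

def hmm_marginal_likelihood(obs, pi, trans, emit):
    """Compute sum_y p_theta(x, y) for one sentence with the forward algorithm.

    obs   : list of word indices x_1 .. x_T
    pi    : (K,)   initial tag distribution p(y_1)
    trans : (K, K) transition probabilities, trans[j, k] = p(y_t = k | y_{t-1} = j)
    emit  : (K, V) emission probabilities,   emit[k, w]  = p(x_t = w | y_t = k)
    """
    alpha = pi * emit[:, obs[0]]               # alpha_1(k) = p(y_1 = k) p(x_1 | k)
    for t in range(1, len(obs)):
        # alpha_t(k) = sum_j alpha_{t-1}(j) p(y_t = k | y_{t-1} = j) p(x_t | y_t = k)
        alpha = (alpha @ trans) * emit[:, obs[t]]
    return alpha.sum()                         # marginalize the final tag: p(x)
```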
Posterior Constraints
- We want to control how $p_\theta(y \mid x)$ behaves on unlabeled data.
- For example, for POS tagging, we might want $y$ to contain at least one verb.
- We will enforce such a constraint in expectation.
- For example, let $\phi(x, y) =$ "negative number of verbs in $y$" (the negation turns the "at least one verb" requirement into a $\leq$ constraint). For each input $x$, we want $\ex{\phi(x, y)} \leq -1$, where the expectation is over $p_\theta(y \mid x)$ (see the sketch after this list).
- Such $\phi(x, y)$ is called a constraint feature.
- More formally, given an input $x$, define a posterior regularization set$$Q_x = \set{ q_x(y) \midd \exx{q}{\phi(x, y)} \leq b }$$ Intuitively, this is the set of "allowed" distributions over $y$.
- Now let's generalize from a single input $x$ to the entire corpus $X$. Let$$Q = \set{ q(Y) \midd \exx{q}{\phi(X, Y)} \leq \Mb{b} }$$ where $\Mb{b}$ is a vector with length $|X|$.
- We can also define multiple constraint features per input. For example, if we want at least one verb and one noun for POS, we can have $2|X|$ constraint features. The vector $\Mb{b}$ will now have length $2|X|$.
- Goal We want $p_\theta(Y \mid X)$ to lie in $Q$, or failing that, to be as close to $Q$ as possible. So we define a posterior regularizer:$$\kl{Q}{p_\theta(Y \mid X)} = \min_{q \in Q} \kl{q(Y)}{p_\theta(Y \mid X)}$$ and subtract it from the log-likelihood to form the posterior-regularized log-likelihood:$$J_Q(\theta) = L(\theta) - \kl{Q}{p_\theta(Y\mid X)}$$
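To make the constraint-feature idea concrete, here is a minimal sketch (hypothetical tag set and per-position tag posteriors, not from the paper) that evaluates the "at least one verb in expectation" constraint for a single sentence:

```python
import numpy as np

VERB = 2  # hypothetical index of the VERB tag

def expected_neg_verb_count(tag_posteriors):
    """phi(x, y) = -(number of verbs in y); return E_q[phi(x, y)].

    tag_posteriors : (T, K) array; row t holds the marginals q(y_t = k).
    (For a chain model these marginals come from forward-backward.)
    """
    # By linearity of expectation, E_q[#verbs] = sum_t q(y_t = VERB).
    return -tag_posteriors[:, VERB].sum()

def in_Q_x(tag_posteriors, b=-1.0):
    # q is "allowed" (i.e., in Q_x) iff E_q[phi(x, y)] <= b
    return expected_neg_verb_count(tag_posteriors) <= b
```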
Adding Slack
- We can allow some slack using slack variables:$$Q = \set{ q(Y) \midd \exists\Bs{\xi},\; \exx{q}{\phi(X, Y)} - \Mb{b} \leq \Bs{\xi},\; \norm{\Bs{\xi}}_\beta \leq \epsilon }$$ for some norm $\norm{\cdot}_\beta$. We will mainly use this slack-constrained formulation.
- Alternatively, instead of a fixed bound $\epsilon$ on the slack norm, we can penalize the slack norm in the objective function:$$L(\theta) - \crab{\begin{aligned} \min_{q, \Bs{\xi}} &&& \kl{q(Y)}{p_\theta(Y \mid X)} + \sigma\norm{\Bs{\xi}}_\beta \\ \text{s.t.} &&& \exx{q}{\phi(X, Y)} - \Mb{b} \leq \Bs{\xi} \end{aligned}}$$ This slack-penalized version works similarly to the slack-constrained version.
Computing the Posterior Regularizer
Let's write the posterior regularizer (with slack) as a convex optimization problem:
$$\begin{aligned} \min_{q, \Bs{\xi}} &&& \kl{q(Y)}{p_\theta(Y \mid X)} \\ \text{s.t.} &&& \exx{q}{\phi(X, Y)} - \Mb{b} \leq \Bs{\xi} \\ &&& \norm{\Bs{\xi}}_\beta \leq \epsilon \end{aligned}$$After some algebra (forming the Lagrangian and eliminating the primal variables $q$ and $\Bs{\xi}$), we get the dual problem:
$$\max_{\Bs{\lambda} \geq 0}\quad - \Mb{b}\T\Bs{\lambda} - \log Z(\Bs{\lambda}) - \epsilon \norm{\Bs{\lambda}}_{\beta^*}$$where
$$Z(\Bs{\lambda}) = \sum_Y p_\theta(Y \mid X) \exp\crab{- \Bs{\lambda}\T\phi(X, Y)}$$and $\norm{\cdot}_{\beta^*}$ is the dual norm of $\norm{\cdot}_{\beta}$.
After getting the dual solution $\Bs{\lambda}^*$, the unique primal solution $q^*$ can be computed as
$$q^*(Y) = \frac{1}{Z(\Bs{\lambda}^*)} p_\theta(Y \mid X) \exp\crab{- {\Bs{\lambda}^*}\T \phi(X, Y)}$$If $\phi$ can be written as a sum of local features, then $\log Z(\Bs{\lambda})$ and its gradient, $\nabla_{\Bs{\lambda}} \log Z(\Bs{\lambda}) = -\exx{q_{\Bs{\lambda}}}{\phi(X, Y)}$ (where $q_{\Bs{\lambda}}$ is the reweighted distribution above at the current $\Bs{\lambda}$), can be computed with the same dynamic program as ordinary inference, so $\Bs{\lambda}$ can be optimized by projected gradient ascent (a sketch follows at the end of this section).
If $\phi$ is more complex, exact inference might be intractable, and one would have to fall back on approximations such as beam search.
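A minimal sketch of the dual optimization, under two assumptions not fixed by the notes above: the slack norm is the 2-norm (so the dual norm is also the 2-norm), and a hypothetical `expected_phi_under_q` callback returns $\exx{q_{\Bs{\lambda}}}{\phi(X, Y)}$ for the current $\Bs{\lambda}$ (for decomposable $\phi$, forward-backward on the reweighted model). Projection onto $\Bs{\lambda} \geq 0$ is elementwise clipping:

```python
import numpy as np

def optimize_dual(b, epsilon, expected_phi_under_q, n_steps=100, lr=0.1):
    """Projected gradient ascent on the PR dual
        max_{lam >= 0}  -b.lam - log Z(lam) - epsilon * ||lam||_2.

    b                    : (d,) constraint bounds
    epsilon              : slack bound
    expected_phi_under_q : hypothetical callback; given lam, returns
                           E_{q_lam}[phi(X, Y)] as a (d,) array, where
                           q_lam(Y) is proportional to
                           p_theta(Y | X) * exp(-lam . phi(X, Y)).
    """
    lam = np.zeros_like(b, dtype=float)
    for _ in range(n_steps):
        # grad log Z(lam) = -E_{q_lam}[phi(X, Y)]
        grad = -b + expected_phi_under_q(lam)
        norm = np.linalg.norm(lam)
        if norm > 0:                            # subgradient of eps * ||lam||_2
            grad -= epsilon * lam / norm
        lam = np.maximum(lam + lr * grad, 0.0)  # ascent step, then project to lam >= 0
    return lam
```

The primal $q^*$ is then recovered from the returned $\Bs{\lambda}$ via the exponentiated reweighting formula above.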
Optimizing the Posterior-Regularized Log-Likelihood
- Consider $L(\theta)$. Let's ignore labeled data and prior. We have$$\begin{aligned} L(\theta) &= \log p_\theta(X) \\ & = \log \sum_Y p_\theta(X, Y) \end{aligned}$$
- To optimize $L(\theta)$ with EM, we start by lower-bounding $L(\theta)$:$$\begin{aligned} L(\theta) & = \log\crab{\sum_Y q(Y)\,\frac{p_\theta(X, Y)}{q(Y)}} &&\text{(multiply and divide by }q(Y)\text{)} \\ & \geq \sum_Y q(Y) \log\crab{\frac{p_\theta(X, Y)}{q(Y)}} &&\text{(Jensen!)} \\ & =: F(q, \theta) &&\text{(Let's name it)} \end{aligned}$$
- We maximize this bound $F(q, \theta)$ using coordinate ascent: we maintain an estimate $(q, \theta)$ of the solution and update $q$ and $\theta$ in alternation.
- E-step Update $q$:$$q_\text{new} = \argmax_q F(q, \theta_\text{old})$$ To compute this, we write$$\begin{aligned} F(q, \theta) & = \sum_Y q(Y) \log\crab{\frac{p_\theta(X, Y)}{q(Y)}} \\ & = \sum_Y q(Y) \log\crab{p_\theta(X) p_\theta(Y \mid X)} - \sum_Y q(Y) \log{q(Y)} \\ & = \log p_\theta(X) - \sum_Y q(Y) \log\crab{\frac{q(Y)}{p_\theta(Y \mid X)}} \qquad\qquad\nail{\because\sum_Y q(Y) = 1} \\ & = L(\theta) - \kl{q(Y)}{p_\theta(Y \mid X)} \end{aligned}$$ Since $L(\theta)$ does not depend on $q$, we get$$q_\text{new}= \argmin_q \kl{q(Y)}{p_\theta(Y \mid X)}$$ Right now $q$ is unconstrained, so we get $q_\text{new} = p_\theta(Y \mid X)$.
- M-step Update $\theta$:$$\theta_\text{new} = \argmax_\theta F(q_\text{old}, \theta)$$ To compute this, we write$$\begin{aligned} F(q, \theta) & = \sum_Y q(Y) \log\crab{\frac{p_\theta(X, Y)}{q(Y)}} \\ & = \sum_Y q(Y) \log p_\theta(X, Y) - \sum_Y q(Y) \log q(Y) \\ & = \exx{q}{\log p_\theta (X, Y)} - \exx{q}{\log q(Y)} \end{aligned}$$ Since $\exx{q}{\log q(Y)}$ does not depend on $\theta$, we get$$\theta_\text{new} = \argmax_\theta \exx{q_\text{old}}{\log p_\theta(X, Y)}$$ This can be computed by "soft-counting" (for simple models) or by a gradient update (for neural networks).
- Now let's add posterior regularization (essentially a constraint on $q$):$$J_Q(\theta) = \max_{q \in Q} F(q, \theta)$$ In the E-step, we can no longer jump from $\argmin_q \Mr{KL}(\dots)$ to $p_\theta (Y \mid X)$; we need the method from the Computing the Posterior Regularizer section above to compute the argmin. (A sketch of the full loop follows this list.)
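Putting the pieces together, a minimal sketch of the posterior-regularized EM loop, with hypothetical `posterior_inference`, `project_onto_Q`, and `m_step` callbacks standing in for the model-specific routines described above:

```python
def pr_em(X, theta, posterior_inference, project_onto_Q, m_step, n_iters=20):
    """EM with posterior regularization on unlabeled data X.

    posterior_inference(theta, X) -> representation of p_theta(Y | X)
    project_onto_Q(p)             -> argmin_{q in Q} KL(q || p), i.e. the dual
                                     optimization from the previous section
    m_step(q, X)                  -> argmax_theta E_q[log p_theta(X, Y)]
                                     (soft counts, or a gradient update)
    """
    for _ in range(n_iters):
        p = posterior_inference(theta, X)  # ordinary E-step posterior
        q = project_onto_Q(p)              # modified E-step: KL-project onto Q
        theta = m_step(q, X)               # M-step is unchanged
    return theta
```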
Discriminative Model
- We can also apply posterior regularization to discriminative models.
- The likelihood only has the supervised part:$$L(\theta) = \log p_\theta (Y_L \mid X_L) + \log p(\theta)$$
- The regularized likelihood is still$$J_Q(\theta) = L(\theta) - \kl{Q}{p_\theta(Y \mid X)}$$
- The EM algorithm for optimizing $J_Q(\theta)$ is similar, with the M-step changed to$$\theta_\text{new} = \argmax_\theta \exx{q_\text{old}}{\log p_\theta(Y \mid X)}$$ (The difference is having $p_\theta(Y \mid X)$ instead of $p_\theta(X, Y)$.)
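A minimal sketch of this discriminative M-step, assuming a hypothetical `grad_conditional_ll` routine that returns the gradient of $\exx{q}{\log p_\theta(Y \mid X)}$ with respect to $\theta$ (for a log-linear model this is the $q$-expected features minus the model-expected features):

```python
def discriminative_m_step(q, X, theta, grad_conditional_ll, lr=0.1, n_steps=50):
    """Maximize E_q[log p_theta(Y | X)] by gradient ascent on theta."""
    for _ in range(n_steps):
        theta = theta + lr * grad_conditional_ll(theta, X, q)
    return theta
```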