Posterior Regularization

Paper Ganchev et al. (2010), Posterior Regularization for Structured Latent Variable Models

Setup: Generative Model

  • $x$ = input, $y$ = target
  • Labeled examples: $X_L$ = list of inputs; $Y_L$ = list of corresponding labels
  • Unlabeled examples: $X$ = list of unlabeled inputs
  • Assume a generative model $p_\theta(x, y)$.
  • Define the parameter-regularized log-likelihood of the data:$$\begin{aligned} L(\theta) &= \log p_\theta(X_L, Y_L) &&\text{(log-likelihood of supervised data)} \\ & + \log \sum_Y p_\theta(X, Y) &&\text{(log-likelihood of unsupervised data)} \\ & + \log p(\theta) &&\text{(prior as regularization)} \end{aligned}$$
  • Example POS tagging with HMM:$$p_\theta(x, y) = \prod_i p_\theta(y_i \mid y_{i-1})\,p_\theta(x_i \mid y_i)$$
    • $p_\theta(X_L, Y_L) = \prod_{(x,y)} p_\theta(x, y)$ over labeled $(x, y)$ pairs.
    • $\sum_Y p_\theta(X, Y) = \prod_x \sum_y p_\theta(x, y)$ over unlabeled $x$.
      • The term $\sum_y p_\theta(x, y)$ can be computed with the forward algorithm (Viterbi gives the max over $y$, not the sum); see the sketch below.
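    • A minimal sketch of this forward computation, which costs $O(nK^2)$ for a length-$n$ sentence with $K$ tags instead of the $O(K^n)$ of explicit enumeration (the array names `trans`, `emit`, `init` and the integer word encoding are assumptions for illustration, not the paper's notation):

```python
import numpy as np

def hmm_marginal_likelihood(words, trans, emit, init):
    """p(x) = sum_y p(x, y) for one sentence, via the forward algorithm.

    words : list of integer word ids
    trans : (K, K) array, trans[i, j] = p(y_t = j | y_{t-1} = i)
    emit  : (K, V) array, emit[i, w]  = p(x_t = w | y_t = i)
    init  : (K,)  array,  init[i]    = p(y_1 = i)
    """
    # alpha[i] = p(x_1 .. x_t, y_t = i), updated in place over positions t
    alpha = init * emit[:, words[0]]
    for w in words[1:]:
        alpha = (alpha @ trans) * emit[:, w]
    return alpha.sum()
```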

Posterior Constraints

  • We want to control how $p_\theta(y \mid x)$ behaves on unlabeled data.
    • For example, for POS tagging, we might want $y$ to contain at least one verb.
  • We will enforce such a constraint in expectation.
    • For example, let $\phi(x, y) =$ "negative number of verbs in $y$" (negated so that the constraint can be written as a $\leq$ inequality). For each input $x$, we want $\exx{p_\theta}{\phi(x, y)} \leq -1$, i.e., at least one verb in expectation under $p_\theta(y \mid x)$ (see the toy example at the end of this section).
    • Such $\phi(x, y)$ is called a constraint feature.
  • More formally, given an input $x$, define a posterior regularization set$$Q_x = \set{ q_x(y) \midd \exx{q_x}{\phi(x, y)} \leq b }$$ Intuitively, this is the set of "allowed" distributions over $y$.
  • Now let's generalize from a single input $x$ to the entire corpus $X$. Let$$Q = \set{ q(Y) \midd \exx{q}{\phi(X, Y)} \leq \Mb{b} }$$ where $\Mb{b}$ is a vector with length $|X|$.
    • We can also define multiple constraint features per input. For example, if we want at least one verb and one noun for POS, we can have $2|X|$ constraint features. The vector $\Mb{b}$ will now have length $2|X|$.
  • Goal We want $p_\theta(Y \mid X)$ to be in $Q$, or at least as close to $Q$ as possible. So we define a posterior regularizer:$$\kl{Q}{p_\theta(Y \mid X)} = \min_{q \in Q} \kl{q(Y)}{p_\theta(Y \mid X)}$$ and subtract it from the log-likelihood to form the posterior-regularized log-likelihood:$$J_Q(\theta) = L(\theta) - \kl{Q}{p_\theta(Y\mid X)}$$
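  • A toy numeric check of the definitions above (the 3-token sentence, the two-tag set, and the uniform $q$ are invented purely for illustration):

```python
import itertools

# Toy setup: a 3-token sentence whose tags come from {"N", "V"},
# and some distribution q over tag sequences (uniform here, just as an example).
tags = ["N", "V"]
sequences = list(itertools.product(tags, repeat=3))
q = {y: 1.0 / len(sequences) for y in sequences}

def phi(y):
    """Constraint feature: negative number of verbs in the tag sequence y."""
    return -sum(tag == "V" for tag in y)

b = -1.0  # "at least one verb in expectation"
expected_phi = sum(q[y] * phi(y) for y in sequences)
print(expected_phi, expected_phi <= b)  # -1.5 True, so this q is in Q_x
```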

Adding Slack

  • We can allow some slack using slack variables:$$Q = \set{ q(Y) \midd \exists\Bs{\xi},\; \exx{q}{\phi(X, Y)} - \Mb{b} \leq \Bs{\xi},\; \norm{\Bs{\xi}}_\beta \leq \epsilon }$$ for some norm $\norm{\cdot}_\beta$. We will mainly use this slack-constrained formulation.
  • Alternatively, instead of a fixed bound $\epsilon$ on the slack norm, we can penalize the slack norm in the objective function:$$L(\theta) - \crab{\begin{aligned} \min_{q, \Bs{\xi}} &&& \kl{q(Y)}{p_\theta(Y \mid X)} + \sigma\norm{\Bs{\xi}}_\beta \\ \text{s.t.} &&& \exx{q}{\phi(X, Y)} - \Mb{b} \leq \Bs{\xi} \end{aligned}}$$ This slack-penalized version works similarly to the slack-constrained version.

Computing the Posterior Regularizer

  • Let's write the posterior regularizer (with slack) as a convex optimization problem:

    $$\begin{aligned} \min_{q, \Bs{\xi}} &&& \kl{q(Y)}{p_\theta(Y \mid X)} \\ \text{s.t.} &&& \exx{q}{\phi(X, Y)} - \Mb{b} \leq \Bs{\xi} \\ &&& \norm{\Bs{\xi}}_\beta \leq \epsilon \end{aligned}$$
  • After some algebra (forming the Lagrangian and taking the dual), we obtain the dual problem:

    $$\max_{\Bs{\lambda} \geq 0}\quad - \Mb{b}\T\Bs{\lambda} - \log Z(\Bs{\lambda}) - \epsilon \norm{\Bs{\lambda}}_{\beta^*}$$

    where

    $$Z(\Bs{\lambda}) = \sum_Y p_\theta(Y \mid X) \exp\crab{- \Bs{\lambda}\T\phi(X, Y)}$$

    and $\norm{\cdot}_{\beta^*}$ is the dual norm of $\norm{\cdot}_{\beta}$.

  • After getting the dual solution $\Bs{\lambda}^*$, the unique primal solution $q^*$ can be computed as

    $$q^*(Y) = \frac{1}{Z(\Bs{\lambda}^*)} p_\theta(Y \mid X) \exp\crab{- {\Bs{\lambda}^*}\T \phi(X, Y)}$$
  • If $\phi$ can be written as a sum of local features, the expectation $\exx{q}{\phi(X, Y)}$ needed for the dual gradient can be computed with the model's dynamic-programming inference, so $\Bs{\lambda}$ can be optimized by projected gradient ascent (see the sketch at the end of this section).

  • If $\phi$ does not decompose this way, exact inference may be intractable, and one would have to fall back on approximate inference such as beam search.
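  • A minimal numerical sketch of this computation for the case where the joint labelings $Y$ can be enumerated, assuming the 2-norm for the slack (so $\norm{\cdot}_{\beta^*}$ is also the 2-norm) and plain projected gradient ascent on $\Bs{\lambda} \geq 0$; the function and variable names are mine, not the paper's:

```python
import numpy as np

def kl_project_onto_Q(p, phi, b, eps, lr=0.1, steps=2000):
    """Solve min_{q in Q} KL(q || p_theta(Y | X)) via the dual, for enumerable Y.

    p   : (m,) array, p_theta(Y | X) for each of the m joint labelings Y (all > 0)
    phi : (m, k) array of constraint features phi(X, Y)
    b   : (k,) array of bounds; eps : slack bound (2-norm slack assumed)
    Returns the dual solution lambda* and the primal solution q*.
    """
    def q_of(lam):
        # q_lam(Y) is proportional to p_theta(Y | X) * exp(-lam . phi(X, Y))
        logits = np.log(p) - phi @ lam
        q = np.exp(logits - logits.max())
        return q / q.sum()

    lam = np.zeros(phi.shape[1])
    for _ in range(steps):
        q = q_of(lam)
        # dual (sub)gradient: -b + E_q[phi] - eps * lam / ||lam||_2
        grad = -b + q @ phi - eps * lam / (np.linalg.norm(lam) + 1e-12)
        lam = np.maximum(lam + lr * grad, 0.0)  # projection keeps lam >= 0
    return lam, q_of(lam)
```

    For structured models one replaces the explicit enumeration with the model's own inference (e.g. forward-backward for an HMM), as noted above.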

Optimizing the Posterior-Regularized Log-Likelihood

  • Consider $L(\theta)$, ignoring the labeled data and the prior for now. We have$$\begin{aligned} L(\theta) &= \log p_\theta(X) \\ & = \log \sum_Y p_\theta(X, Y) \end{aligned}$$
  • To optimize $L(\theta)$ with EM, we start by lower-bounding $L(\theta)$:$$\begin{aligned} L(\theta) & = \log\crab{\sum_Y q(Y)\,\frac{p_\theta(X, Y)}{q(Y)}} &&\text{(multiply and divide by }q(Y)\text{)} \\ & \geq \sum_Y q(Y) \log\crab{\frac{p_\theta(X, Y)}{q(Y)}} &&\text{(Jensen!)} \\ & =: F(q, \theta) &&\text{(Let's name it)} \end{aligned}$$
  • We maximize this bound $F(q, \theta)$ by coordinate ascent: maintain an estimate $(q, \theta)$ of the solution and update $q$ and $\theta$ in alternation.
  • E-step Update $q$:$$q_\text{new} = \argmax_q F(q, \theta_\text{old})$$ To compute this, we write$$\begin{aligned} F(q, \theta) & = \sum_Y q(Y) \log\crab{\frac{p_\theta(X, Y)}{q(Y)}} \\ & = \sum_Y q(Y) \log\crab{p_\theta(X) p_\theta(Y \mid X)} - \sum_Y q(Y) \log{q(Y)} \\ & = \log p_\theta(X) - \sum_Y q(Y) \log\crab{\frac{q(Y)}{p_\theta(Y \mid X)}} \qquad\qquad\nail{\because\sum_Y q(Y) = 1} \\ & = L(\theta) - \kl{q(Y)}{p_\theta(Y \mid X)} \end{aligned}$$ Since $L(\theta)$ does not depend on $q$, we get$$q_\text{new}= \argmin_q \kl{q(Y)}{p_\theta(Y \mid X)}$$ Right now $q$ is unconstrained, so we get $q_\text{new} = p_\theta(Y \mid X)$.
  • M-step Update $\theta$:$$\theta_\text{new} = \argmax_\theta F(q_\text{old}, \theta)$$ To compute this, we write$$\begin{aligned} F(q, \theta) & = \sum_Y q(Y) \log\crab{\frac{p_\theta(X, Y)}{q(Y)}} \\ & = \sum_Y q(Y) \log p_\theta(X, Y) - \sum_Y q(Y) \log q(Y) \\ & = \exx{q}{\log p_\theta (X, Y)} - \exx{q}{\log q(Y)} \end{aligned}$$ Since $\exx{q}{\log q(Y)}$ does not depend on $\theta$, we get$$\theta_\text{new} = \argmax_\theta \exx{q_\text{old}}{\log p_\theta(X, Y)}$$ This can be computed by "soft-counting" (for simple models) or by a gradient update (for neural networks).
  • Now let's add posterior regularization (essentially a constraint on $q$):$$J_Q(\theta) = \max_{q \in Q} F(q, \theta)$$ In the E-step, we can no longer jump from $\argmin_q \Mr{KL}(\dots)$ to $p_\theta (Y \mid X)$; instead we compute the argmin with the method from the "Computing the Posterior Regularizer" section above, as sketched below.
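  • Putting the pieces together, the overall algorithm is ordinary EM with the E-step replaced by the KL projection onto $Q$. A skeleton (the callables `e_step_project` and `m_step` are placeholders of my own, not the paper's interface):

```python
def pr_em(theta_init, e_step_project, m_step, num_iters=20):
    """Coordinate ascent on F(q, theta) with posterior regularization (a sketch).

    e_step_project(theta) -> q : solves min_{q in Q} KL(q || p_theta(Y | X)),
                                 e.g. via kl_project_onto_Q above
    m_step(q) -> theta         : maximizes E_q[log p_theta(X, Y)]
                                 (expected counts for an HMM, or a gradient step)
    """
    theta = theta_init
    for _ in range(num_iters):
        q = e_step_project(theta)  # modified E-step: KL projection onto Q
        theta = m_step(q)          # standard M-step, under the projected q
    return theta
```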

Discriminative Model

  • We can also apply posterior regularization to discriminative models.
  • The likelihood only has the supervised part:$$L(\theta) = \log p_\theta (Y_L \mid X_L) + \log p(\theta)$$
  • The regularized likelihood is still$$J_Q(\theta) = L(\theta) - \kl{Q}{p_\theta(Y \mid X)}$$
  • The EM algorithm for optimizing $J_Q(\theta)$ is similar, with the M-step changed to$$\theta_\text{new} = \argmax_\theta \exx{q_\text{old}}{\log p_\theta(Y \mid X)}$$ (The difference is having $p_\theta(Y \mid X)$ instead of $p_\theta(X, Y)$.)
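  • As an illustration of this modified M-step, here is the gradient of $\exx{q}{\log p_\theta(y \mid x)}$ for a single input under a toy softmax classifier $p_\theta(y \mid x) \propto \exp(\theta_y\T f(x))$ (a toy model of my own, not the paper's structured case):

```python
import numpy as np

def soft_label_grad(theta, x_feats, q):
    """Gradient of E_q[log p_theta(y | x)] for one input x under a softmax model.

    theta   : (C, d) class weight matrix
    x_feats : (d,)   feature vector f(x)
    q       : (C,)   projected posterior q(y | x) from the E-step
    """
    scores = theta @ x_feats
    p = np.exp(scores - scores.max())
    p /= p.sum()
    # d/d theta_c of E_q[log p_theta(y | x)] = (q_c - p_c) * f(x)
    return np.outer(q - p, x_feats)
```

    Summing this over the corpus and taking gradient ascent steps implements the M-step for this toy model.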