Stochastic Variance Reduced Gradient

Stochastic Variance Reduced Gradient (SVRG) is a method that makes stochastic gradient descent converge faster by reducing the variance of its gradient estimates.

Gradient Descent

We want to minimize the objective function $$P(w) = \frac{1}{n} \sum_{i=1}^n \psi_i(w)$$

The simplest method is batch gradient descent: $$w^{(t)} = w^{(t-1)} - \frac{\eta_t}{n}\sum_{i=1}^n\nabla \psi_i(w^{(t-1)})$$ Under good conditions (e.g., each $\psi_i$ is smooth and convex and $P$ is strongly convex), we can choose a constant step size $\eta$ and get a convergence rate of $O(c^T)$ for some $c < 1$ (i.e., we need $O(\log(1/\epsilon))$ iterations to reach error $\epsilon$).
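
As a concrete illustration, here is a minimal sketch of batch gradient descent on a synthetic least-squares problem with $\psi_i(w) = \frac{1}{2}(x_i^\top w - y_i)^2$; the data, step size, and iteration count are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d)

def full_gradient(w):
    # (1/n) * sum_i grad psi_i(w) for psi_i(w) = 0.5 * (x_i @ w - y_i)^2
    return X.T @ (X @ w - y) / n

w = np.zeros(d)
eta = 0.1  # constant step size; P here is smooth and strongly convex
for t in range(200):
    w = w - eta * full_gradient(w)

print(np.linalg.norm(full_gradient(w)))  # decays geometrically, i.e., O(c^T)
```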

Note The rate of $O(c^T)$ is usually called linear convergence because the error looks like a straight line on a semi-log plot. Who in the world named this!

However, the full sum is expensive to compute. To save time, we can use stochastic gradient descent (SGD): $$w^{(t)} = w^{(t-1)} - \eta_t\nabla \psi_{i_t}(w^{(t-1)})$$ where $i_t \in \{1,\dots,n\}$ is chosen uniformly at random in each iteration.

The stochastic gradient $\nabla \psi_{i_t}(w^{(t-1)})$ is an unbiased estimate of $\nabla P(w^{(t-1)})$, but it introduces variance. To control the variance, we have to use a decreasing learning rate (e.g., $\eta_t = O(1/t)$) and get a sublinear convergence rate of $O(1/T)$ for strongly convex functions (i.e., we need $O(1/\epsilon)$ iterations to reach error $\epsilon$) and $O(1/\sqrt{T})$ for general smooth functions.
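
To make this concrete, here is a minimal SGD sketch on a noisy synthetic least-squares problem; the noise level and the step-size schedule $\eta_t = 1/(10 + t)$ are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)  # noise => nonzero variance at the optimum
w_star, *_ = np.linalg.lstsq(X, y, rcond=None)                 # reference minimizer of P

def grad_psi(i, w):
    # gradient of psi_i(w) = 0.5 * (x_i @ w - y_i)^2
    return (X[i] @ w - y[i]) * X[i]

w = np.zeros(d)
for t in range(20000):
    i = rng.integers(n)        # pick i_t uniformly at random
    eta_t = 1.0 / (10 + t)     # decreasing learning rate to control the variance
    w = w - eta_t * grad_psi(i, w)

print(np.linalg.norm(w - w_star))  # converges, but only at a sublinear rate
```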

Stochastic Variance Reduced Gradient

In SVRG, we maintain an estimate $\tilde w$ of the optimal $w$ (e.g., let $\tilde w$ be the average of the iterates $w^{(t)}$ over the past $m$ iterations). We also compute the full gradient at $\tilde w$: $$\tilde \mu = \frac{1}{n}\sum_{i=1}^n \nabla \psi_i(\tilde w)$$ The sum is expensive, but we only recompute it occasionally, every $m$ iterations (e.g., choose $m = 2n$ for convex problems and $m = 5n$ for non-convex problems).

Note that $\mathbb{E}_i\left[\nabla \psi_i (\tilde w) - \tilde \mu\right] = 0$. We now change the SGD update to $$w^{(t)} = w^{(t-1)} - \eta_t\left[\nabla \psi_{i_t}(w^{(t-1)}) - \nabla \psi_{i_t}(\tilde w) + \tilde \mu\right]$$ The mean of the bracketed term is still $\nabla P(w^{(t-1)})$, while its variance shrinks: as both $w^{(t-1)}$ and $\tilde w$ approach $w^*$, $$\nabla \psi_{i_t}(w^{(t-1)}) - \nabla \psi_{i_t}(\tilde w) + \tilde \mu \to \nabla \psi_{i_t}(w^{(t-1)}) - \nabla \psi_{i_t}(w^*) + 0 \to 0$$
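
Here is a minimal sketch of the resulting algorithm on a synthetic least-squares problem, with an outer loop that takes the snapshot $\tilde w$ and computes $\tilde \mu$, and an inner loop of corrected SGD steps; the constants ($\eta = 0.05$, $m = 2n$, 20 outer stages) are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
w_star, *_ = np.linalg.lstsq(X, y, rcond=None)   # reference minimizer of P

def grad_psi(i, w):
    # gradient of psi_i(w) = 0.5 * (x_i @ w - y_i)^2
    return (X[i] @ w - y[i]) * X[i]

def full_gradient(w):
    return X.T @ (X @ w - y) / n

eta = 0.05                  # constant step size works thanks to variance reduction
m = 2 * n                   # snapshot interval (the convex-case choice above)
w = np.zeros(d)
for stage in range(20):                 # outer loop
    w_snapshot = w.copy()               # \tilde w
    mu = full_gradient(w_snapshot)      # \tilde mu: the one expensive full gradient
    for t in range(m):                  # inner loop: cheap corrected SGD steps
        i = rng.integers(n)
        g = grad_psi(i, w) - grad_psi(i, w_snapshot) + mu
        w = w - eta * g

print(np.linalg.norm(w - w_star))       # should be very close to zero
```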

Note Equivalently, we can define $$\tilde \psi_i(w) = \psi_i(w) - (\nabla \psi_i(\tilde w) - \tilde \mu)^\top w$$ and perform SGD on the $\tilde \psi_i$.
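
Taking the gradient of $\tilde \psi_i$ recovers exactly the variance-reduced direction used in the update above: $$\nabla \tilde \psi_i(w) = \nabla \psi_i(w) - \nabla \psi_i(\tilde w) + \tilde \mu$$ and in particular $\mathbb{E}_i\left[\nabla \tilde \psi_i(w)\right] = \nabla P(w)$.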

With a constant $\eta$, SVRG achieves a convergence rate of $O(c^T)$ (with respect to the number of gradient computations $T$) for strongly convex functions and $O(1/T)$ for general smooth functions. Even for non-convex functions (e.g., neural networks), SVRG can still converge well if the initialization is not too far off.

Comparison to Other Methods

There are several other methods for reducing the variance:
