
FFVB: Control variate

This section describes a control variate technique for variance reduction.

See: Variational Bayes Introduction, Fixed Form Variational Bayes


Control variate

As is typical of stochastic optimization algorithms, the performance of Algorithm 3 depends greatly on the variance of the noisy gradient. Variance reduction for the noisy gradient is a key ingredient in FFVB algorithms.

Let $\theta_s\sim q_\lambda(\theta)$, $s=1,\dots,S$, be $S$ samples from the variational distribution $q_\lambda(\theta)$. A naive estimator of the $i$th element of the vector $\nabla_\lambda\text{LB}(\lambda)$ is
$$\widehat{\nabla_{\lambda_i}\text{LB}}(\lambda)^{\text{naive}}=\frac{1}{S}\sum_{s=1}^S\nabla_{\lambda_i}[\log q_\lambda(\theta_s)]\times h_\lambda(\theta_s),$$
whose variance is often too large to be useful.

For any number $c_i$, consider

$$\widehat{\nabla_{\lambda_i}\text{LB}}(\lambda)=\frac{1}{S}\sum_{s=1}^S\nabla_{\lambda_i}[\log q_\lambda(\theta_s)]\big(h_\lambda(\theta_s)-c_i\big),\tag{21}$$

which is still an unbiased estimator of $\nabla_{\lambda_i}\text{LB}(\lambda)$, since $E\big(\nabla_\lambda[\log q_\lambda(\theta)]\big)=0$, and whose variance can be greatly reduced by an appropriate choice of the control variate $c_i$.
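The unbiasedness claim rests on the log-derivative identity, which is worth spelling out (assuming differentiation and integration can be interchanged):

$$E_{q_\lambda}\big(\nabla_\lambda[\log q_\lambda(\theta)]\big)=\int\frac{\nabla_\lambda q_\lambda(\theta)}{q_\lambda(\theta)}\,q_\lambda(\theta)\,d\theta=\nabla_\lambda\int q_\lambda(\theta)\,d\theta=\nabla_\lambda 1=0.$$

Hence subtracting the constant $c_i$ in (21) changes the estimator by $c_i\,E\big(\nabla_{\lambda_i}[\log q_\lambda(\theta)]\big)=0$ in expectation.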

The variance of $\widehat{\nabla_{\lambda_i}\text{LB}}(\lambda)$ is

$$\frac{1}{S}V\big(\nabla_{\lambda_i}[\log q_\lambda(\theta)]h_\lambda(\theta)\big)+\frac{c_i^2}{S}V\big(\nabla_{\lambda_i}[\log q_\lambda(\theta)]\big)-\frac{2c_i}{S}\mathrm{cov}\big(\nabla_{\lambda_i}[\log q_\lambda(\theta)]h_\lambda(\theta),\nabla_{\lambda_i}[\log q_\lambda(\theta)]\big).$$

The optimal $c_i$ that minimizes this variance is

$$c_i=\mathrm{cov}\big(\nabla_{\lambda_i}[\log q_\lambda(\theta)]h_\lambda(\theta),\nabla_{\lambda_i}[\log q_\lambda(\theta)]\big)\big/V\big(\nabla_{\lambda_i}[\log q_\lambda(\theta)]\big).\tag{22}$$
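Equation (22) follows by minimizing the variance above, which is a convex quadratic in $c_i$: setting its derivative with respect to $c_i$ to zero gives

$$\frac{2c_i}{S}V\big(\nabla_{\lambda_i}[\log q_\lambda(\theta)]\big)-\frac{2}{S}\mathrm{cov}\big(\nabla_{\lambda_i}[\log q_\lambda(\theta)]h_\lambda(\theta),\nabla_{\lambda_i}[\log q_\lambda(\theta)]\big)=0,$$

and solving for $c_i$ yields (22).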

Then

$$V\big(\widehat{\nabla_{\lambda_i}\text{LB}}(\lambda)\big)=V\big(\widehat{\nabla_{\lambda_i}\text{LB}}(\lambda)^{\text{naive}}\big)(1-\rho_i^2)\leq V\big(\widehat{\nabla_{\lambda_i}\text{LB}}(\lambda)^{\text{naive}}\big),$$

where $\rho_i$ is the correlation between $\nabla_{\lambda_i}[\log q_\lambda(\theta)]h_\lambda(\theta)$ and $\nabla_{\lambda_i}[\log q_\lambda(\theta)]$. Often, $\rho_i^2$ is very close to 1, which leads to a large variance reduction.
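As a concrete illustration (an assumed toy setup, not from the original text), the following sketch compares the naive estimator with the control-variate estimator (21) for $q_\lambda=\mathcal N(\mu,1)$ and target $p=\mathcal N(0,1)$, so that $h_\lambda(\theta)=\log p(\theta)-\log q_\lambda(\theta)$; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 2.0      # current variational mean, q_lambda = N(mu, 1)
S = 20        # samples per gradient estimate
R = 2000      # replications used to measure estimator variance

def h_lambda(theta, mu):
    # log N(theta; 0, 1) - log N(theta; mu, 1); normalizing constants cancel
    return -0.5 * theta**2 + 0.5 * (theta - mu)**2

naive, cv = [], []
for _ in range(R):
    # independent batch to estimate c, keeping the gradient estimator unbiased
    t0 = rng.normal(mu, 1.0, S)
    score0 = t0 - mu                      # d/dmu log q_lambda(theta) = theta - mu
    f0 = score0 * h_lambda(t0, mu)
    c = np.cov(f0, score0)[0, 1] / np.var(score0, ddof=1)   # as in (22)

    # fresh batch for the gradient estimate itself
    t = rng.normal(mu, 1.0, S)
    score = t - mu
    naive.append(np.mean(score * h_lambda(t, mu)))
    cv.append(np.mean(score * (h_lambda(t, mu) - c)))       # estimator (21)

print(np.var(naive), np.var(cv))   # the control-variate variance is clearly smaller
```

Both estimators target the same true gradient (here $\text{LB}(\mu)=-\mu^2/2$, so the gradient is $-\mu$); only their variances differ.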

One can estimate the numbers $c_i$ in (22) using samples $\theta_s\sim q_\lambda(\theta)$. To ensure the unbiasedness of the gradient estimator, the samples used to estimate $c_i$ must be independent of the samples used to estimate the gradient.

In practice, the $c_i$ can be updated sequentially as follows. At iteration $t$, we use the $c_i$ computed at the previous iteration $t-1$, i.e. based on the samples from $q_{\lambda^{(t-1)}}(\theta)$, to form the gradient estimate $\widehat{\nabla_\lambda\text{LB}}(\lambda^{(t)})$, which is computed using new samples from $q_{\lambda^{(t)}}(\theta)$.

We then update the $c_i$ using this new set of samples. This guarantees unbiasedness while requiring no extra samples to update the control variates $c_i$.

Algorithm 4 provides detailed pseudo-code for the FFVB approach that combines control variates for variance reduction with the moving-average adaptive learning rate, and Algorithm 5 implements the FFVB approach that combines control variates with the natural gradient.

Algorithm 4: FFVB with control variates and adaptive learning

  • Input: Initial $\lambda^{(0)}$, adaptive-learning weights $\beta_1,\beta_2\in(0,1)$, fixed learning rate $\epsilon_0$, threshold $\tau$, rolling-window size $t_W$ and maximum patience $P$. Model-specific requirement: function $h(\theta):=\log\big(p(\theta)p(y|\theta)\big)$.
  • Initialization
    • Generate $\theta_s\sim q_{\lambda^{(0)}}(\theta)$, $s=1,\dots,S$.
    • Compute the unbiased estimate of the LB gradient

      $$\widehat{\nabla_\lambda\text{LB}}(\lambda^{(0)}):=\frac{1}{S}\sum_{s=1}^S\nabla_\lambda\log q_\lambda(\theta_s)\times h_\lambda(\theta_s)\Big|_{\lambda=\lambda^{(0)}}.$$
    • Set $g_0:=\widehat{\nabla_\lambda\text{LB}}(\lambda^{(0)})$, $v_0:=(g_0)^2$, $\bar g:=g_0$, $\bar v:=v_0$.
    • Estimate the vector of control variates $c$ as in (22) using the samples $\theta_s$, $s=1,\dots,S$.
    • Set t=0, patience=0 and stop=false.
  • While stop=false:
    • Generate $\theta_s\sim q_{\lambda^{(t)}}(\theta)$, $s=1,\dots,S$.
    • Compute the unbiased estimate of the LB gradient

      $$g_t:=\widehat{\nabla_\lambda\text{LB}}(\lambda^{(t)})=\frac{1}{S}\sum_{s=1}^S\nabla_\lambda\log q_\lambda(\theta_s)\big(h_\lambda(\theta_s)-c\big)\Big|_{\lambda=\lambda^{(t)}}.$$
    • Estimate the new control variate vector $c$ as in (22) using the samples $\theta_s$, $s=1,\dots,S$.
    • Compute $v_t=(g_t)^2$ and

      $$\bar g=\beta_1\bar g+(1-\beta_1)g_t,\qquad \bar v=\beta_2\bar v+(1-\beta_2)v_t.$$
    • Compute $\alpha_t=\min(\epsilon_0,\epsilon_0\tau/t)$ and update

      $$\lambda^{(t+1)}=\lambda^{(t)}+\alpha_t\,\bar g/\sqrt{\bar v}.$$
    • Compute the lower bound estimate

      $$\widehat{\text{LB}}(\lambda^{(t)}):=\frac{1}{S}\sum_{s=1}^S h_{\lambda^{(t)}}(\theta_s).$$
    • If $t\geq t_W$: compute the moving averaged lower bound

      $$\overline{\text{LB}}_{t-t_W+1}=\frac{1}{t_W}\sum_{k=1}^{t_W}\widehat{\text{LB}}(\lambda^{(t-k+1)}),$$

      and if $\overline{\text{LB}}_{t-t_W+1}\geq\max(\overline{\text{LB}})$ then patience = 0; else patience:=patience+1.

    • If patience $\geq P$, stop=true.
    • Set t:=t+1.

Note: The term $\nabla_\lambda\log q_\lambda(\theta_s)\big(h_\lambda(\theta_s)-c\big)$ should be understood component-wise, i.e. it is the vector whose $i$th element is $\nabla_{\lambda_i}\log q_\lambda(\theta_s)\times\big(h_\lambda(\theta_s)-c_i\big)$.
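A minimal runnable sketch of Algorithm 4 (an illustrative toy implementation, not the authors' code) can be written for the simple case $p(\theta)=\mathcal N(0,1)$ with no data term and $q_\lambda=\mathcal N(\mu,1)$, so that $\lambda=\mu$ is a scalar; every hyperparameter value below is an arbitrary choice.

```python
import numpy as np

# Toy instance of Algorithm 4: p(theta) = N(0,1), no data, q_lambda = N(mu, 1).

def h(theta):                          # h(theta) := log p(theta) here (no likelihood)
    return -0.5 * theta**2 - 0.5 * np.log(2 * np.pi)

def log_q(theta, mu):
    return -0.5 * (theta - mu)**2 - 0.5 * np.log(2 * np.pi)

def estimate_c(score, h_lam):          # control variate, as in (22)
    return np.cov(score * h_lam, score)[0, 1] / np.var(score, ddof=1)

rng = np.random.default_rng(1)
mu = 5.0                               # lambda^{(0)}
S, beta1, beta2 = 50, 0.9, 0.9         # samples per iteration, adaptive weights
eps0, tau, t_W, P = 0.1, 1000, 10, 20  # step size, threshold, window, patience

# Initialization: naive gradient (no control variate yet)
theta = rng.normal(mu, 1.0, S)
score, h_lam = theta - mu, h(theta) - log_q(theta, mu)
g = np.mean(score * h_lam)
g_bar, v_bar = g, g**2
c = estimate_c(score, h_lam)

t, patience, lb_hist, best_ma = 0, 0, [], -np.inf
while patience < P and t < 2000:
    theta = rng.normal(mu, 1.0, S)
    score, h_lam = theta - mu, h(theta) - log_q(theta, mu)
    g = np.mean(score * (h_lam - c))   # gradient using the previous iteration's c
    c = estimate_c(score, h_lam)       # then update c with this iteration's samples
    g_bar = beta1 * g_bar + (1 - beta1) * g
    v_bar = beta2 * v_bar + (1 - beta2) * g**2
    lb_hist.append(np.mean(h_lam))     # lower bound estimate LB-hat(lambda^{(t)})
    alpha = min(eps0, eps0 * tau / (t + 1))
    mu = mu + alpha * g_bar / np.sqrt(v_bar)
    if t >= t_W:                       # moving-average lower bound and patience
        ma = np.mean(lb_hist[-t_W:])
        if ma >= best_ma:
            best_ma, patience = ma, 0
        else:
            patience += 1
    t += 1

print(round(mu, 2), t)                 # mu should end near the target mean 0
```

With these settings $\mu$ moves from its initialization toward the target mean $0$, the moving-averaged lower bound plateaus, and the patience counter triggers the stop.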

Algorithm 5: FFVB with control variates and natural gradient

  • Input: Initial $\lambda^{(0)}$, momentum weight $\alpha_m$, fixed learning rate $\epsilon_0$, threshold $\tau$, rolling-window size $t_W$ and maximum patience $P$. Model-specific requirement: function $h(\theta):=\log\big(p(\theta)p(y|\theta)\big)$.
  • Initialization
    • Generate $\theta_s\sim q_{\lambda^{(0)}}(\theta)$, $s=1,\dots,S$.
    • Compute the unbiased estimate of the LB gradient

      $$\widehat{\nabla_\lambda\text{LB}}(\lambda^{(0)}):=\frac{1}{S}\sum_{s=1}^S\nabla_\lambda\log q_\lambda(\theta_s)\times h_\lambda(\theta_s)\Big|_{\lambda=\lambda^{(0)}},$$

      and the natural gradient

      $$\widehat{\nabla_\lambda\text{LB}}(\lambda^{(0)})^{\text{nat}}:=I_F^{-1}(\lambda^{(0)})\,\widehat{\nabla_\lambda\text{LB}}(\lambda^{(0)}).$$
    • Set momentum gradient $\overline{\nabla_\lambda\text{LB}}:=\widehat{\nabla_\lambda\text{LB}}(\lambda^{(0)})^{\text{nat}}$.
    • Estimate the control variate vector $c$ as in (22) using the samples $\theta_s$, $s=1,\dots,S$.
    • Set t=0, patience=0 and stop=false.
  • While stop=false:
    • Generate $\theta_s\sim q_{\lambda^{(t)}}(\theta)$, $s=1,\dots,S$.
    • Compute the unbiased estimate of the LB gradient

      $$\widehat{\nabla_\lambda\text{LB}}(\lambda^{(t)})=\frac{1}{S}\sum_{s=1}^S\nabla_\lambda\log q_\lambda(\theta_s)\big(h_\lambda(\theta_s)-c\big)\Big|_{\lambda=\lambda^{(t)}}$$

      and the natural gradient

      $$\widehat{\nabla_\lambda\text{LB}}(\lambda^{(t)})^{\text{nat}}=I_F^{-1}(\lambda^{(t)})\,\widehat{\nabla_\lambda\text{LB}}(\lambda^{(t)}).$$
    • Estimate the new control variate vector $c$ as in (22) using the samples $\theta_s$, $s=1,\dots,S$.
    • Compute the momentum gradient

      $$\overline{\nabla_\lambda\text{LB}}=\alpha_m\overline{\nabla_\lambda\text{LB}}+(1-\alpha_m)\widehat{\nabla_\lambda\text{LB}}(\lambda^{(t)})^{\text{nat}}.$$
    • Compute $\alpha_t=\min(\epsilon_0,\epsilon_0\tau/t)$ and update

      $$\lambda^{(t+1)}=\lambda^{(t)}+\alpha_t\,\overline{\nabla_\lambda\text{LB}}.$$
    • Compute the lower bound estimate

      $$\widehat{\text{LB}}(\lambda^{(t)}):=\frac{1}{S}\sum_{s=1}^S h_{\lambda^{(t)}}(\theta_s).$$
    • If $t\geq t_W$: compute the moving average lower bound

      $$\overline{\text{LB}}_{t-t_W+1}=\frac{1}{t_W}\sum_{k=1}^{t_W}\widehat{\text{LB}}(\lambda^{(t-k+1)}),$$

      and if $\overline{\text{LB}}_{t-t_W+1}\geq\max(\overline{\text{LB}})$ then patience = 0; else patience:=patience+1.

    • If patience $\geq P$, stop=true.
    • Set t:=t+1.
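To make the natural-gradient step concrete, here is a small illustrative sketch (an assumed example, not from the original text): for $q_\lambda=\mathcal N(\mu,\sigma^2)$ with $\lambda=(\mu,\sigma)$, the Fisher information matrix is $I_F(\lambda)=\mathrm{diag}(1/\sigma^2,\,2/\sigma^2)$, so $I_F^{-1}(\lambda)\widehat{\nabla_\lambda\text{LB}}(\lambda)$ simply rescales the ordinary gradient.

```python
import numpy as np

def natural_gradient(grad, sigma):
    """Natural gradient I_F^{-1}(lambda) @ grad for q = N(mu, sigma^2), lambda = (mu, sigma)."""
    I_F = np.diag([1.0 / sigma**2, 2.0 / sigma**2])  # Fisher information of N(mu, sigma^2)
    return np.linalg.solve(I_F, grad)                # solve rather than form the inverse

g = np.array([0.5, -0.2])                            # an ordinary LB gradient estimate
print(natural_gradient(g, sigma=2.0))                # -> [ 2.  -0.4]
```

For richer variational families $I_F$ is the full Fisher matrix of $q_\lambda$, and solving the linear system is generally preferable to computing $I_F^{-1}$ explicitly.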

Next: FFVB with reparameterization trick