Adaptation in gradient-based Markov chain Monte Carlo

What makes MCMC fast?

Colin Carroll

Introduction

The goal of this page is to give visual intuition and best practices for some of the tuning done in gradient-based MCMC algorithms, while providing links to papers or other works that give more rigorous, careful results.

Hamiltonian Trajectories

Given a (differentiable) target probability distribution \(\pi\), an initial position \(\mathbf{q}\), and an initial momentum \(\mathbf{p}\), we may solve a certain system of differential equations to generate Hamiltonian trajectories. The differential equations are $$ \begin{align} \frac{d \mathbf{q}}{dt} &= M^{-1}\mathbf{p} \\ \frac{d \mathbf{p}}{dt} &= \frac{\partial}{\partial \mathbf{q}} \log \pi, \end{align} $$ where \(M\) is the mass matrix that we discuss later.

The underlying distribution here is a mixture of three Gaussians. The trajectory is being integrated for 1,000 timesteps with a fairly small step size.
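To make these equations concrete, here is a minimal sketch of the right-hand side in NumPy. The standard-normal `grad_log_prob` is an illustrative stand-in (the demo above uses a mixture of three Gaussians), and the function names are made up for this sketch.

```python
import numpy as np

def grad_log_prob(q):
    """Gradient of log pi for a standard normal target (illustrative stand-in)."""
    return -q

def hamiltonian_vector_field(q, p, M_inv):
    """Right-hand side of the Hamiltonian ODEs: (dq/dt, dp/dt)."""
    dq_dt = M_inv @ p                # dq/dt = M^{-1} p
    dp_dt = grad_log_prob(q)         # dp/dt = d(log pi)/dq
    return dq_dt, dp_dt

# Example: evaluate the vector field at one (position, momentum) pair.
q = np.array([1.0, -0.5])
p = np.array([0.3, 0.2])
M_inv = np.eye(2)                    # identity mass matrix
print(hamiltonian_vector_field(q, p, M_inv))
```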

Hamiltonian Monte Carlo

If we integrate each trajectory for a fixed length and then resample the momentum, the stationary distribution for the endpoints of the trajectories is \(\pi\), our target distribution. (You can also use a dynamic length if you are careful! See The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo by Hoffman and Gelman. This NUTS algorithm is the basis for libraries like Stan and PyMC3.)

Note that we have not yet talked about how to integrate these trajectories; we are just talking about a world in which we could.
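For concreteness, here is a minimal sketch of that idealized scheme, with the integrator left abstract: `integrate_trajectory` is a hypothetical placeholder for an exact solver of the Hamiltonian equations, and an identity mass matrix is assumed.

```python
import numpy as np

def idealized_hmc(q0, integrate_trajectory, n_draws, trajectory_length, seed=0):
    """Idealized HMC: resample momentum, follow the (exact) trajectory, keep the endpoint.

    `integrate_trajectory(q, p, t)` is a hypothetical exact integrator returning the
    endpoint (q_new, p_new) after time t. With exact integration every endpoint can be
    kept; no accept/reject step is needed yet.
    """
    rng = np.random.default_rng(seed)
    q = np.asarray(q0, dtype=float)
    draws = []
    for _ in range(n_draws):
        p = rng.standard_normal(q.shape)   # momentum ~ N(0, I): identity mass matrix
        q, _ = integrate_trajectory(q, p, trajectory_length)
        draws.append(q.copy())
    return np.array(draws)
```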

This is 20 draws from the same mixture of Gaussians, using a trajectory length of 2. Notice it takes a really energetic momentum draw in the right direction to jump between modes.

The leapfrog integrator

Of course, we need to think about how to solve the differential equations from earlier. As a reminder, $$ \begin{align} \frac{d \mathbf{q}}{dt} &= M^{-1}\mathbf{p} \\ \frac{d \mathbf{p}}{dt} &= \frac{\partial}{\partial \mathbf{q}} \log \pi, \end{align} $$ where \(\mathbf{q}\) is position, \(\mathbf{p}\) is momentum, \(\pi\) is our target distribution, and \(M\) is the mass matrix that we discuss later. We will be using a leapfrog integrator, which is a second-order integrator. Here it is on our toy distribution, compared with Euler's method (in red).

This is the same trajectory from earlier, where black is (close to) the true trajectory, and red uses Euler's method, a first-order integrator. More specifically, black uses a leapfrog integrator, and both use step size 0.05 for 1,000 steps.

We can compare this to a leapfrog integrator with even a very large step size, where the errors will compound more slowly. (The local error per step is \(\mathcal{O}(h^3)\), versus \(\mathcal{O}(h^2)\) for Euler's method, where \(h\) is the step size. Additionally, and arguably more importantly for HMC, the leapfrog integrator is symplectic, so volumes are preserved, the integrator doesn't spiral out to infinity, and the changes in energy tend to stay bounded.) Eventually, though, there are some very obvious differences in the trajectories!

This time the red line also uses a leapfrog integrator, but with a step size of 0.25 for 200 steps, so it is also five times faster to compute. It is also worth reflecting that while integrating the Hamiltonian perfectly means we will accept the proposed state, our goal is to propose distant samples with a high probability of acceptance -- not to integrate the Hamiltonian perfectly!
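For reference, here is a minimal sketch of the leapfrog update itself (a half step on the momentum, full steps on the position and momentum, then a final half step on the momentum), again using the illustrative standard-normal `grad_log_prob` stand-in rather than the article's mixture target.

```python
import numpy as np

def grad_log_prob(q):
    """Gradient of log pi for a standard normal target (illustrative stand-in)."""
    return -q

def leapfrog(q, p, M_inv, step_size, n_steps):
    """Integrate the Hamiltonian ODEs with the leapfrog (Stormer-Verlet) scheme."""
    q, p = q.copy(), p.copy()
    p += 0.5 * step_size * grad_log_prob(q)        # initial half step for momentum
    for _ in range(n_steps - 1):
        q += step_size * (M_inv @ p)               # full step for position
        p += step_size * grad_log_prob(q)          # full step for momentum
    q += step_size * (M_inv @ p)                   # final full step for position
    p += 0.5 * step_size * grad_log_prob(q)        # final half step for momentum
    return q, p
```

Calling `leapfrog(q, p, np.eye(2), 0.05, 1000)` would mirror the step size and step count used for the black trajectory in the figures above, though on this toy target rather than the mixture.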

Choosing a step size

We ignore here how long to integrate (an active area of research! NUTS is commonly used, and has a dynamic trajectory length, but ends up costing about twice as much as HMC with a well-chosen fixed trajectory length; other approaches include ChEES-HMC, which seeks to automatically adapt HMC's trajectory length, and empirical HMC, which draws trajectory lengths from a fixed distribution), and focus instead on the step size being used. It turns out that the cost of HMC scales linearly with the number of steps, so 20 steps of size 0.1 will take 10 times as long as 2 steps of size 1. There is less accuracy with larger steps (perhaps disastrously so! Geometric integrators and the Hamiltonian Monte Carlo method by Bou-Rabee and Sanz-Serna is a terrific survey), so a Metropolis-Hastings correction is necessary.

In this demo, trajectories are integrated using step sizes of 0.2, 0.5, and 1.0, from left to right. The frame rate reflects the computation time for integrating each trajectory. As a technical note: this demo does not use a Metropolis correction, but each figure is using the same random stream, so the samples should be identical if the integration was perfect.
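To make the correction concrete, here is a minimal sketch of one HMC transition that wraps the `leapfrog` sketch above in a Metropolis-Hastings accept/reject step based on the change in total energy. The target (`log_prob`, a standard normal) and parameter names are illustrative assumptions, not the article's demo code.

```python
import numpy as np

def log_prob(q):
    """Log density of a standard normal target, up to a constant (illustrative)."""
    return -0.5 * np.sum(q ** 2)

def hmc_step(q, step_size, n_steps, M_inv, chol_M, rng):
    """One HMC transition: draw momentum, integrate with leapfrog, accept or reject."""
    p = chol_M @ rng.standard_normal(q.shape)                  # momentum ~ N(0, M)
    start_energy = -log_prob(q) + 0.5 * p @ (M_inv @ p)        # potential + kinetic
    q_new, p_new = leapfrog(q, p, M_inv, step_size, n_steps)   # from the sketch above
    new_energy = -log_prob(q_new) + 0.5 * p_new @ (M_inv @ p_new)
    accept_prob = min(1.0, np.exp(start_energy - new_energy))  # Metropolis-Hastings ratio
    if rng.uniform() < accept_prob:
        return q_new, accept_prob
    return q, accept_prob
```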

Balancing step size and integration error is an exercise in stochastic root finding (using something like Robbins-Monro, as in Andrieu and Thoms, or a dual-averaging scheme, as in the NUTS paper), with a goal of accepting around 65% of proposals for HMC, and 80% for NUTS. If your posterior is Gaussian or nearly Gaussian, with covariance eigenvalues \(\lambda_1^2, \ldots, \lambda_N^2 > 0\), A Condition Number for Hamiltonian Monte Carlo by Langmore, Dikovsky, Geraedts, Norgaard, and Von Behren gives the pleasant formula $$ \left(\sum_{n=1}^N\left(\frac{\lambda_1}{\lambda_n}\right)^4\right)^{1/4} \approx \frac{\lambda_1}{h} 2^{7/4}\sqrt{\Phi^{-1}\left(1 - 0.5 P[\text{accept}]\right)}, $$ where \(h\) is the step size, and \(\Phi^{-1}\) is the inverse normal CDF.
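To give a flavor of such adaptation, here is a toy Robbins-Monro-style update that nudges the log step size toward a target acceptance rate. It is a deliberately simplified sketch: real implementations like Stan and PyMC3 use the dual-averaging scheme from the NUTS paper, and the names and constants here are made up for illustration.

```python
import numpy as np

def adapt_step_size(initial_step_size, acceptance_probs, target_accept=0.65):
    """Toy Robbins-Monro update: raise the step size when acceptance is above target,
    lower it when acceptance is below target, with a decaying learning rate."""
    log_step = np.log(initial_step_size)
    for t, accept_prob in enumerate(acceptance_probs, start=1):
        learning_rate = 1.0 / (t + 10.0)                     # decaying gain
        log_step += learning_rate * (accept_prob - target_accept)
        yield np.exp(log_step)

# Example: feed in the acceptance probabilities returned by hmc_step.
fake_accepts = [0.99, 0.95, 0.2, 0.5, 0.7, 0.8]
print(list(adapt_step_size(0.1, fake_accepts)))
```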

Another way to think about integration error

Just for fun, we can consider trying to integrate along a trajectory of fixed length, while steadily increasing the step size, and watching the integration error. It is worth mentioning that the total energy (log probability of the position plus log probability of the momentum) along a Hamiltonian trajectory should remain constant. If the total energy at a point drops more than 500 below its value at the start, software like Stan, PyMC3, and TensorFlow Probability will label the trajectory a divergence and short-circuit integration, rejecting the proposal.

Specifically, the trajectory in black here has length 10, and we start with 100 steps of size 0.1, then steadily increase the step size and decrease the number of steps, until at the end we take 16 steps of size 0.6 each. Notice that the total error gets pretty egregious at the end, but the first few steps remain fairly accurate.
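Here is a minimal sketch of that divergence check, reusing the illustrative `log_prob` and `grad_log_prob` stand-ins from the earlier sketches; the threshold of 500 follows the text.

```python
import numpy as np

ENERGY_DROP_THRESHOLD = 500.0   # threshold quoted in the text

def integrate_with_divergence_check(q, p, M_inv, step_size, n_steps):
    """Leapfrog integration that bails out early if the total energy
    (log prob of position plus log prob of momentum, as in the text)
    drops more than the threshold below its starting value."""
    def total_energy(q, p):
        return log_prob(q) - 0.5 * p @ (M_inv @ p)   # log_prob from the HMC sketch above

    start = total_energy(q, p)
    for _ in range(n_steps):
        # one leapfrog step, reusing the grad_log_prob stand-in from earlier
        p = p + 0.5 * step_size * grad_log_prob(q)
        q = q + step_size * (M_inv @ p)
        p = p + 0.5 * step_size * grad_log_prob(q)
        if total_energy(q, p) < start - ENERGY_DROP_THRESHOLD:
            return q, p, True      # divergence: short-circuit and reject the proposal
    return q, p, False
```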

Mass matrix adaptation

The mass matrix is the matrix \(M\) from earlier, which we get to choose when we choose a momentum distribution for HMC (the momenta are drawn from a multivariate Gaussian with covariance \(M\); equivalently, \(M^{-1}\) is the matrix appearing in the gradient of the kinetic energy): $$ \begin{align} \frac{d \mathbf{q}}{dt} &= M^{-1}\mathbf{p} \\ \frac{d \mathbf{p}}{dt} &= \frac{\partial}{\partial \mathbf{q}} \log \pi, \end{align} $$ Using a mass matrix is equivalent (I can't believe it took this long to cite Radford Neal) to the change of coordinates \(X \rightarrow LX\), where \(M^{-1} = LL^T\). This transformation is shown below:

This is a Gaussian with diagonal covariance (4, 0.25), smoothly transitioning under \(F(x_1, x_2) = (x_1/2, 2x_2)\) (the inverse of the map \(X \rightarrow LX\) above) to a standard normal in the new coordinate system. The animation is meant to convey the online adaptation that is actually used to estimate the mass matrix.
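As a small numerical check of that equivalence (illustrative only, using the diagonal covariance from the figure): if \(M^{-1} = LL^T\) is set to the target covariance, then mapping draws through \(L^{-1}\) gives, approximately, a standard normal.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fake draws from a Gaussian with diagonal covariance (4, 0.25), as in the figure.
draws = rng.normal(size=(10_000, 2)) * np.array([2.0, 0.5])

M_inv = np.diag([4.0, 0.25])            # inverse mass matrix set to the target covariance
L = np.linalg.cholesky(M_inv)           # M^{-1} = L L^T, so L = diag(2, 0.5)

whitened = draws @ np.linalg.inv(L).T   # apply x -> L^{-1} x to each draw
print(np.round(np.cov(whitened, rowvar=False), 2))   # approximately the identity
```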

The HMC sampler is particularly efficient on a standard normal target distribution, so we try to choose a mass matrix whose equivalent change of coordinates turns our target distribution into something as close to a standard normal as possible.

More explicitly:

Given an estimate of the posterior covariance \(C\), we will set our mass matrix to be \(C^{-1}\).

In fact, this is how mass matrix estimation works (at least in TensorFlow Probability, Stan, and PyMC3): after a few draws with the identity mass matrix, the mass matrix is set to the inverse of the running covariance estimate. Really, there is a windowed scheme: every so often, we throw out earlier draws, and only use more recent ones. See this talk for more detail on that. We spend some portion of our computational budget on estimating the mass matrix and step size, before fixing these two parameters and actually running MCMC. While changing step sizes and momentum distributions, we are not necessarily sampling from the target distribution, so the tuning samples are discarded. On the other hand, this tuning period obviates the need for a "burn-in", since it usually gives the sampler enough of a chance to get near the typical set.

This is again a Gaussian with diagonal covariance (4, 0.25), and we are using HMC to estimate the covariance. After 10 draws, we begin using the inverse of that estimate as a mass matrix. The black ellipse is the estimated covariance, and the points drawn during tuning are marked. The step size here is actually jittered between 0.07 and 0.1 to avoid too much oscillation.
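Here is a simplified sketch of that covariance-based estimate, with made-up names and fake draws: the (regularized) sample covariance of recent warmup draws becomes the inverse mass matrix, and the Cholesky factor of the resulting mass matrix is what we use to draw momenta. Real implementations use windowed, regularized estimators and interleave step size adaptation.

```python
import numpy as np

def estimate_mass_matrix(warmup_draws, window=50, regularization=1e-5):
    """Estimate an inverse mass matrix from the most recent warmup draws.

    Returns (M_inv, chol_M): the inverse mass matrix is the regularized sample
    covariance of recent draws, and chol_M lets us draw momenta ~ N(0, M)."""
    recent = np.asarray(warmup_draws)[-window:]
    cov = np.cov(recent, rowvar=False)                  # running covariance estimate C
    dim = cov.shape[0]
    M_inv = cov + regularization * np.eye(dim)          # inverse mass matrix = C
    M = np.linalg.inv(M_inv)                            # mass matrix = C^{-1}
    chol_M = np.linalg.cholesky(M)
    return M_inv, chol_M

# Example with fake warmup draws from a diagonal Gaussian with variances (4, 0.25).
rng = np.random.default_rng(0)
fake_draws = rng.normal(size=(200, 2)) * np.array([2.0, 0.5])
M_inv, chol_M = estimate_mass_matrix(fake_draws)
print(np.round(M_inv, 2))
```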

As an illustration of what can go wrong, we run the same adaptation scheme on our running mixture-of-Gaussians example, and we see that it ends up adapting to a single mode.

The mixture-of-Gaussians target distribution, with the same setup as above. Note that mass matrix adaptation hurts convergence here, since it specializes the sampler to a single mode. The covariance estimate appears after 10 draws, and is located where the origin happens to be.

Conclusions

This is not especially novel, but it is hopefully a graphical, intuitive demonstration of some of the parameters that are tuned and discussed in gradient-based MCMC. Remember that pictures are not proofs, and that low-dimensional intuition is not always trustworthy in the sorts of high dimensions where we often use HMC or NUTS.

Thanks

Particularly to Ian Langmore, Pavel Sountsov, and Matt Hoffman for patient explanations of much of this.