Introduction
The goal of this page is to give visual intuition and best practices for some of the tuning done in gradient-based MCMC algorithms, while providing links to papers or other works that give more rigorous, careful results.
Hamiltonian Trajectories
Given a (differentiable) target probability distribution \(\pi\), an initial position \(\mathbf{q}\), and an initial momentum \(\mathbf{p}\), we may solve a certain pair of differential equations to generate Hamiltonian trajectories. The differential equations are $$ \begin{align} \frac{d \mathbf{q}}{dt} &= M^{-1}\mathbf{p} \\ \frac{d \mathbf{p}}{dt} &= \frac{\partial}{\partial \mathbf{q}} \log \pi, \end{align} $$ where \(M\) is the mass matrix that we discuss later.
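To make this concrete, here is a minimal sketch in Python of the right-hand side of these equations for a toy two-dimensional Gaussian target. The covariance `target_cov` and all of the function names are mine, made up for illustration, and not from any particular library.

```python
import numpy as np

# A toy two-dimensional Gaussian target: log pi(q) = -0.5 * q^T C^{-1} q + const.
target_cov = np.array([[1.0, 0.8],
                       [0.8, 1.0]])
target_prec = np.linalg.inv(target_cov)  # the precision matrix C^{-1}

def log_prob(q):
    """Unnormalized log density of the toy target."""
    return -0.5 * q @ target_prec @ q

def grad_log_prob(q):
    """Gradient of log pi at the position q."""
    return -target_prec @ q

def hamiltonian_vector_field(q, p, inv_mass):
    """Right-hand side of the Hamiltonian differential equations."""
    dq_dt = inv_mass @ p          # velocity
    dp_dt = grad_log_prob(q)      # force: the gradient of log pi
    return dq_dt, dp_dt
```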
Hamiltonian Monte Carlo
If we integrate each trajectory for a fixed length and then resample the momentum, the stationary distribution for the endpoints of the trajectories is \(\pi\), our target distribution. (You can also use a dynamic length if you are careful! See The No-U-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo by Hoffman and Gelman. This NUTS algorithm is the basis for libraries like Stan and PyMC3.)
Note that we have not yet talked about how to integrate these trajectories; we are just imagining a world in which we could.
The leapfrog integrator
Of course, we need to think about how to solve the differential equations from earlier. As a reminder, $$ \begin{align} \frac{d \mathbf{q}}{dt} &= M^{-1}\mathbf{p} \\ \frac{d \mathbf{p}}{dt} &= \frac{\partial}{\partial \mathbf{q}} \log \pi, \end{align} $$ where \(\mathbf{q}\) is position, \(\mathbf{p}\) is momentum, \(\pi\) is our target distribution, and \(M\) is the mass matrix that we discuss later. We will be using a leapfrog integrator, which is a second-order method. Here it is on our toy distribution, compared with Euler's method (in red).
We can compare this to a leapfrog integrator with even a very large step size, where the errors compound more slowly, though there are eventually some very obvious differences in the trajectories! The per-step error is \(\mathcal{O}(h^3)\), instead of \(\mathcal{O}(h^2)\) for Euler's method (where \(h\) is the step size). Additionally, and arguably more importantly for HMC, the leapfrog integrator is symplectic, so volumes are preserved, the integrator does not spiral out to infinity, and the changes in energy tend to stay bounded.
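Here is a rough sketch of one leapfrog trajectory, reusing the toy `grad_log_prob` from above. This is the textbook scheme (a half step on the momentum, alternating full steps on position and momentum, and a final half step on the momentum), not any particular library's implementation.

```python
def leapfrog(q, p, grad_log_prob, inv_mass, step_size, num_steps):
    """Integrate the Hamiltonian equations with the leapfrog scheme,
    returning the endpoint (q, p) of the trajectory."""
    p = p + 0.5 * step_size * grad_log_prob(q)     # half step on momentum
    for _ in range(num_steps - 1):
        q = q + step_size * (inv_mass @ p)         # full step on position
        p = p + step_size * grad_log_prob(q)       # full step on momentum
    q = q + step_size * (inv_mass @ p)             # final full step on position
    p = p + 0.5 * step_size * grad_log_prob(q)     # final half step on momentum
    return q, p
```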
Choosing a step size
We ignore here how long to integrate, and focus instead on the step size. (Trajectory length is an active area of research! NUTS is commonly used and has a dynamic trajectory length, but ends up costing about twice as much as HMC with a well-chosen fixed trajectory length. Other approaches are ChEES-HMC, which seeks to automatically adapt HMC's trajectory length, and empirical HMC, which uses a fixed distribution of trajectory lengths.) It turns out that HMC's cost scales linearly with the number of leapfrog steps, so 20 steps of length 0.1 will take 10 times as long as 2 steps of length 1. There is less accuracy with longer steps, perhaps disastrously so (Geometric integrators and the Hamiltonian Monte Carlo method by Bou-Rabee and Sanz-Serna is a terrific survey), and a Metropolis-Hastings correction is necessary to account for the integration error.
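As a sketch of how that correction fits into a single HMC transition (reusing the `leapfrog` and toy target functions above; nothing here is specific to Stan, PyMC3, or TensorFlow Probability, and the names are mine):

```python
def hmc_step(q, log_prob, grad_log_prob, mass, inv_mass, step_size, num_steps, rng):
    """One HMC transition with a Metropolis-Hastings accept/reject step."""
    # Resample the momentum from a Gaussian with covariance M.
    p = rng.multivariate_normal(np.zeros_like(q), mass)
    # Total energy (the Hamiltonian): -log pi(q) + 0.5 * p^T M^{-1} p.
    energy_start = -log_prob(q) + 0.5 * p @ inv_mass @ p
    q_new, p_new = leapfrog(q, p, grad_log_prob, inv_mass, step_size, num_steps)
    energy_end = -log_prob(q_new) + 0.5 * p_new @ inv_mass @ p_new
    # Accept with probability min(1, exp(-change in energy)).
    accept_prob = min(1.0, np.exp(energy_start - energy_end))
    if rng.uniform() < accept_prob:
        return q_new, accept_prob
    return q, accept_prob
```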
Balancing step size against integration error is an exercise in stochastic root finding, using something like Robbins-Monro (as in Andrieu and Thoms) or a dual-averaging scheme (as in the NUTS paper), with a goal of accepting around 65% of proposals for HMC, and 80% for NUTS. If your posterior is Gaussian or nearly Gaussian, with covariance eigenvalues \(\lambda_1^2, \ldots, \lambda_N^2 > 0\), A Condition Number for Hamiltonian Monte Carlo by Langmore, Dikovsky, Geraedts, Norgaard, and Von Behren gives the pleasant formula $$ \left(\sum_{n=1}^N\left(\frac{\lambda_1}{\lambda_n}\right)^4\right)^{1/4} \approx \frac{\lambda_1}{h} 2^{7/4}\sqrt{\Phi^{-1}\left(1 - 0.5 P[\text{accept}]\right)}, $$ where \(h\) is the step size and \(\Phi^{-1}\) is the inverse normal CDF.
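To sketch the stochastic root finding itself, here is a deliberately crude Robbins-Monro style update reusing `hmc_step` from above: nudge the log step size toward the target acceptance rate with a shrinking learning rate. This only illustrates the idea and is not the dual-averaging scheme that actual libraries use.

```python
def adapt_step_size(init_step_size, num_warmup=500, target_accept=0.65, seed=0):
    """Crude Robbins-Monro style step size adaptation during warmup,
    using the toy target from above and an identity mass matrix."""
    rng = np.random.default_rng(seed)
    mass = inv_mass = np.eye(2)
    log_step = np.log(init_step_size)
    q = np.zeros(2)
    for t in range(1, num_warmup + 1):
        q, accept_prob = hmc_step(q, log_prob, grad_log_prob, mass, inv_mass,
                                  np.exp(log_step), num_steps=10, rng=rng)
        # Raise the step size when we accept too often, lower it when we
        # reject too often, with a shrinking learning rate.
        log_step += (accept_prob - target_accept) / np.sqrt(t)
    return np.exp(log_step)
```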
Another way to think about integration error
Just for fun, we can consider integrating along a trajectory of fixed length while steadily increasing the step size, and watching the integration error. It is worth mentioning that the total energy (the log probability of the position plus the log probability of the momentum) along a Hamiltonian trajectory should remain constant. If the total energy at a point is more than 500 less than at the start, software like Stan, PyMC3, and TensorFlow Probability will label the trajectory a divergence and short-circuit the integration, rejecting the proposal.
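Here is a sketch of that check, reusing the `leapfrog` and toy target functions from above. The threshold and the exact rule vary between libraries, so treat this as illustrative rather than as any library's actual logic.

```python
def integrate_with_divergence_check(q, p, inv_mass, step_size, num_steps,
                                    threshold=500.0):
    """Integrate one leapfrog step at a time, flagging a divergence when the
    total energy drops too far below its starting value."""
    def total_energy(q, p):
        # Log probability of the position plus log probability of the momentum
        # (up to additive constants); it should stay roughly constant.
        return log_prob(q) - 0.5 * p @ inv_mass @ p

    energy_start = total_energy(q, p)
    for _ in range(num_steps):
        q, p = leapfrog(q, p, grad_log_prob, inv_mass, step_size, num_steps=1)
        if total_energy(q, p) < energy_start - threshold:
            return q, p, True    # divergence: stop integrating and reject
    return q, p, False
```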
Mass matrix adaptation
The mass matrix is the matrix \(M\) from earlier, which we get to choose when we choose a momentum distribution for HMC: the momenta are drawn from a multivariate Gaussian whose covariance is \(M\), and \(M^{-1}\) appears in the equations of motion, $$ \begin{align} \frac{d \mathbf{q}}{dt} &= M^{-1}\mathbf{p} \\ \frac{d \mathbf{p}}{dt} &= \frac{\partial}{\partial \mathbf{q}} \log \pi. \end{align} $$ It may be more pleasant to think of the mass matrix the way it is used in physics: it converts a velocity into a momentum, \(\mathbf{p} = M \, d\mathbf{q}/dt\). Using a mass matrix is equivalent (I can't believe it took this long to cite Radford Neal) to the change of coordinates \(X \rightarrow LX\), where \(M^{-1} = LL^T\). This transformation is shown below:
The HMC sampler is particularly efficient on a standard normal target distribution, so we try to choose a mass matrix so that this equivalent transformation turns our distribution into something close to a standard normal.
More explicitly:
Given an estimate of the posterior covariance \(C\), we will set our mass matrix to be \(C^{-1}\).
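As a small sketch of that recipe (the covariance estimate `C_hat` here is made up for illustration):

```python
# Suppose warmup gave us an estimate C_hat of the posterior covariance.
C_hat = np.array([[1.0, 0.8],
                  [0.8, 1.0]])

mass = np.linalg.inv(C_hat)     # mass matrix M = C^{-1}
inv_mass = C_hat                # inverse mass matrix M^{-1} = C
L = np.linalg.cholesky(C_hat)   # so that M^{-1} = L L^T

rng = np.random.default_rng(0)
# Momenta are drawn with covariance M.
p = rng.multivariate_normal(np.zeros(2), mass)

# The change-of-coordinates view: if Z looks like a standard normal, then
# L @ Z has covariance L L^T = C_hat, so it looks like the target. Running
# HMC with mass matrix M on the target matches running HMC with the identity
# mass matrix on the whitened variable.
z = rng.standard_normal((10_000, 2))
print(np.cov((z @ L.T).T))      # approximately C_hat
```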
In fact, this is how mass matrix estimation works (at least in TensorFlow Probability, Stan, and PyMC3): after a few draws with the identity mass matrix, the mass matrix is set to the inverse of the running sample covariance. Really, there is a windowed scheme: every so often, we throw out earlier draws and only use more recent ones. See this talk for more detail on that. We spend some portion of our computational budget on estimating the mass matrix and step size before fixing these two parameters and actually running MCMC. While changing step sizes and momentum distributions, we are not necessarily sampling from the target distribution, so the tuning samples are discarded. On the other hand, this tuning period obviates the need for a "burn-in", since it usually gives the sampler enough of a chance to get near the typical set.
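And here is a much-simplified sketch of that kind of windowed warmup, reusing `hmc_step` from above. The window length, schedule, and step size rule are all made up, and real implementations are considerably more careful.

```python
def windowed_warmup(num_warmup=1000, window=200, target_accept=0.65, seed=0):
    """Toy windowed adaptation: every `window` draws, reset the inverse mass
    matrix to the covariance of the most recent draws, then discard them."""
    rng = np.random.default_rng(seed)
    mass = inv_mass = np.eye(2)      # start with the identity mass matrix
    step_size, q, draws = 0.5, np.zeros(2), []
    for t in range(1, num_warmup + 1):
        q, accept_prob = hmc_step(q, log_prob, grad_log_prob, mass, inv_mass,
                                  step_size, num_steps=10, rng=rng)
        draws.append(q)
        # Crude step size tuning alongside the mass matrix adaptation.
        step_size *= np.exp((accept_prob - target_accept) / np.sqrt(t))
        if t % window == 0:
            # Real implementations regularize this estimate; we just invert it.
            inv_mass = np.cov(np.asarray(draws).T)
            mass = np.linalg.inv(inv_mass)
            draws = []               # throw out the earlier draws
    # All of these tuning draws are discarded before sampling proper begins.
    return step_size, mass, inv_mass, q
```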
As an illustration of what can go wrong, we run the same adaptation scheme on our running mixture-of-Gaussian example, and we see that it ends up adapting to a single mode.
Conclusions
None of this is particularly novel, but it is hopefully a graphical, intuitive demonstration of some of the parameters that are tuned and discussed in gradient-based MCMC. Remember that pictures are not proof, and that low-dimensional intuition is not always trustworthy in the sorts of high dimensions where we often use HMC or NUTS.
Thanks
Particularly to Ian Langmore, Pavel Sountsov, and Matt Hoffman for patient explanations of much of this.