A tour of probabilistic programming language APIs
What does it look like to do MCMC in different frameworks?
Introduction
I wanted an easy reference for myself and others to see how different developers think about defining probabilistic models, and this is an attempt at that. I have a number of biases: I am a contributor to PyMC3, and have been working on PyMC4 (which uses TensorFlow Probability), and I write far more Python than R, and far more R than Julia or C++. But the code shown here is open-source, and contributions/suggestions are gratefully accepted.

I am also including an opinionated - but I hope fair - summary of the projects as of the time of writing.

On my TODO list are greta in R; Turing and Gen in Julia; and Figaro and Rainier in Scala.

Different libraries are written for different reasons, and I care strongly about easily writing and sharing models, and then sampling from them.
The Task
I have generated a dataset as follows:
import numpy as np

np.random.seed(0)

ndims = 5
ndata = 100

X = np.random.randn(ndata, ndims)
w_ = np.random.randn(ndims)  # hidden
noise_ = 0.1 * np.random.randn(ndata)  # hidden
y_obs = X.dot(w_) + noise_
 
        For each probabilistic programming language (PPL), I will:
- Write down the following model: $$ \begin{align} \mathbf{w} &\sim \mathcal{N}(\mathbf{0}, I_5)\\ \mathbf{y} | X, \mathbf{w} &\sim \mathcal{N}(X\mathbf{w}, 0.1I_{100}), \end{align} $$ where \(I_n\) is the \(n \times n\) identity matrix.
- Draw 1,000 samples from the posterior distribution
  $$
  p(\mathbf{w} | X, \mathbf{y}) \propto p(\mathbf{y} | X, \mathbf{w}) p(\mathbf{w}),
  $$
  where I pass y_obs to the code at some point.
- Do a sanity check that the samples for w are reasonably near the true values of w_, given the model. For those keeping track at home, the posterior of the model given the data is also Gaussian: $$p(\mathbf{w} | X, \mathbf{y}) = \mathcal{N}\left((X^TX + \sigma^2 I_5)^{-1}X^T\mathbf{y},\ \sigma^2(X^TX + \sigma^2 I_5)^{-1}\right),$$ where \(\sigma = 0.1\) is the observation noise scale.
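As a concrete version of that last check, here is a minimal sketch computing the analytic posterior directly (reusing X, y_obs, and ndims from the data-generation code; the names post_mean and post_cov are my own):

sigma = 0.1  # observation noise scale from the model
# Posterior precision for a N(0, I) prior and N(Xw, sigma^2 I) likelihood
precision = X.T.dot(X) / sigma**2 + np.eye(ndims)
post_cov = np.linalg.inv(precision)
post_mean = post_cov.dot(X.T.dot(y_obs)) / sigma**2

Each library's draws for w can then be compared against post_mean and post_cov.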
PyStan
version 2.19.0.0. Documentation for PyStan, and for Stan itself.
Hand-rolled automatic differentiation in C++. Many (most?) good ideas in PyMC3 came from here. Updates to the code sometimes accompany papers describing the new algorithms. The authors/contributors care deeply about correctness and practicality, and it shows. A very good, fast choice if you care about MCMC. The models are written in their own programming language, which can be a little funny to use from Python (where I would prefer Python objects to strings). By default, Stan uses the dynamic NUTS algorithm, which is currently the best default to use, though only two libraries here have a good version. (The Stan team actually keeps changing/improving the sampling algorithm they use, and calling it NUTS is at least a little misleading, since these improvements are substantive compared to the algorithm described in the paper. If you want to be proper, call it something like "dynamic Hamiltonian Monte Carlo with multinomial sampling of dynamic-length trajectories, a generalized termination criterion, and improved adaptation of the Euclidean metric.")
import pystan
linear_regression = """
data {
  int N;   // number of data items
  int K;   // number of predictors
  matrix[N, K] X;   // predictor matrix
  vector[N] y;      // outcome vector
}
parameters {
  vector[K] w;       // coefficients for predictors
}
model {
  w ~ normal(0, 1);        // prior
  y ~ normal(X * w, 0.1);  // likelihood
}
"""
linear_data = {'N': ndata,
               'K': ndims,
               'y': y_obs,
               'X': X}
sm = pystan.StanModel(model_code=linear_regression)
fit = sm.sampling(data=linear_data, iter=1000, chains=4)  
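To pull the draws out for the sanity check, the fit object can be converted to arrays; a minimal sketch:

# Post-warmup draws for w across all chains, shape (num_draws, ndims)
w_samples = fit.extract()['w']
print(w_samples.mean(axis=0))  # compare against post_mean above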
        PyMC3
version 3.7. Uses Theano for automatic differentiation. The contributors are strong, all the diagnostics are good looking, and all the estimates are above average (see my biases above). More seriously, it is quite fast, flexible if you are willing to write Theano, and has robust tuning and diagnostics, so you can get samples from a model with a minimum of fuss. It interoperates quite smoothly with Python projects. Also uses the NUTS algorithm by default.
import pymc3 as pm
import theano.tensor as tt
with pm.Model():
    w = pm.Normal('w', 0, 1, shape=ndims)
    y = pm.Normal('y', tt.dot(X, w), 0.1, observed=y_obs)
    trace = pm.sample(1000)
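The returned trace behaves like a dict of arrays keyed by variable name, so the sanity check is one line; a sketch:

# Posterior draws for w have shape (num_draws, ndims)
print(trace['w'].mean(axis=0))  # compare against post_mean above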
        emcee
version 2.2.1. This library is a "pure-Python implementation of Goodman & Weare's Affine Invariant Markov chain Monte Carlo (MCMC) Ensemble sampler", and does not use gradients. This means it does not scale as well beyond, say, 10 dimensions, but installation is very easy. It is also somewhat unique in having you write custom likelihood and prior density functions. There are lots of examples and users in the astrophysics community.
import scipy.stats as st
import emcee
# log likelihood
def lnlike(w, X, y):
    model = X.dot(w)
    inv_sigma2 = 0.1 ** -2
    return -0.5*(np.sum((y-model)**2)*inv_sigma2 - np.log(inv_sigma2))
# Define a prior for w
w_rv = st.multivariate_normal(np.zeros(ndims), np.eye(ndims))
# Log probability for w
lnprior = w_rv.logpdf
# logp(w | X, y) = logp(y | X, w) + logp(w)
def lnprob(w, X, y):
    return lnprior(w) + lnlike(w, X, y)
nwalkers = 100
pos = w_rv.rvs(size=nwalkers)
sampler = emcee.EnsembleSampler(nwalkers, ndims, lnprob, args=(X, y_obs))
pos, lprob, rstate  = sampler.run_mcmc(pos, 1000)
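The run above returns only the final walker positions; the full history lives on the sampler. To check the estimates, one can drop a burn-in and average the flattened chain; a sketch (the 500-step burn-in is my own choice):

# sampler.chain has shape (nwalkers, nsteps, ndims); discard burn-in and flatten
flat_samples = sampler.chain[:, 500:, :].reshape(-1, ndims)
print(flat_samples.mean(axis=0))  # compare against post_mean above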
        Pyro
version 0.3.4. Uses PyTorch for automatic differentiation. I get the impression the library traditionally cared more about variational inference, but running HMC was quite smooth. (As with TensorFlow Probability, the HMC implementation is parametrized by number of steps rather than path length, which has some performance implications I still do not entirely grasp: taking smaller steps with a fixed path length means each sample takes longer, while keeping a fixed number of steps means each sample takes the same amount of time but is more correlated.) I originally wrote that there was no NUTS and that you may have to manually set the number of steps, but that was clearly documented and I just missed it: Pyro implements NUTS as well as HMC, and I updated the code example below to use it.
import pyro
import torch
from pyro.infer.mcmc import NUTS, MCMC
import pyro.distributions as dist
def model(X):
    w = pyro.sample('w', dist.Normal(torch.zeros(ndims), torch.ones(ndims)))
    y = pyro.sample('y', dist.Normal(torch.matmul(X, w), 0.1 * torch.ones(ndata)),
                    obs=torch.as_tensor(y_obs, dtype=torch.float32))
    return y
nuts_kernel = NUTS(model, adapt_step_size=True)
py_mcmc = MCMC(nuts_kernel, num_samples=1_000, warmup_steps=500)
py_mcmc = py_mcmc.run(torch.as_tensor(X, dtype=torch.float32))
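To inspect the draws under the 0.3.x API, my recollection is that one wraps the run in an EmpiricalMarginal; treat this as a sketch:

from pyro.infer import EmpiricalMarginal

# Marginal posterior over the 'w' site; .mean gives a Monte Carlo estimate
w_marginal = EmpiricalMarginal(py_mcmc, sites='w')
print(w_marginal.mean)  # compare against post_mean above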
        TensorFlow Probability
version 0.8.0-dev20190721. Uses TensorFlow for automatic differentiation. This also uses raw HMC parametrized by the number of steps (see the Pyro section for commentary). This is the most verbose of the libraries reviewed here, which can be good (more control) and bad (this was four lines in PyMC3). I actually had a bug for a while where I had w_dist = tfd.Normal(loc=tf.zeros(ndims), scale=1.0, name="w"): because tfp is so flexible, this defines a valid model, just the wrong one. The computations are also all vectorized: I can sample 1,000 chains in about the same time I sample 4 chains, which seems pretty unique (blog post coming on that). These are some of the reasons why one might write a high-level, user-friendly API on top of this library and name it PyMC4.
import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions
X_tensor = tf.convert_to_tensor(X, dtype='float32')
@tf.function
def target_log_prob_fn(w):
    w_dist = tfd.Normal(loc=tf.zeros((ndims, 1)), scale=1.0, name="w")
    w_prob = tf.reduce_sum(w_dist.log_prob(w))
    y_dist = tfd.Normal(loc=tf.matmul(X_tensor, w), scale=0.1, name="y")
    y_prob = tf.reduce_sum(y_dist.log_prob(y_obs.reshape(-1, 1)))
    return w_prob + y_prob
# Initialize the HMC transition kernel.
num_results = 1000
num_burnin_steps = 500
adaptive_hmc = tfp.mcmc.SimpleStepSizeAdaptation(
    tfp.mcmc.HamiltonianMonteCarlo(
        target_log_prob_fn=target_log_prob_fn,
        num_leapfrog_steps=4,
        step_size=0.01),
    num_adaptation_steps=int(num_burnin_steps * 0.8))
samples, is_accepted = tfp.mcmc.sample_chain(
    num_results=num_results,
    num_burnin_steps=num_burnin_steps,
    current_state=tf.zeros((ndims, 1)),
    kernel=adaptive_hmc,
    trace_fn=lambda _, pkr: pkr.inner_results.is_accepted)
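As a quick check on the chain, one can look at the acceptance rate alongside the posterior-mean estimate; a sketch (samples has shape (num_results, ndims, 1) here):

w_hat = tf.reduce_mean(samples, axis=0)  # posterior-mean estimate, shape (ndims, 1)
accept_rate = tf.reduce_mean(tf.cast(is_accepted, tf.float32))  # HMC acceptance rate
print(w_hat, accept_rate)  # compare w_hat against post_mean above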
        Edward 2
version 0.8.0-dev20190721. Uses TensorFlow for automatic differentiation. This is a submodule of TensorFlow Probability that provides a flexible API based on the ideas from the popular Edward library (the same developer, so the similarity is not surprising). Note that the biggest change from the tfp code above is that defining the joint probability is a bit cleaner, but we still use the same code to actually get the samples.
from tensorflow_probability import edward2 as ed
import tensorflow as tf
import tensorflow_probability as tfp  # needed for the sampling code reused from the TFP section
X_tensor = tf.convert_to_tensor(X, dtype='float32')
def linear_regression(X):
    w = ed.Normal(loc=tf.zeros((ndims, 1)), scale=1.0, name="w")
    y = ed.Normal(loc=tf.matmul(X, w), scale=0.1, name='y')
    return y
log_joint = ed.make_log_joint_fn(linear_regression)
def target_log_prob_fn(w):
    return log_joint(X_tensor, w=w, y=y_obs.reshape(-1, 1))
# Below here is from TFP section
num_results = 1000
num_burnin_steps = 500
adaptive_hmc = tfp.mcmc.SimpleStepSizeAdaptation(
    tfp.mcmc.HamiltonianMonteCarlo(
        target_log_prob_fn=target_log_prob_fn,
        num_leapfrog_steps=4,
        step_size=0.01),
    num_adaptation_steps=int(num_burnin_steps * 0.8))
ed_samples, is_accepted = tfp.mcmc.sample_chain(
    num_results=num_results,
    num_burnin_steps=num_burnin_steps,
    current_state=tf.zeros((ndims, 1)),
    kernel=adaptive_hmc,
    trace_fn=lambda _, pkr: pkr.inner_results.is_accepted)
        Numpyro
version 0.1.0. Uses JAX for automatic differentiation. Built by the same people who work on Pyro, and includes a very cool iterative implementation of the NUTS algorithm (I have not looked closely enough to see whether this is "NUTS as in the 2013 paper" or "NUTS as in the algorithm implemented by Stan/PyMC3"). The import section looks a little overwhelming, but jax is a wonderful library that wraps numpy functionality; a more typical usage would be to import jax.numpy as np, so that no one would be any wiser that you are using autodiff. This looks very similar to the Pyro implementation, but notice I am jumping through fewer hoops to convert my numpy arrays into torch tensors.
import jax.numpy as jnp
from jax import random
import numpyro.distributions as dist
from numpyro.handlers import sample
from numpyro.hmc_util import initialize_model
from numpyro.mcmc import mcmc
def model(X):
    w = sample('w', dist.Normal(jnp.zeros(ndims), jnp.ones(ndims)))
    y = sample('y', dist.Normal(jnp.matmul(X, w), 0.1 * jnp.ones(ndata)), obs=y_obs)
rng = random.PRNGKey(0)
init_params, potential_fn, constrain_fn = initialize_model(rng, model, X=X)
num_warmup, num_samples = 1000, 2000
# Run NUTS.
npyro_samples = mcmc(num_warmup, num_samples, init_params,
                 potential_fn=potential_fn,
                 trajectory_length=10,
                 constrain_fn=constrain_fn)
        Brancher
version 0.3.4. NOTE: This is a very new library. As far as I can tell, it does not implement MCMC (yet?), nor can I calculate log probabilities to verify my implementation. It looks nice, though!

Uses PyTorch for automatic differentiation. This came out this week, and I gave it a test drive, but could not do much damage with it yet.
from brancher.variables import ProbabilisticModel
from brancher.standard_variables import NormalVariable
from brancher import inference
import brancher.functions as BF
import torch

X_tensor = torch.as_tensor(X, dtype=torch.float32)

# Model
w = NormalVariable(loc=torch.zeros(ndims), scale=1., name="w")
y = NormalVariable(loc=BF.matmul(X_tensor, w), scale=0.1, name="y")
y.observe(y_obs)
model = ProbabilisticModel([w, y])
PyMC4
version 0.0.1. NOTE: Still working on the API; don't use this yet.

Uses TensorFlow Probability (and hence TensorFlow) for automatic differentiation. I am including this to show what the model definition syntax looks like right now, though some work needs to happen to wire the model through to the proper TensorFlow Probability functions.
import pymc4
import tensorflow as tf
@pymc4.model()
def linear_model():
    w = yield pymc4.distributions.Normal('w', mu=np.zeros((5, 1)), sigma=1.)
    y = yield pymc4.distributions.Normal('y', mu=tf.matmul(X, w), sigma=0.1)
        PyProb
version 0.13.0. Joins emcee in not using a gradient-based sampler -- here we use importance sampling. I also had to cheat a little bit and take 50,000 samples instead of 1,000 to get the estimates to be reasonable. The samples are pretty fast, but I suspect this does not scale to higher dimensions (now I have a paper to read about how importance sampling can scale), though I also suspect it allows for defining a broader class of models than most of the other languages. Defining this model was pretty nice, and I am surprised more languages do not subclass a Model object.
import torch
import pyprob
from pyprob import Model
from pyprob.distributions import Normal
class LinearModel(Model):
    def forward(self):
        X_tensor = torch.as_tensor(X, dtype=torch.float32)
        w = pyprob.sample(Normal(torch.zeros(ndims), torch.ones(ndims)))
        y = Normal(torch.matmul(X_tensor, w), 0.1 * torch.ones(ndata))
        pyprob.observe(y, name='y_obs')
        return w
model = LinearModel()
posterior = model.posterior_distribution(
    num_traces=50_000,
    inference_engine=pyprob.InferenceEngine.IMPORTANCE_SAMPLING,
    observe={'y_obs': torch.as_tensor(y_obs, dtype=torch.float32)})
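If I remember the pyprob API right, the returned posterior is an Empirical distribution over the traced return value (w here), so its mean is the importance-sampling estimate; treat this as a sketch:

# Importance-sampling estimate of the posterior mean of w
print(posterior.mean)  # compare against post_mean above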
        Accuracy
I have confidence that all the libraries implement their algorithm of choice correctly, and I tried to write a simple model that would provide some reasonable samples from the correct posterior. This section only tries to verify that my implementations are reasonable.
To say again with a stronger font, any errors shown here are (probably) MCMC error, and should not be interpreted as an implementation being "good" or "bad". It is a bigger, more careful job to benchmark performance, which is definitely not what this chart does.
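For a concrete version of that check, here is the comparison I have in mind, reusing the analytic post_mean computed in the task section (w_samples is whatever array of posterior draws a given library returns; the function name is my own):

def max_abs_error(w_samples):
    """Largest absolute deviation of a library's posterior-mean
    estimate from the analytic posterior mean."""
    w_hat = np.asarray(w_samples).mean(axis=0).reshape(-1)
    return np.abs(w_hat - post_mean).max()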
