Count Regression with Variable Exposure

import arviz as az
import bambi as bmb
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.stats import nbinom

az.style.use("arviz-darkgrid")
SEED = 7355608
This example is based on the “Roaches” example from Regression and Other Stories by Gelman, Hill, and Vehtari. The example is a count regression model with an offset term.
The data is the number of roaches caught in 262 apartments. Some pest control treatment was applied to 158 (treatment=1) of the apartments, and 104 apartments received no treatment (treatment=0). The other columns in the data are:
- `y`: the number of roaches caught
- `roach1`: the pre-treatment roach level
- `senior`: indicator for whether the apartment is for seniors
- `exposure2`: the number of trap-days (number of traps x number of days)
roaches = pd.read_csv("data/roaches.csv", index_col=0)
# rescale
roaches["roach1"] = roaches["roach1"] / 100
roaches.head()
| | y | roach1 | treatment | senior | exposure2 |
|---|---|---|---|---|---|
| 1 | 153 | 3.0800 | 1 | 0 | 0.800000 |
| 2 | 127 | 3.3125 | 1 | 0 | 0.600000 |
| 3 | 7 | 0.0167 | 1 | 0 | 1.000000 |
| 4 | 7 | 0.0300 | 1 | 0 | 1.000000 |
| 5 | 0 | 0.0200 | 1 | 0 | 1.142857 |
Poisson regression
One way to model this is to say that there is some rate of roaches per trap-day, \(\rho_i\), and that the number of roaches caught is a Poisson random variable with a rate proportional to the number of trap-days (the exposure). That is:
\[ \begin{align*} y_i &\sim \text{Poisson}(\text{exposure2}_i \times \rho_i) \\ \log(\rho_i) &= \beta_0 + \beta_1 \text{treatment}_i + \beta_2 \text{roach1}_i + \beta_3 \text{senior}_i \end{align*} \]
With a little algebra, we can rewrite this as a generalized linear model:
\[ \begin{align*} y_i &\sim \text{Poisson}(\lambda_i) \\ \log(\lambda_i) &= \beta_0 + \beta_1 \text{treatment}_i + \beta_2 \text{roach1}_i + \beta_3 \text{senior}_i + \log(\text{exposure2}_i) \end{align*} \]
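The rewrite uses only the identity \(\lambda_i = \text{exposure2}_i \times \rho_i\), so that \(\log(\lambda_i) = \log(\rho_i) + \log(\text{exposure2}_i)\). A tiny numerical check of this identity, with made-up values (illustrative only, not part of the original example):

```python
import numpy as np

rng = np.random.default_rng(0)
eta = rng.normal(size=5)                 # hypothetical linear predictor values
exposure = rng.uniform(0.5, 2, size=5)   # hypothetical exposures (trap-days)

# exposure * exp(eta) and exp(eta + log(exposure)) are the same quantity
assert np.allclose(exposure * np.exp(eta), np.exp(eta + np.log(exposure)))
```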
However, we don't want to estimate a coefficient for \(\log(\text{exposure2})\); we want to simply add it as an offset. In `bambi` we do this by wrapping the term in the `offset` function in the formula, which tells the model to add the term to the linear predictor as-is rather than multiply it by an estimated coefficient. The formula for the model is then:

"y ~ roach1 + treatment + senior + offset(log(exposure2))"

If you are familiar with R, this offset term is the same as the offset term in the `glm` function.
# bambi poisson model
model_1 = bmb.Model("y ~ roach1 + treatment + senior + offset(log(exposure2))", family="poisson", data=roaches)
idata_1 = model_1.fit()
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [Intercept, roach1, treatment, senior]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 18 seconds.
az.summary(idata_1)
| | mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
|---|---|---|---|---|---|---|---|---|---|
| Intercept | 3.089 | 0.020 | 3.052 | 3.128 | 0.000 | 0.0 | 4436.0 | 3676.0 | 1.0 |
| roach1 | 0.698 | 0.009 | 0.681 | 0.714 | 0.000 | 0.0 | 3891.0 | 3221.0 | 1.0 |
| senior | -0.381 | 0.032 | -0.442 | -0.321 | 0.001 | 0.0 | 3889.0 | 3001.0 | 1.0 |
| treatment | -0.517 | 0.024 | -0.566 | -0.474 | 0.000 | 0.0 | 4742.0 | 3310.0 | 1.0 |
The sampling seems to have gone well based on `ess` and `r_hat`. If this were a real analysis one would also need to check priors, trace plots, and other diagnostics. However, let's see if the model predicts the observed distribution of roaches (`y`). We can do this by looking at the posterior predictive distribution for the model. We plot the log of `y` to make the results easier to see.
def plot_log_posterior_ppc(model, idata):
    # plot posterior predictive check
    model.predict(idata, kind='response', inplace=True)
    var_name = 'log(y+1)'
    # there is probably a better way
    idata.posterior_predictive[var_name] = np.log(idata.posterior_predictive['y'] + 1)
    idata.observed_data[var_name] = np.log(idata.observed_data['y'] + 1)

    return az.plot_ppc(idata, var_names=[var_name])

plot_log_posterior_ppc(model_1, idata_1)
It appears that we are drastically under-predicting the number of apartments with a small number of roaches. This suggests creating a test statistic that measures the fraction of zeros, both in the observed data and in the simulated replications (posterior predictions). We can then use this to check the model fit.
# check number of zeros in y
def check_zeros(idata):
    # flatten over chains:
    sampled_zeros = (idata.posterior_predictive["y"] == 0).mean("__obs__").values.flatten()
    print(f"Fraction of zeros in the observed data: {np.mean(roaches['y'] == 0)}")
    print(f"Fraction of zeros in the posterior predictive check: {np.mean(sampled_zeros)}")
    print(f" 80% CI: {np.percentile(sampled_zeros, [10, 90])}")

check_zeros(idata_1)
Fraction of zeros in the observed data: 0.35877862595419846
Fraction of zeros in the posterior predictive check: 0.0007022900763358779
80% CI: [0. 0.00381679]
The Poisson model here does not succeed in reproducing the observed fraction of zeros. In the data we have about 36% zeros, while in the replications we almost always have no zeros or very few. Gelman, Hill, and Vehtari suggest we try an overdispersed and more flexible model like the negative binomial.
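To see why a negative binomial can produce so many more zeros than a Poisson with the same mean, here is a small illustrative sketch (not part of the original example; the mean and dispersion values are assumptions, with the dispersion chosen near what the fit below ends up estimating). It uses the \((\mu, \alpha)\) parameterization that PyMC and Bambi use, mapped to scipy's \((n, p)\) via \(n = \alpha\) and \(p = \alpha / (\alpha + \mu)\).

```python
from scipy.stats import nbinom, poisson

mu = 25       # assumed mean roach count, purely for illustration
alpha = 0.27  # dispersion, roughly the value estimated in the fit below

# probability of observing zero roaches under each distribution
p_zero_poisson = poisson.pmf(0, mu)
p_zero_negbinom = nbinom.pmf(0, alpha, alpha / (alpha + mu))

print(f"Poisson P(y=0):           {p_zero_poisson:.2e}")
print(f"Negative binomial P(y=0): {p_zero_negbinom:.2f}")
```

With these assumed values the Poisson puts essentially no mass at zero, while the negative binomial puts a substantial fraction there, which is exactly the flexibility we need.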
Negative Binomial Fit
# bambi negative binomial model
model_2 = bmb.Model("y ~ roach1 + treatment + senior + offset(log(exposure2))", family="negativebinomial", data=roaches)
idata_2 = model_2.fit()
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [alpha, Intercept, roach1, treatment, senior]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 18 seconds.
az.summary(idata_2)
| | mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
|---|---|---|---|---|---|---|---|---|---|
| Intercept | 2.855 | 0.234 | 2.433 | 3.306 | 0.003 | 0.002 | 5333.0 | 3562.0 | 1.0 |
| alpha | 0.272 | 0.026 | 0.224 | 0.322 | 0.000 | 0.000 | 5228.0 | 3369.0 | 1.0 |
| roach1 | 1.326 | 0.258 | 0.853 | 1.806 | 0.004 | 0.003 | 4783.0 | 3311.0 | 1.0 |
| senior | -0.333 | 0.267 | -0.824 | 0.177 | 0.003 | 0.003 | 6074.0 | 3265.0 | 1.0 |
| treatment | -0.787 | 0.250 | -1.257 | -0.327 | 0.003 | 0.002 | 5756.0 | 3379.0 | 1.0 |
plot_log_posterior_ppc(model_2, idata_2)
check_zeros(idata_2)
Fraction of zeros in the observed data: 0.35877862595419846
Fraction of zeros in the posterior predictive check: 0.338175572519084
80% CI: [0.28625954 0.39312977]
def plot_zeros(ax, idata, model_label, **kwargs):
    data_zeros = np.mean(roaches['y'] == 0)
    # flatten over chains:
    sampled_zeros = (idata.posterior_predictive["y"] == 0).mean("__obs__").values.flatten()
    ax.hist(sampled_zeros, alpha=0.5, **kwargs)
    ax.axvline(data_zeros, color='red', linestyle='--')
    ax.set_xlabel("Fraction of zeros")
    ax.set_title(f"Model: {model_label}")
    ax.yaxis.set_visible(False)
    ax.set_facecolor('white')
    return ax

fig, ax = plt.subplots(1, 2, gridspec_kw={'wspace': 0.2})
plot_zeros(ax[0], idata_1, "Poisson", bins=2)  # use 2 bins to make it more clear. Almost no zeros.
plot_zeros(ax[1], idata_2, "Negative Binomial")

fig.legend(["Observed data", "Posterior predictive"], loc='center left', bbox_to_anchor=(0.05, 0.8))
The negative binomial fit works much better, predicting a fraction of zeros consistent with the observed data.

Regression and Other Stories introduces a further improvement, a zero-inflated regression, later in the chapter, but I will not pursue that here; after all, the point of this example is to illustrate the use of offsets.
PyMC equivalent model
The model behind the scenes looks like this for the Poisson model.
pymc_model = model_1.backend
pymc_model.model
\[ \begin{array}{rcl} \text{Intercept} &\sim & \operatorname{Normal}(0,~4.52)\\\text{roach1} &\sim & \operatorname{Normal}(0,~3.33)\\\text{treatment} &\sim & \operatorname{Normal}(0,~5.11)\\\text{senior} &\sim & \operatorname{Normal}(0,~5.43)\\\text{mu} &\sim & \operatorname{Deterministic}(f(\text{senior},~\text{treatment},~\text{roach1},~\text{Intercept}))\\\text{y} &\sim & \operatorname{Poisson}(\text{mu}) \end{array} \]
Let's look at the equivalent (Poisson) model written directly in PyMC:
# recreate the model using pymc
import pymc as pm

with pm.Model() as model_pymc:
    # priors
    alpha = pm.Normal("Intercept", mu=0, sigma=4.5)
    beta_roach1 = pm.Normal("beta_roach1", mu=0, sigma=3.3)
    beta_treatment = pm.Normal("beta_treatment", mu=0, sigma=5.11)
    beta_senior = pm.Normal("beta_senior", mu=0, sigma=5.43)

    # likelihood
    mu = pm.math.exp(alpha + beta_roach1 * roaches["roach1"] +
                     beta_treatment * roaches["treatment"] +
                     beta_senior * roaches["senior"] +
                     pm.math.log(roaches["exposure2"]))  # no beta for exposure2
    y = pm.Poisson("y", mu=mu, observed=roaches["y"])

    idata_pymc = pm.sample(1000)

az.summary(idata_pymc)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [Intercept, beta_roach1, beta_treatment, beta_senior]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 18 seconds.
| | mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
|---|---|---|---|---|---|---|---|---|---|
| Intercept | 3.089 | 0.022 | 3.049 | 3.131 | 0.000 | 0.0 | 2079.0 | 2250.0 | 1.0 |
| beta_roach1 | 0.698 | 0.009 | 0.681 | 0.715 | 0.000 | 0.0 | 2756.0 | 2731.0 | 1.0 |
| beta_senior | -0.380 | 0.033 | -0.444 | -0.318 | 0.001 | 0.0 | 3018.0 | 2363.0 | 1.0 |
| beta_treatment | -0.517 | 0.025 | -0.565 | -0.469 | 0.001 | 0.0 | 2520.0 | 2340.0 | 1.0 |
In this model (`model_pymc`) we have the equivalent Poisson regression with everything written out explicitly, to illustrate what the `offset` function is doing. It simply makes it possible to express a term like this in the `formulae` string of a `bambi` model.
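For contrast, a hypothetical variant that is not fit in the original example: if `log(exposure2)` were included as an ordinary predictor instead of being wrapped in `offset()`, bambi would estimate a coefficient for it rather than fixing it at 1.

```python
# hypothetical contrast (sketch only): log(exposure2) as a regular term gets
# its own estimated coefficient, instead of being added to the linear
# predictor as-is like the offset version used above
model_alt = bmb.Model(
    "y ~ roach1 + treatment + senior + log(exposure2)",
    family="poisson",
    data=roaches,
)
```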