# Beta Regression#

This example has been contributed by Tyler James Burch (@tjburch on GitHub).

:

import arviz as az
import bambi as bmb
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy import stats
from scipy.special import expit

:

az.style.use("arviz-darkgrid")


In this example, we’ll look at using the Beta distribution for regression models. The Beta distribution is a probability distribution bounded on the interval [0, 1], which makes it well-suited to model probabilities or proportions. In fact, in much of the Bayesian literature, the Beta distribution is introduced as a prior distribution for the probability $$p$$ parameter of the Binomial distribution (in fact, it’s the conjugate prior for the Binomial distribution).

## Simulated Beta Distribution#

To start getting an intuitive sense of the Beta distribution, we’ll model coin flipping probabilities. Say we grab all the coins out of our pocket, we might have some fresh from the mint, but we might also have some old ones. Due to the variation, some may be slightly biased toward heads or tails, and our goal is to model distribution of the probabilities of flipping heads for the coins in our pocket.

Since we trust the mint, we’ll say the $$\alpha$$ and $$\beta$$ are both large, we’ll use 1,000 for each, which gives a distribution spanning from 0.45 to 0.55.

:

alpha = 1_000
beta = 1_000
p = np.random.beta(alpha, beta, size=10_000)
az.plot_kde(p)
plt.xlabel("$p$"); Next, we’ll use Bambi to try to recover the parameters of the Beta distribution. Since we have no predictors, we can do a intercept-only model to try to recover them.

:

data = pd.DataFrame({"probabilities": p})
model = bmb.Model("probabilities ~ 1", data, family="beta")
fitted = model.fit()

Auto-assigning NUTS sampler...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [Intercept, probabilities_kappa]

100.00% [8000/8000 00:02<00:00 Sampling 4 chains, 0 divergences]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 3 seconds.

:

az.plot_trace(fitted); :

az.summary(fitted)

:

mean sd hdi_3% hdi_97% mcse_mean mcse_sd ess_bulk ess_tail r_hat
Intercept 0.000 0.00 -0.000 0.001 0.000 0.000 3947.0 3081.0 1.0
probabilities_kappa 1988.336 28.22 1935.757 2041.255 0.444 0.314 4043.0 3007.0 1.0

The model fit, but clearly these parameters are not the ones that we used above. For Beta regression, we use a linear model for the mean, so we use the $$\mu$$ and $$\sigma$$ formulation. To link the two, we use

$$\alpha = \mu \kappa$$

$$\beta = (1-\mu)\kappa$$

and $$\kappa$$ is a function of the mean and variance,

$$\kappa = \frac{\mu(1-\mu)}{\sigma^2} - 1$$

Rather than $$\sigma$$, you’ll note Bambi returns $$\kappa$$. We’ll define a function to retrieve our original parameters.

:

def mukappa_to_alphabeta(mu, kappa):
# Calculate alpha and beta
alpha = mu * kappa
beta = (1 - mu) * kappa

# Get mean values and 95% HDIs
alpha_mean = alpha.mean(("chain", "draw")).item()
alpha_hdi = az.hdi(alpha, hdi_prob=.95)["x"].values
beta_mean = beta.mean(("chain", "draw")).item()
beta_hdi = az.hdi(beta, hdi_prob=.95)["x"].values

return alpha_mean, alpha_hdi, beta_mean, beta_hdi

alpha, alpha_hdi, beta, beta_hdi = mukappa_to_alphabeta(
expit(fitted.posterior["Intercept"]),
fitted.posterior["probabilities_kappa"]
)

print(f"Alpha - mean: {np.round(alpha)}, 95% HDI: {np.round(alpha_hdi)} - {np.round(alpha_hdi)}")
print(f"Beta - mean: {np.round(beta)}, 95% HDI: {np.round(beta_hdi)} - {np.round(beta_hdi)}")

Alpha - mean: 994.0, 95% HDI: 968.0 - 1024.0
Beta - mean: 994.0, 95% HDI: 968.0 - 1023.0

:

def mukappa_to_alphabeta(mu, kappa):
# Calculate alpha and beta
alpha = mu * kappa
beta = (1 - mu) * kappa

# Get mean values and 95% HDIs
alpha_mean = alpha.mean(("chain", "draw")).item()
alpha_hdi = az.hdi(alpha, hdi_prob=.95)["x"].values
beta_mean = beta.mean(("chain", "draw")).item()
beta_hdi = az.hdi(beta, hdi_prob=.95)["x"].values

return alpha_mean, alpha_hdi, beta_mean, beta_hdi

alpha, alpha_hdi, beta, beta_hdi = mukappa_to_alphabeta(
expit(fitted.posterior["Intercept"]),
fitted.posterior["probabilities_kappa"]
)

print(f"Alpha - mean: {np.round(alpha)}, 95% HDI: {np.round(alpha_hdi)} - {np.round(alpha_hdi)}")
print(f"Beta - mean: {np.round(beta)}, 95% HDI: {np.round(beta_hdi)} - {np.round(beta_hdi)}")

Alpha - mean: 994.0, 95% HDI: 968.0 - 1024.0
Beta - mean: 994.0, 95% HDI: 968.0 - 1023.0


We’ve managed to recover our parameters with an intercept-only model.

## Beta Regression with Predictors#

Perhaps we have a little more information on the coins in our pocket. We notice that the coins have accumulated dirt on either side, which would shift the probability of getting a tails or heads. In reality, we would not know how much the dirt affects the probability distribution, we would like to recover that parameter. We’ll construct this toy example by saying that each micron of dirt shifts the $$\alpha$$ parameter by 5.0. Further, the amount of dirt is distributed according to a Half Normal distribution with a standard deviation of 25 per side.

We’ll start by looking at the difference in probability for a coin with a lot of dirt on either side.

:

effect_per_micron = 5.0

# Clean Coin
alpha = 1_000
beta = 1_000
p = np.random.beta(alpha, beta, size=10_000)

p_heads = np.random.beta(alpha + 50 * effect_per_micron, beta, size=10_000)
p_tails = np.random.beta(alpha - 50 * effect_per_micron, beta, size=10_000)

az.plot_kde(p, label="Clean Coin")
az.plot_kde(p_tails, label="Biased toward tails", plot_kwargs={"color":"C2"})
plt.xlabel("$p$")
plt.ylim(top=plt.ylim()*1.25); Next, we’ll generate a toy dataset according to our specifications above. As an added foil, we will also assume that we’re limited in our measuring equipment, that we can only measure correctly to the nearest integer micron.

:

# Create amount of dirt on top and bottom
tails_bias_dirt = stats.halfnorm(loc=0, scale=25).rvs(size=1_000)

# Create the probability per coin
alpha = np.repeat(1_000, 1_000)
alpha = alpha + effect_per_micron * heads_bias_dirt - effect_per_micron * tails_bias_dirt
beta = np.repeat(1_000, 1_000)

p = np.random.beta(alpha, beta)

df = pd.DataFrame({
"p" : p,
"tails_bias_dirt" : tails_bias_dirt.round()
})

:

0 0.476742 5.0 32.0
1 0.465821 12.0 20.0
2 0.474213 0.0 15.0
3 0.573222 54.0 5.0
4 0.490487 23.0 29.0

Taking a look at our new dataset:

:

fig,ax = plt.subplots(1,3, figsize=(16,5))

df["p"].plot.kde(ax=ax)
ax.set_xlabel("$p$")

ax.set_xlabel("Measured Dirt Biasing Toward Heads ($\mu m$)")
df["tails_bias_dirt"].plot.hist(ax=ax, bins=np.arange(0,df["tails_bias_dirt"].max()))
ax.set_xlabel("Measured Dirt Biasing Toward Tails ($\mu m$)"); Next we want to make a model to recover the effect per micron of dirt per side. So far, we’ve considered the biasing toward one side or another independently. A linear model might look something like this:

$p \text{ ~ Beta}(:nbsphinx-math:mu, \sigma)$

$$logit(\mu) = \text{ Normal}( \alpha + \beta_h d_h + \beta_t d_t)$$

Where $$d_h$$ and $$d_t$$ are the measured dirt (in microns) biasing the probability toward heads and tails respectively, $$\beta_h$$ and $$\beta_t$$ are coefficients for how much a micron of dirt affects each independent side, and $$\alpha$$ is the intercept. Also note the logit link function used here, since our outcome is on the scale of 0-1, it makes sense that the link must also put our mean on that scale. Logit is the default link function, however Bambi supports the identity, probit, and cloglog links as well.

In this toy example, we’ve constructed it such that dirt should not affect one side differently from another, so we can wrap those into one coefficient: $$\beta = \beta_h = -\beta_t$$. This makes the last line of the model:

$$logit(\mu) = \text{ Normal}( \alpha + \beta \Delta d)$$

where

$$\Delta d = d_h - d_t$$

Putting that into our dataset, then constructing this model in Bambi,

:

df["delta_d"] = df["heads_bias_dirt"] - df["tails_bias_dirt"]
dirt_model = bmb.Model("p ~ delta_d", df, family="beta")
dirt_fitted = dirt_model.fit()
dirt_model.predict(dirt_fitted, kind="pps")

Auto-assigning NUTS sampler...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [Intercept, delta_d, p_kappa]

100.00% [8000/8000 00:03<00:00 Sampling 4 chains, 0 divergences]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 4 seconds.

:

az.summary(dirt_fitted)

:

mean sd hdi_3% hdi_97% mcse_mean mcse_sd ess_bulk ess_tail r_hat
Intercept -0.003 0.001 -0.005 0.000 0.000 0.000 5312.0 3078.0 1.0
delta_d 0.005 0.000 0.005 0.005 0.000 0.000 4881.0 2684.0 1.0
p_kappa 1899.666 85.987 1732.301 2055.125 1.141 0.812 5688.0 3324.0 1.0
:

az.plot_ppc(dirt_fitted); Next, we’ll see if we can in fact recover the effect on $$\alpha$$. Remember that in order to return to $$\alpha$$, $$\beta$$ space, the linear equation passes through an inverse logit transformation, so we must apply this to the coefficient on $$\Delta d$$ to get the effect on $$\alpha$$. The inverse logit is nicely defined in scipy.special as expit.

:

mean_effect = expit(dirt_fitted.posterior.delta_d.mean())
hdi = az.hdi(dirt_fitted.posterior.delta_d, hdi_prob=.95)
lower = expit(hdi.delta_d)
upper = expit(hdi.delta_d)
print(f"Mean effect: {mean_effect.item():.4f}")
print(f"95% interval {lower.item():.4f} - {upper.item():.4f}")

Mean effect: 0.5013
95% interval 0.5013 - 0.5013


The recovered effect is very close to the true effect of 0.5.

## Example - Revisiting Baseball Data#

In the Hierarchical Logistic regression with Binomial family notebook, we modeled baseball batting averages (times a player reached first via a hit per times at bat) via a Hierarchical Logisitic regression model. If we’re interested in league-wide effects, we could look at a Beta regression. We work off the assumption that the league-wide batting average follows a Beta distribution, and that individual player’s batting averages are samples from that distribution.

First, load the Batting dataset again, and re-calculate batting average as hits/at-bat. In order to make sure that we have a sufficient sample, we’ll require at least 100 at-bats in order consider a batter. Last, we’ll focus on 1990-2018.

:

batting = bmb.load_data("batting")

:

batting["batting_avg"] = batting["H"] / batting["AB"]
batting = batting[batting["AB"] > 100]
df = batting[ (batting["yearID"] > 1990) & (batting["yearID"] < 2018) ]

:

df.batting_avg.hist(bins=30)
plt.xlabel("Batting Average")
plt.ylabel("Count"); If we’re interested in modeling the distribution of batting averages, we can start with an intercept-only model.

:

model_avg = bmb.Model("batting_avg ~ 1", df, family="beta")
avg_fitted = model_avg.fit()

Auto-assigning NUTS sampler...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [Intercept, batting_avg_kappa]

100.00% [8000/8000 00:02<00:00 Sampling 4 chains, 0 divergences]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 3 seconds.

:

az.summary(avg_fitted)

:

mean sd hdi_3% hdi_97% mcse_mean mcse_sd ess_bulk ess_tail r_hat
Intercept -1.038 0.002 -1.041 -1.035 0.000 0.000 3745.0 2913.0 1.0
batting_avg_kappa 152.524 2.003 148.859 156.326 0.032 0.023 3833.0 2598.0 1.0

Looking at the posterior predictive,

:

posterior_predictive = model_avg.predict(avg_fitted, kind="pps")

:

az.plot_ppc(avg_fitted); This appears to fit reasonably well. If, for example, we were interested in simulating players, we could sample from this distribution.

However, we can take this further. Say we’re interested in understanding how this distribution shifts if we know a player’s batting average in a previous year. We can condition the model on a player’s n-1 year, and use Beta regrssion to model that. Let’s construct that variable, and for sake of ease, we will ignore players without previous seasons.

:

# Add the player's batting average in the n-1 year
batting["batting_avg_shift"] = np.where(
batting["playerID"] == batting["playerID"].shift(),
batting["batting_avg"].shift(),
np.nan
)
df_shift = batting[ (batting["yearID"] > 1990) & (batting["yearID"] < 2018) ]
df_shift = df_shift[~df_shift["batting_avg_shift"].isna()]
df_shift[["batting_avg_shift","batting_avg"]].corr()

:

batting_avg_shift batting_avg
batting_avg_shift 1.000000 0.229774
batting_avg 0.229774 1.000000

There is a lot of variance in year-to-year batting averages, it’s not known to be incredibly predictive, and we see that here. A correlation coefficient of 0.23 is only lightly predictive. However, we can still use it in our model to get a better understanding. We’ll fit two models. First, we’ll refit the previous, intercept-only, model using this updated dataset so we have an apples-to-apples comparison. Then, we’ll fit a model using the previous year’s batting average as a predictor.

:

model_avg = bmb.Model("batting_avg ~ 1", df_shift, family="beta")
avg_fitted = model_avg.fit()

model_lag = bmb.Model("batting_avg ~ batting_avg_shift", df_shift, family="beta")
lag_fitted = model_lag.fit()

Auto-assigning NUTS sampler...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [Intercept, batting_avg_kappa]

100.00% [8000/8000 00:01<00:00 Sampling 4 chains, 0 divergences]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 2 seconds.
Auto-assigning NUTS sampler...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [Intercept, batting_avg_shift, batting_avg_kappa]

100.00% [8000/8000 00:02<00:00 Sampling 4 chains, 0 divergences]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 3 seconds.

:

az.summary(lag_fitted)

:

mean sd hdi_3% hdi_97% mcse_mean mcse_sd ess_bulk ess_tail r_hat
Intercept -1.376 0.072 -1.515 -1.247 0.001 0.001 5275.0 3044.0 1.0
batting_avg_shift 1.354 0.275 0.850 1.873 0.004 0.003 5402.0 3106.0 1.0
batting_avg_kappa 135.771 9.640 117.762 153.337 0.141 0.100 4702.0 2753.0 1.0
:

az.compare({
"intercept-only" : avg_fitted,
"lag-model": lag_fitted
})

:

rank elpd_loo p_loo elpd_diff weight se dse warning scale
lag-model 0 784.914632 3.108452 0.000000 0.995738 14.535129 0.000000 False log
intercept-only 1 774.082566 2.153033 10.832065 0.004262 15.343778 4.644965 False log

Adding the predictor results in a higher loo than the intercept-only model.

:

ppc= model_lag.predict(lag_fitted, kind="pps")
az.plot_ppc(lag_fitted);

---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
/home/osvaldo/proyectos/00_BM/bambi/docs/notebooks/beta_regression.ipynb Cell 47' in <cell line: 2>()
<a href='vscode-notebook-cell:/home/osvaldo/proyectos/00_BM/bambi/docs/notebooks/beta_regression.ipynb#ch0000045?line=0'>1</a> ppc= model_lag.predict(lag_fitted, kind="pps")
----> <a href='vscode-notebook-cell:/home/osvaldo/proyectos/00_BM/bambi/docs/notebooks/beta_regression.ipynb#ch0000045?line=1'>2</a> az.plot_ppc(lag_fitted)

File ~/proyectos/00_BM/arviz/arviz/plots/ppcplot.py:356, in plot_ppc(data, kind, alpha, mean, observed, color, colors, grid, figsize, textsize, data_pairs, var_names, filter_vars, coords, flatten, flatten_pp, num_pp_samples, random_seed, jitter, animated, animation_kwargs, legend, labeller, ax, backend, backend_kwargs, group, show)
354 # TODO: Add backend kwargs
355 plot = get_plotting_function("plot_ppc", "ppcplot", backend)
--> 356 axes = plot(**ppcplot_kwargs)
357 return axes

File ~/proyectos/00_BM/arviz/arviz/plots/backends/matplotlib/ppcplot.py:163, in plot_ppc(ax, length_plotters, rows, cols, figsize, animated, obs_plotters, pp_plotters, predictive_dataset, pp_sample_ix, kind, alpha, colors, textsize, mean, observed, jitter, total_pp_samples, legend, labeller, group, animation_kwargs, num_pp_samples, backend_kwargs, show)
161 vals = np.array([vals]).flatten()
162 if dtype == "f":
--> 163     pp_x, pp_density = kde(vals)
164     pp_densities.append(pp_density)
165     pp_xs.append(pp_x)

File ~/proyectos/00_BM/arviz/arviz/stats/density_utils.py:502, in kde(x, circular, **kwargs)
499 else:
500     kde_fun = _kde_linear
--> 502 return kde_fun(x, **kwargs)

File ~/proyectos/00_BM/arviz/arviz/stats/density_utils.py:588, in _kde_linear(x, bw, adaptive, extend, bound_correction, extend_fct, bw_fct, bw_return, custom_lims, cumulative, grid_len, **kwargs)
585 grid_counts, _, grid_edges = histogram(x, grid_len, (grid_min, grid_max))
587 # Bandwidth estimation
--> 588 bw = bw_fct * _get_bw(x, bw, grid_counts, x_std, x_range)
590 # Density estimation

File ~/proyectos/00_BM/arviz/arviz/stats/density_utils.py:157, in _get_bw(x, bw, grid_counts, x_std, x_range)
150         raise ValueError(
151             "Unrecognized bandwidth method.\n"
152             f"Input is: {bw_lower}.\n"
153             f"Expected one of: {list(_BW_METHODS_LINEAR)}."
154         )
156     bw_fun = _BW_METHODS_LINEAR[bw_lower]
--> 157     bw = bw_fun(x, grid_counts=grid_counts, x_std=x_std, x_range=x_range)
158 else:
159     raise ValueError(
160         "Unrecognized bw argument.\n"
161         "Expected a positive numeric or one of the following strings:\n"
162         f"{list(_BW_METHODS_LINEAR)}."
163     )

File ~/proyectos/00_BM/arviz/arviz/stats/density_utils.py:80, in _bw_experimental(x, grid_counts, x_std, x_range)
78 def _bw_experimental(x, grid_counts=None, x_std=None, x_range=None):
79     """Experimental bandwidth estimator."""
---> 80     bw_silverman = _bw_silverman(x, x_std=x_std)
81     bw_isj = _bw_isj(x, grid_counts=grid_counts, x_range=x_range)
82     return 0.5 * (bw_silverman + bw_isj)

File ~/proyectos/00_BM/arviz/arviz/stats/density_utils.py:29, in _bw_silverman(x, x_std, **kwargs)
27 if x_std is None:
28     x_std = np.std(x)
---> 29 q75, q25 = np.percentile(x, [75, 25])
30 x_iqr = q75 - q25
31 a = min(x_std, x_iqr / 1.34)

File <__array_function__ internals>:5, in percentile(*args, **kwargs)

File ~/anaconda3/lib/python3.9/site-packages/numpy/lib/function_base.py:3867, in percentile(a, q, axis, out, overwrite_input, interpolation, keepdims)
3865 if not _quantile_is_valid(q):
3866     raise ValueError("Percentiles must be in the range [0, 100]")
-> 3867 return _quantile_unchecked(
3868     a, q, axis, out, overwrite_input, interpolation, keepdims)

File ~/anaconda3/lib/python3.9/site-packages/numpy/lib/function_base.py:3986, in _quantile_unchecked(a, q, axis, out, overwrite_input, interpolation, keepdims)
3983 def _quantile_unchecked(a, q, axis=None, out=None, overwrite_input=False,
3984                         interpolation='linear', keepdims=False):
3985     """Assumes that q is in [0, 1], and is an ndarray"""
-> 3986     r, k = _ureduce(a, func=_quantile_ureduce_func, q=q, axis=axis, out=out,
3987                     overwrite_input=overwrite_input,
3988                     interpolation=interpolation)
3989     if keepdims:
3990         return r.reshape(q.shape + k)

File ~/anaconda3/lib/python3.9/site-packages/numpy/lib/function_base.py:3564, in _ureduce(a, func, **kwargs)
3561 else:
3562     keepdim = (1,) * a.ndim
-> 3564 r = func(a, **kwargs)
3565 return r, keepdim

File ~/anaconda3/lib/python3.9/site-packages/numpy/lib/function_base.py:4107, in _quantile_ureduce_func(***failed resolving arguments***)
4104     n = np.array(False, dtype=bool)
4106 weights_shape = indices.shape + (1,) * (ap.ndim - 1)
-> 4107 weights_above = not_scalar(indices - indices_below).reshape(weights_shape)
4109 x_below = take(ap, indices_below, axis=0)
4110 x_above = take(ap, indices_above, axis=0)

KeyboardInterrupt: The biggest question this helps us understand is, for each point of batting average in the previous year, how much better do we expect a player to be in the current year?

[ ]:

mean_effect = lag_fitted.posterior.batting_avg_shift.mean().item()
hdi = az.hdi(lag_fitted.posterior.batting_avg_shift, hdi_prob=.95)

lower = expit(hdi.batting_avg_shift).item()
upper = expit(hdi.batting_avg_shift).item()
print(f"Mean effect: {expit(mean_effect):.4f}")
print(f"95% interval {lower:.4f} - {upper:.4f}")

Mean effect: 0.7954
95% interval 0.6970 - 0.8701

[ ]:

az.plot_hdi(df_shift.batting_avg_shift, lag_fitted.posterior_predictive.batting_avg, hdi_prob=0.95, color="goldenrod", fill_kwargs={"alpha":0.8})
az.plot_hdi(df_shift.batting_avg_shift, lag_fitted.posterior_predictive.batting_avg, hdi_prob=.68, color="forestgreen", fill_kwargs={"alpha":0.8})

intercept = lag_fitted.posterior.Intercept.values.mean()
x = np.linspace(df_shift.batting_avg_shift.min(), df_shift.batting_avg_shift.max(),100)
linear = mean_effect * x + intercept
plt.plot(x, expit(linear), c="black")
plt.xlabel("Previous Year's Batting Average")
plt.ylabel("Batting Average"); [ ]:

%load_ext watermark
%watermark -n -u -v -iv -w

Last updated: Mon Jun 20 2022

Python implementation: CPython
Python version       : 3.9.7
IPython version      : 8.3.0

matplotlib: 3.5.1
pandas    : 1.4.2
numpy     : 1.21.5
arviz     : 0.13.0.dev0
scipy     : 1.7.3
bambi     : 0.9.0
sys       : 3.9.7 (default, Sep 16 2021, 13:09:58)
[GCC 7.5.0]

Watermark: 2.3.0