# Beta Regression

This example has been contributed by Tyler James Burch (@tjburch on GitHub).

:

import arviz as az
import bambi as bmb
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc3 as pm
from scipy import stats
from scipy.special import expit

:

az.style.use("arviz-darkgrid")


In this example, we’ll look at using the Beta distribution for regression models. The Beta distribution is a probability distribution bounded on the interval [0, 1], which makes it well-suited to model probabilities or proportions. In much of the Bayesian literature, the Beta distribution is introduced as a prior distribution for the probability parameter $$p$$ of the Binomial distribution; in fact, it is the conjugate prior for the Binomial.
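As a quick aside illustrating that conjugacy (a sketch, not part of the original example; the prior and data below are made up): if $$p$$ has a Beta($$\alpha$$, $$\beta$$) prior and we observe $$h$$ heads in $$n$$ flips, the posterior is simply Beta($$\alpha + h$$, $$\beta + n - h$$), no sampling required.

```python
from scipy import stats

# Weak prior on the heads probability: Beta(2, 2), centered on 0.5
alpha_prior, beta_prior = 2, 2

# Hypothetical data: 7 heads in 10 flips
heads, flips = 7, 10

# Conjugate update: add heads to alpha, tails to beta
alpha_post = alpha_prior + heads
beta_post = beta_prior + (flips - heads)

posterior = stats.beta(alpha_post, beta_post)
print(f"Posterior mean of p: {posterior.mean():.3f}")  # (2 + 7) / (2 + 2 + 10) = 0.643
```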

## Simulated Beta Distribution

To start getting an intuitive sense of the Beta distribution, we’ll model coin flipping probabilities. Say we grab all the coins out of our pocket, we might have some fresh from the mint, but we might also have some old ones. Due to the variation, some may be slightly biased toward heads or tails, and our goal is to model distribution of the probabilities of flipping heads for the coins in our pocket.

Since we trust the mint, we’ll say $$\alpha$$ and $$\beta$$ are both large; we’ll use 1,000 for each, which gives a distribution concentrated roughly between 0.45 and 0.55.

:

alpha = 1_000
beta = 1_000
p = np.random.beta(alpha, beta, size=10_000)
az.plot_kde(p)
plt.xlabel("$p$");

Next, we’ll use Bambi to try to recover the parameters of the Beta distribution. Since we have no predictors, we can fit an intercept-only model to try to recover them.

:

data = pd.DataFrame({"probabilities": p})
model = bmb.Model("probabilities ~ 1", data, family="beta")
fitted = model.fit()

Auto-assigning NUTS sampler...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [probabilities_kappa, Intercept]

100.00% [4000/4000 00:08<00:00 Sampling 2 chains, 0 divergences]
Sampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 17 seconds.

:

az.plot_trace(fitted);

:

az.summary(fitted)

:

|  | mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
|---|---|---|---|---|---|---|---|---|---|
| Intercept | 0.001 | 0.000 | -0.000 | 0.002 | 0.000 | 0.00 | 1698.0 | 1420.0 | 1.0 |
| probabilities_kappa | 1981.761 | 27.811 | 1929.028 | 2032.422 | 0.693 | 0.49 | 1610.0 | 1409.0 | 1.0 |

The model fit, but clearly these are not the parameters we used above. For Beta regression, we use a linear model for the mean, so Bambi works with the $$\mu$$ and $$\sigma$$ formulation. To link the two parameterizations, we use

$$\alpha = \mu \kappa$$

$$\beta = (1-\mu)\kappa$$

and $$\kappa$$ is a function of the mean and variance,

$$\kappa = \frac{\mu(1-\mu)}{\sigma^2} - 1$$

Rather than $$\sigma$$, you’ll note Bambi returns $$\kappa$$. We’ll define a function to retrieve our original parameters.
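As a quick numerical sanity check on these relationships (a sketch, not part of the original notebook): adding the two equations above gives $$\alpha + \beta = \kappa$$, so for our simulated coins with $$\alpha = \beta = 1{,}000$$ we expect $$\mu = 0.5$$ and $$\kappa = 2{,}000$$, which lines up with the fitted probabilities_kappa of roughly 1,982.

```python
alpha_true, beta_true = 1_000, 1_000

# Mean of a Beta(alpha, beta), and the implied kappa = alpha + beta
mu = alpha_true / (alpha_true + beta_true)
kappa = alpha_true + beta_true

# Round-trip back through the linking equations
alpha_back = mu * kappa
beta_back = (1 - mu) * kappa
print(alpha_back, beta_back)  # 1000.0 1000.0

# kappa recovered from the mean and variance of a Beta distribution
var = (alpha_true * beta_true) / ((alpha_true + beta_true) ** 2 * (alpha_true + beta_true + 1))
print(round(mu * (1 - mu) / var - 1))  # 2000
```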

:

def mukappa_to_alphabeta(mu, kappa):
    # Calculate alpha and beta
    alpha = mu * kappa
    beta = (1 - mu) * kappa

    # Get mean values and 95% HDIs
    alpha_mean = alpha.mean(["chain", "draw"]).values
    alpha_hdi = az.hdi(alpha, hdi_prob=.95)["x"].values
    beta_mean = beta.mean(["chain", "draw"]).values
    beta_hdi = az.hdi(beta, hdi_prob=.95)["x"].values

    return alpha_mean, alpha_hdi, beta_mean, beta_hdi

alpha, alpha_hdi, beta, beta_hdi = mukappa_to_alphabeta(
    expit(fitted.posterior["Intercept"]),
    fitted.posterior["probabilities_kappa"]
)

print(f"Alpha - mean: {np.round(alpha)}, 95% HDI: {np.round(alpha_hdi[0])} - {np.round(alpha_hdi[1])}")
print(f"Beta - mean: {np.round(beta)}, 95% HDI: {np.round(beta_hdi[0])} - {np.round(beta_hdi[1])}")

Alpha - mean: 991.0, 95% HDI: 964.0 - 1018.0
Beta - mean: 991.0, 95% HDI: 963.0 - 1016.0


We’ve managed to recover our parameters with an intercept-only model.

## Beta Regression with Predictors

Perhaps we have a little more information on the coins in our pocket. We notice that the coins have accumulated dirt on either side, which would shift the probability of getting tails or heads. In reality, we would not know how much the dirt affects the probability distribution; that is the parameter we would like to recover. We’ll construct this toy example by saying that each micron of dirt shifts the $$\alpha$$ parameter by 5.0. Further, the amount of dirt on each side is distributed according to a Half Normal distribution with a standard deviation of 25.

We’ll start by looking at the difference in probability for a coin with a lot of dirt on either side.

:

effect_per_micron = 5.0

# Clean Coin
alpha = 1_000
beta = 1_000
p = np.random.beta(alpha, beta, size=10_000)

p_heads = np.random.beta(alpha + 50 * effect_per_micron, beta, size=10_000)
p_tails = np.random.beta(alpha - 50 * effect_per_micron, beta, size=10_000)

az.plot_kde(p, label="Clean Coin")
az.plot_kde(p_heads, label="Biased toward heads", plot_kwargs={"color":"C1"})
az.plot_kde(p_tails, label="Biased toward tails", plot_kwargs={"color":"C2"})
plt.xlabel("$p$")
plt.ylim(top=plt.ylim()[1] * 1.25);

Next, we’ll generate a toy dataset according to our specifications above. As an added foil, we will also assume that our measuring equipment is limited: we can only measure correctly to the nearest integer micron.

:

# Create amount of dirt on top and bottom
heads_bias_dirt = stats.halfnorm(loc=0, scale=25).rvs(size=1_000)
tails_bias_dirt = stats.halfnorm(loc=0, scale=25).rvs(size=1_000)

# Create the probability per coin
alpha = np.repeat(1_000, 1_000)
alpha = alpha + effect_per_micron * heads_bias_dirt - effect_per_micron * tails_bias_dirt
beta = np.repeat(1_000, 1_000)

p = np.random.beta(alpha, beta)

df = pd.DataFrame({
    "p": p,
    "heads_bias_dirt": heads_bias_dirt.round(),
    "tails_bias_dirt": tails_bias_dirt.round()
})
df.head()

:

|  | p | heads_bias_dirt | tails_bias_dirt |
|---|---|---|---|
| 0 | 0.458738 | 1.0 | 27.0 |
| 1 | 0.508124 | 8.0 | 8.0 |
| 2 | 0.517902 | 69.0 | 42.0 |
| 3 | 0.514295 | 24.0 | 16.0 |
| 4 | 0.402952 | 6.0 | 59.0 |

Taking a look at our new dataset:

:

fig, ax = plt.subplots(1, 3, figsize=(16, 5))

df["p"].plot.kde(ax=ax[0])
ax[0].set_xlabel("$p$")

df["heads_bias_dirt"].plot.hist(ax=ax[1], bins=np.arange(0, df["heads_bias_dirt"].max()))
ax[1].set_xlabel("Measured Dirt Biasing Toward Heads ($\mu m$)")

df["tails_bias_dirt"].plot.hist(ax=ax[2], bins=np.arange(0, df["tails_bias_dirt"].max()))
ax[2].set_xlabel("Measured Dirt Biasing Toward Tails ($\mu m$)");

Next we want to make a model to recover the effect per micron of dirt per side. So far, we’ve considered the biasing toward one side or the other independently. A linear model might look something like this:

$$p \sim \text{Beta}(\mu, \sigma)$$

$$\text{logit}(\mu) = \alpha + \beta_h d_h + \beta_t d_t$$

Where $$d_h$$ and $$d_t$$ are the measured dirt (in microns) biasing the probability toward heads and tails respectively, $$\beta_h$$ and $$\beta_t$$ are coefficients for how much a micron of dirt affects each side, and $$\alpha$$ is the intercept. Also note the logit link function used here: since our outcome is on the scale of 0 to 1, the link must map the linear predictor onto that scale. Logit is the default link function; Bambi supports the identity, probit, and cloglog links as well.
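To make the link function concrete (a small sketch, not part of the original notebook): logit maps a mean in (0, 1) onto the whole real line where the linear predictor lives, and expit, its inverse, maps any linear-predictor value back into (0, 1).

```python
import numpy as np
from scipy.special import expit, logit

# A mean of 0.5 sits at 0 on the linear-predictor scale
eta = logit(0.5)
print(eta)         # 0.0
print(expit(eta))  # 0.5

# Any real-valued linear predictor maps back into (0, 1)
etas = np.array([-5.0, 0.0, 5.0])
print(expit(etas))  # approximately [0.0067, 0.5, 0.9933]
```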

In this toy example, we’ve constructed it such that dirt should not affect one side differently from another, so we can wrap those into one coefficient: $$\beta = \beta_h = -\beta_t$$. This makes the last line of the model:

$$\text{logit}(\mu) = \alpha + \beta \Delta d$$

where

$$\Delta d = d_h - d_t$$

Putting that into our dataset, then constructing this model in Bambi,

:

df["delta_d"] = df["heads_bias_dirt"] - df["tails_bias_dirt"]
dirt_model = bmb.Model("p ~ delta_d", df, family="beta")
dirt_fitted = dirt_model.fit()
dirt_model.predict(dirt_fitted, kind="pps", draws=1000)

Auto-assigning NUTS sampler...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [p_kappa, delta_d, Intercept]

100.00% [4000/4000 00:18<00:00 Sampling 2 chains, 0 divergences]
Sampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 31 seconds.

:

az.summary(dirt_fitted)

:

|  | mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
|---|---|---|---|---|---|---|---|---|---|
| Intercept | -0.006 | 0.002 | -0.009 | -0.003 | 0.000 | 0.000 | 2794.0 | 1210.0 | 1.01 |
| delta_d | 0.005 | 0.000 | 0.005 | 0.005 | 0.000 | 0.000 | 2627.0 | 1576.0 | 1.00 |
| p_kappa | 1711.008 | 77.400 | 1576.636 | 1857.040 | 1.408 | 0.996 | 3014.0 | 1531.0 | 1.00 |
:

az.plot_ppc(dirt_fitted);

Next, we’ll see if we can in fact recover the effect on $$\alpha$$. Remember that to return to $$\alpha$$, $$\beta$$ space, the linear predictor passes through an inverse logit transformation, so we must apply this to the coefficient on $$\Delta d$$ to get the effect on $$\alpha$$. The inverse logit is conveniently available in scipy.special as expit.

:

mean_effect = expit(dirt_fitted.posterior.delta_d.mean())
hdi = az.hdi(dirt_fitted.posterior.delta_d, hdi_prob=.95)
lower = expit(hdi.delta_d.sel(hdi="lower"))
upper = expit(hdi.delta_d.sel(hdi="higher"))
print(f"Mean effect: {mean_effect.item():.4f}")
print(f"95% interval {lower.item():.4f} - {upper.item():.4f}")

Mean effect: 0.5013
95% interval 0.5012 - 0.5013


The recovered effect is very close to the true effect of 0.5.

## Example - Revisiting Baseball Data

In the Hierarchical Logistic regression with Binomial family notebook, we modeled baseball batting averages (times a player reached first via a hit per times at bat) via a Hierarchical Logistic regression model. If we’re interested in league-wide effects, we could look at a Beta regression. We work off the assumption that the league-wide batting average follows a Beta distribution, and that individual players’ batting averages are samples from that distribution.
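One quick, non-Bayesian way to check that kind of assumption (a sketch on synthetic data; the Beta parameters below are made up, not estimated from the real Batting dataset): match the sample mean and variance to Beta parameters by the method of moments, using the same $$\mu$$, $$\kappa$$ relationships from earlier.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "batting averages" drawn from a Beta with made-up parameters
a_true, b_true = 26, 74  # mean = 0.26, roughly batting-average-like
samples = rng.beta(a_true, b_true, size=5_000)

# Method of moments: kappa = mu*(1-mu)/var - 1, then alpha = mu*kappa, beta = (1-mu)*kappa
mu, var = samples.mean(), samples.var()
kappa = mu * (1 - mu) / var - 1
a_hat, b_hat = mu * kappa, (1 - mu) * kappa
print(a_hat, b_hat)  # estimates should land near 26 and 74
```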

First, load the Batting dataset again, and re-calculate batting average as hits per at-bat. To make sure we have a sufficient sample, we’ll require more than 100 at-bats to consider a batter. Last, we’ll focus on 1990-2018.

:

batting = bmb.load_data("batting")

:

batting["batting_avg"] = batting["H"] / batting["AB"]
batting = batting[batting["AB"] > 100]
df = batting[ (batting["yearID"] > 1990) & (batting["yearID"] < 2018) ]

:

df.batting_avg.hist(bins=30)
plt.xlabel("Batting Average")
plt.ylabel("Count");

If we’re interested in modeling the distribution of batting averages, we can start with an intercept-only model.

:

model_avg = bmb.Model("batting_avg ~ 1", df, family="beta")
avg_fitted = model_avg.fit()

Auto-assigning NUTS sampler...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [batting_avg_kappa, Intercept]

100.00% [4000/4000 00:09<00:00 Sampling 2 chains, 0 divergences]
Sampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 16 seconds.

:

az.summary(avg_fitted)

:

|  | mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
|---|---|---|---|---|---|---|---|---|---|
| Intercept | -1.038 | 0.002 | -1.041 | -1.035 | 0.000 | 0.000 | 1591.0 | 1469.0 | 1.0 |
| batting_avg_kappa | 152.584 | 1.985 | 148.766 | 156.143 | 0.045 | 0.032 | 1948.0 | 1417.0 | 1.0 |

Looking at the posterior predictive,

:

posterior_predictive = model_avg.predict(avg_fitted, kind="pps", draws=1000)

:

az.plot_ppc(avg_fitted);

This appears to fit reasonably well. If, for example, we were interested in simulating players, we could sample from this distribution.

However, we can take this further. Say we’re interested in understanding how this distribution shifts if we know a player’s batting average in a previous year. We can condition the model on a player’s n-1 year, and use Beta regression to model that. Let’s construct that variable; for the sake of ease, we will ignore players without previous seasons.

:

# Add the player's batting average in the n-1 year
batting["batting_avg_shift"] = np.where(
    batting["playerID"] == batting["playerID"].shift(),
    batting["batting_avg"].shift(),
    np.nan
)
df_shift = batting[(batting["yearID"] > 1990) & (batting["yearID"] < 2018)]
df_shift = df_shift[~df_shift["batting_avg_shift"].isna()]
df_shift[["batting_avg_shift", "batting_avg"]].corr()

:

|  | batting_avg_shift | batting_avg |
|---|---|---|
| batting_avg_shift | 1.000000 | 0.229774 |
| batting_avg | 0.229774 | 1.000000 |

There is a lot of variance in year-to-year batting averages; it’s not known to be incredibly predictive, and we see that here: a correlation coefficient of 0.23 is only lightly predictive. However, we can still use it in our model to get a better understanding. We’ll fit two models. First, we’ll refit the previous intercept-only model on this updated dataset so we have an apples-to-apples comparison. Then, we’ll fit a model using the previous year’s batting average as a predictor.

:

model_avg = bmb.Model("batting_avg ~ 1", df_shift, family="beta")
avg_fitted = model_avg.fit()

model_lag = bmb.Model("batting_avg ~ batting_avg_shift", df_shift, family="beta")
lag_fitted = model_lag.fit()

Auto-assigning NUTS sampler...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [batting_avg_kappa, Intercept]

100.00% [4000/4000 00:07<00:00 Sampling 2 chains, 0 divergences]
Sampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 15 seconds.
Auto-assigning NUTS sampler...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [batting_avg_kappa, batting_avg_shift, Intercept]

100.00% [4000/4000 00:11<00:00 Sampling 2 chains, 0 divergences]
Sampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 19 seconds.

:

az.summary(lag_fitted)

:

|  | mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
|---|---|---|---|---|---|---|---|---|---|
| Intercept | -1.376 | 0.075 | -1.507 | -1.215 | 0.001 | 0.001 | 2664.0 | 1330.0 | 1.0 |
| batting_avg_shift | 1.353 | 0.287 | 0.764 | 1.864 | 0.006 | 0.004 | 2699.0 | 1431.0 | 1.0 |
| batting_avg_kappa | 136.107 | 9.577 | 119.070 | 154.521 | 0.183 | 0.131 | 2774.0 | 1499.0 | 1.0 |
:

az.compare({
"intercept-only" : avg_fitted,
"lag-model": lag_fitted
})

:

|  | rank | loo | p_loo | d_loo | weight | se | dse | warning | loo_scale |
|---|---|---|---|---|---|---|---|---|---|
| lag-model | 0 | 784.847140 | 3.193623 | 0.000000 | 0.993019 | 14.578775 | 0.000000 | False | log |
| intercept-only | 1 | 774.075162 | 2.163445 | 10.771978 | 0.006981 | 15.361378 | 4.640041 | False | log |

Adding the predictor results in a higher LOO (expected log pointwise predictive density) than the intercept-only model, and the difference (d_loo ≈ 10.8) is more than twice its standard error (dse ≈ 4.6), so the lag model is preferred.

:

ppc = model_lag.predict(lag_fitted, kind="pps", draws=1000)
az.plot_ppc(lag_fitted);

The biggest question this helps us answer is: for each point of batting average in the previous year, how much better do we expect a player to be in the current year?

:

mean_effect = lag_fitted.posterior.batting_avg_shift.values.mean()
hdi = az.hdi(lag_fitted.posterior.batting_avg_shift, hdi_prob=.95)

lower = expit(hdi.batting_avg_shift.sel(hdi="lower")).item()
upper = expit(hdi.batting_avg_shift.sel(hdi="higher")).item()
print(f"Mean effect: {expit(mean_effect):.4f}")
print(f"95% interval {lower:.4f} - {upper:.4f}")

Mean effect: 0.7946
95% interval 0.6763 - 0.8681

:

az.plot_hdi(
    df_shift.batting_avg_shift,
    lag_fitted.posterior_predictive.batting_avg,
    hdi_prob=0.95,
    color="goldenrod",
    fill_kwargs={"alpha": 0.8},
)
az.plot_hdi(
    df_shift.batting_avg_shift,
    lag_fitted.posterior_predictive.batting_avg,
    hdi_prob=0.68,
    color="forestgreen",
    fill_kwargs={"alpha": 0.8},
)

intercept = lag_fitted.posterior.Intercept.values.mean()
x = np.linspace(df_shift.batting_avg_shift.min(), df_shift.batting_avg_shift.max(),100)
linear = mean_effect * x + intercept
plt.plot(x, expit(linear), c="black")
plt.xlabel("Previous Year's Batting Average")
plt.ylabel("Batting Average"); 