```
import arviz as az
import bambi as bmb
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy import stats
from scipy.special import expit
```

# Beta Regression

This example has been contributed by Tyler James Burch (@tjburch on GitHub).

`"arviz-darkgrid") az.style.use(`

In this example, we’ll look at using the **Beta distribution** for regression models. The Beta distribution is a probability distribution bounded on the interval [0, 1], which makes it well-suited to model probabilities or proportions. In fact, in much of the Bayesian literature, the Beta distribution is introduced as a prior distribution for the probability \(p\) parameter of the Binomial distribution (in fact, it’s the conjugate prior for the Binomial distribution).

## Simulated Beta Distribution

To start getting an intuitive sense of the Beta distribution, we’ll model coin flipping probabilities. Say we grab all the coins out of our pocket, we might have some fresh from the mint, but we might also have some old ones. Due to the variation, some may be slightly biased toward heads or tails, and our goal is to model distribution of the probabilities of flipping heads for the coins in our pocket.

Since we trust the mint, we’ll say the \(\alpha\) and \(\beta\) are both large, we’ll use 1,000 for each, which gives a distribution spanning from 0.45 to 0.55.

```
= 1_000
alpha = 1_000
beta = np.random.beta(alpha, beta, size=10_000)
p
az.plot_kde(p)"$p$"); plt.xlabel(
```

Next, we’ll use Bambi to try to recover the parameters of the Beta distribution. Since we have no predictors, we can do a intercept-only model to try to recover them.

```
= pd.DataFrame({"probabilities": p})
data = bmb.Model("probabilities ~ 1", data, family="beta")
model = model.fit() fitted
```

```
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [kappa, Intercept]
```

```
Sampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 3 seconds.
We recommend running at least 4 chains for robust computation of convergence diagnostics
```

`; az.plot_trace(fitted)`

` az.summary(fitted)`

mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat | |
---|---|---|---|---|---|---|---|---|---|

Intercept | -0.00 | 0.000 | -0.001 | 0.001 | 0.000 | 0.000 | 2071.0 | 1229.0 | 1.0 |

kappa | 1986.48 | 28.041 | 1930.335 | 2035.428 | 0.588 | 0.416 | 2268.0 | 1656.0 | 1.0 |

The model fit, but clearly these parameters are not the ones that we used above. For Beta regression, we use a linear model for the mean, so we use the \(\mu\) and \(\sigma\) formulation. To link the two, we use

\(\alpha = \mu \kappa\)

\(\beta = (1-\mu)\kappa\)

and \(\kappa\) is a function of the mean and variance,

\(\kappa = \frac{\mu(1-\mu)}{\sigma^2} - 1\)

Rather than \(\sigma\), you’ll note Bambi returns \(\kappa\). We’ll define a function to retrieve our original parameters.

```
def mukappa_to_alphabeta(mu, kappa):
# Calculate alpha and beta
= mu * kappa
alpha = (1 - mu) * kappa
beta
# Get mean values and 95% HDIs
= alpha.mean(("chain", "draw")).item()
alpha_mean = az.hdi(alpha, hdi_prob=.95)["x"].values
alpha_hdi = beta.mean(("chain", "draw")).item()
beta_mean = az.hdi(beta, hdi_prob=.95)["x"].values
beta_hdi
return alpha_mean, alpha_hdi, beta_mean, beta_hdi
= mukappa_to_alphabeta(
alpha, alpha_hdi, beta, beta_hdi "Intercept"]),
expit(fitted.posterior["kappa"]
fitted.posterior[
)
print(f"Alpha - mean: {np.round(alpha)}, 95% HDI: {np.round(alpha_hdi[0])} - {np.round(alpha_hdi[1])}")
print(f"Beta - mean: {np.round(beta)}, 95% HDI: {np.round(beta_hdi[0])} - {np.round(beta_hdi[1])}")
```

```
Alpha - mean: 993.0, 95% HDI: 967.0 - 1022.0
Beta - mean: 993.0, 95% HDI: 967.0 - 1022.0
```

We’ve managed to recover our parameters with an intercept-only model.

## Beta Regression with Predictors

Perhaps we have a little more information on the coins in our pocket. We notice that the coins have accumulated dirt on either side, which would shift the probability of getting a tails or heads. In reality, we would not know how much the dirt affects the probability distribution, we would like to recover that parameter. We’ll construct this toy example by saying that each micron of dirt shifts the \(\alpha\) parameter by 5.0. Further, the amount of dirt is distributed according to a Half Normal distribution with a standard deviation of 25 per side.

We’ll start by looking at the difference in probability for a coin with a lot of dirt on either side.

```
= 5.0
effect_per_micron
# Clean Coin
= 1_000
alpha = 1_000
beta = np.random.beta(alpha, beta, size=10_000)
p
# Add two std to tails side (heads more likely)
= np.random.beta(alpha + 50 * effect_per_micron, beta, size=10_000)
p_heads # Add two std to heads side (tails more likely)
= np.random.beta(alpha - 50 * effect_per_micron, beta, size=10_000)
p_tails
="Clean Coin")
az.plot_kde(p, label="Biased toward heads", plot_kwargs={"color":"C1"})
az.plot_kde(p_heads, label="Biased toward tails", plot_kwargs={"color":"C2"})
az.plot_kde(p_tails, label"$p$")
plt.xlabel(=plt.ylim()[1]*1.25); plt.ylim(top
```

Next, we’ll generate a toy dataset according to our specifications above. As an added foil, we will also assume that we’re limited in our measuring equipment, that we can only measure correctly to the nearest integer micron.

```
# Create amount of dirt on top and bottom
= stats.halfnorm(loc=0, scale=25).rvs(size=1_000)
heads_bias_dirt = stats.halfnorm(loc=0, scale=25).rvs(size=1_000)
tails_bias_dirt
# Create the probability per coin
= np.repeat(1_000, 1_000)
alpha = alpha + effect_per_micron * heads_bias_dirt - effect_per_micron * tails_bias_dirt
alpha = np.repeat(1_000, 1_000)
beta
= np.random.beta(alpha, beta)
p
= pd.DataFrame({
df "p" : p,
"heads_bias_dirt" : heads_bias_dirt.round(),
"tails_bias_dirt" : tails_bias_dirt.round()
}) df.head()
```

p | heads_bias_dirt | tails_bias_dirt | |
---|---|---|---|

0 | 0.500848 | 8.0 | 1.0 |

1 | 0.492925 | 4.0 | 14.0 |

2 | 0.469713 | 14.0 | 44.0 |

3 | 0.474854 | 9.0 | 32.0 |

4 | 0.526697 | 23.0 | 1.0 |

Taking a look at our new dataset:

```
= plt.subplots(1,3, figsize=(16,5))
fig,ax
"p"].plot.kde(ax=ax[0])
df[0].set_xlabel("$p$")
ax[
"heads_bias_dirt"].plot.hist(ax=ax[1], bins=np.arange(0,df["heads_bias_dirt"].max()))
df[1].set_xlabel("Measured Dirt Biasing Toward Heads ($\mu m$)")
ax["tails_bias_dirt"].plot.hist(ax=ax[2], bins=np.arange(0,df["tails_bias_dirt"].max()))
df[2].set_xlabel("Measured Dirt Biasing Toward Tails ($\mu m$)"); ax[
```

Next we want to make a model to recover the effect per micron of dirt per side. So far, we’ve considered the biasing toward one side or another independently. A linear model might look something like this:

\[ \begin{aligned} p &\sim \text{Beta}(\mu, \sigma) \\ \text{logit}(\mu) &= \text{ Normal}( \alpha + \beta_h d_h + \beta_t d_t) \end{aligned} \]

Where \(d_h\) and \(d_t\) are the measured dirt (in microns) biasing the probability toward heads and tails respectively, \(\beta_h\) and \(\beta_t\) are coefficients for how much a micron of dirt affects each independent side, and \(\alpha\) is the intercept. Also note the logit link function used here, since our outcome is on the scale of 0-1, it makes sense that the link must also put our mean on that scale. Logit is the default link function, however Bambi supports the identity, probit, and cloglog links as well.

In this toy example, we’ve constructed it such that dirt should not affect one side differently from another, so we can wrap those into one coefficient: \(\beta = \beta_h = -\beta_t\). This makes the last line of the model:

\[ \text{logit}(\mu) = \text{ Normal}( \alpha + \beta \Delta d) \]

where \(\Delta d = d_h - d_t\).

Putting that into our dataset, then constructing this model in Bambi,

```
"delta_d"] = df["heads_bias_dirt"] - df["tails_bias_dirt"]
df[= bmb.Model("p ~ delta_d", df, family="beta")
dirt_model = dirt_model.fit()
dirt_fitted ="response") dirt_model.predict(dirt_fitted, kind
```

```
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [kappa, Intercept, delta_d]
```

```
Sampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 4 seconds.
We recommend running at least 4 chains for robust computation of convergence diagnostics
```

` az.summary(dirt_fitted)`

mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat | |
---|---|---|---|---|---|---|---|---|---|

Intercept | -0.006 | 0.002 | -0.009 | -0.003 | 0.000 | 0.000 | 3212.0 | 1464.0 | 1.01 |

delta_d | 0.005 | 0.000 | 0.005 | 0.005 | 0.000 | 0.000 | 2652.0 | 1269.0 | 1.00 |

kappa | 1803.768 | 79.957 | 1644.702 | 1949.191 | 1.446 | 1.022 | 3094.0 | 1373.0 | 1.00 |

mu[0] | 0.508 | 0.000 | 0.507 | 0.508 | 0.000 | 0.000 | 3137.0 | 1430.0 | 1.00 |

mu[1] | 0.486 | 0.000 | 0.485 | 0.486 | 0.000 | 0.000 | 3172.0 | 1700.0 | 1.00 |

... | ... | ... | ... | ... | ... | ... | ... | ... | ... |

mu[995] | 0.484 | 0.000 | 0.483 | 0.485 | 0.000 | 0.000 | 3149.0 | 1728.0 | 1.00 |

mu[996] | 0.489 | 0.000 | 0.489 | 0.490 | 0.000 | 0.000 | 3252.0 | 1624.0 | 1.00 |

mu[997] | 0.531 | 0.001 | 0.530 | 0.532 | 0.000 | 0.000 | 3007.0 | 1627.0 | 1.00 |

mu[998] | 0.546 | 0.001 | 0.545 | 0.548 | 0.000 | 0.000 | 2921.0 | 1584.0 | 1.00 |

mu[999] | 0.528 | 0.001 | 0.527 | 0.529 | 0.000 | 0.000 | 3033.0 | 1627.0 | 1.00 |

1003 rows × 9 columns

`; az.plot_ppc(dirt_fitted)`

Next, we’ll see if we can in fact recover the effect on \(\alpha\). Remember that in order to return to \(\alpha\), \(\beta\) space, the linear equation passes through an inverse logit transformation, so we must apply this to the coefficient on \(\Delta d\) to get the effect on \(\alpha\). The inverse logit is nicely defined in `scipy.special`

as `expit`

.

```
= expit(dirt_fitted.posterior.delta_d.mean())
mean_effect = az.hdi(dirt_fitted.posterior.delta_d, hdi_prob=.95)
hdi = expit(hdi.delta_d[0])
lower = expit(hdi.delta_d[1])
upper print(f"Mean effect: {mean_effect.item():.4f}")
print(f"95% interval {lower.item():.4f} - {upper.item():.4f}")
```

```
Mean effect: 0.5013
95% interval 0.5013 - 0.5013
```

The recovered effect is very close to the true effect of 0.5.

## Example - Revisiting Baseball Data

In the Hierarchical Logistic regression with Binomial family notebook, we modeled baseball batting averages (times a player reached first via a hit per times at bat) via a Hierarchical Logisitic regression model. If we’re interested in league-wide effects, we could look at a Beta regression. We work off the assumption that the league-wide batting average follows a Beta distribution, and that individual player’s batting averages are samples from that distribution.

First, load the Batting dataset again, and re-calculate batting average as hits/at-bat. In order to make sure that we have a sufficient sample, we’ll require at least 100 at-bats in order consider a batter. Last, we’ll focus on 1990-2018.

`= bmb.load_data("batting") batting `

```
"batting_avg"] = batting["H"] / batting["AB"]
batting[= batting[batting["AB"] > 100]
batting = batting[ (batting["yearID"] > 1990) & (batting["yearID"] < 2018) ] df
```

```
=30)
df.batting_avg.hist(bins"Batting Average")
plt.xlabel("Count"); plt.ylabel(
```

If we’re interested in modeling the distribution of batting averages, we can start with an intercept-only model.

```
= bmb.Model("batting_avg ~ 1", df, family="beta")
model_avg = model_avg.fit() avg_fitted
```

```
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [kappa, Intercept]
```

```
Sampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 3 seconds.
We recommend running at least 4 chains for robust computation of convergence diagnostics
```

` az.summary(avg_fitted)`

mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat | |
---|---|---|---|---|---|---|---|---|---|

Intercept | -1.038 | 0.002 | -1.041 | -1.035 | 0.000 | 0.000 | 1903.0 | 1395.0 | 1.0 |

kappa | 152.503 | 2.062 | 148.731 | 156.325 | 0.046 | 0.033 | 2012.0 | 1219.0 | 1.0 |

Looking at the posterior predictive,

`= model_avg.predict(avg_fitted, kind="response") posterior_predictive `

`; az.plot_ppc(avg_fitted)`

This appears to fit reasonably well. If, for example, we were interested in simulating players, we could sample from this distribution.

However, we can take this further. Say we’re interested in understanding how this distribution shifts if we know a player’s batting average in a previous year. We can condition the model on a player’s n-1 year, and use Beta regrssion to model that. Let’s construct that variable, and for sake of ease, we will ignore players without previous seasons.

```
# Add the player's batting average in the n-1 year
"batting_avg_shift"] = np.where(
batting["playerID"] == batting["playerID"].shift(),
batting["batting_avg"].shift(),
batting[
np.nan
)= batting[ (batting["yearID"] > 1990) & (batting["yearID"] < 2018) ]
df_shift = df_shift[~df_shift["batting_avg_shift"].isna()]
df_shift "batting_avg_shift","batting_avg"]].corr() df_shift[[
```

batting_avg_shift | batting_avg | |
---|---|---|

batting_avg_shift | 1.000000 | 0.229774 |

batting_avg | 0.229774 | 1.000000 |

There is a lot of variance in year-to-year batting averages, it’s not known to be incredibly predictive, and we see that here. A correlation coefficient of 0.23 is only lightly predictive. However, we can still use it in our model to get a better understanding. We’ll fit two models. First, we’ll refit the previous, intercept-only, model using this updated dataset so we have an apples-to-apples comparison. Then, we’ll fit a model using the previous year’s batting average as a predictor.

Notice we need to explicitly ask for the inclusion of the log-likelihood values into the inference data object.

```
= bmb.Model("batting_avg ~ 1", df_shift, family="beta")
model_avg = model_avg.fit(idata_kwargs={"log_likelihood": True})
avg_fitted
= bmb.Model("batting_avg ~ batting_avg_shift", df_shift, family="beta")
model_lag = model_lag.fit(idata_kwargs={"log_likelihood": True}) lag_fitted
```

```
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [kappa, Intercept]
```

```
Sampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 2 seconds.
We recommend running at least 4 chains for robust computation of convergence diagnostics
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [kappa, Intercept, batting_avg_shift]
```

```
Sampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 3 seconds.
We recommend running at least 4 chains for robust computation of convergence diagnostics
```

` az.summary(lag_fitted)`

mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat | |
---|---|---|---|---|---|---|---|---|---|

Intercept | -1.377 | 0.073 | -1.521 | -1.250 | 0.001 | 0.001 | 2593.0 | 1287.0 | 1.0 |

batting_avg_shift | 1.359 | 0.278 | 0.873 | 1.887 | 0.005 | 0.004 | 2592.0 | 1365.0 | 1.0 |

kappa | 135.954 | 8.722 | 119.751 | 151.394 | 0.169 | 0.121 | 2658.0 | 1685.0 | 1.0 |

```
az.compare({"intercept-only" : avg_fitted,
"lag-model": lag_fitted
})
```

rank | elpd_loo | p_loo | elpd_diff | weight | se | dse | warning | scale | |
---|---|---|---|---|---|---|---|---|---|

lag-model | 0 | 785.126819 | 2.909173 | 0.000000 | 0.99836 | 14.540186 | 0.000000 | False | log |

intercept-only | 1 | 774.161032 | 2.067113 | 10.965787 | 0.00164 | 15.314248 | 4.651392 | False | log |

Adding the predictor results in a higher loo than the intercept-only model.

```
= model_lag.predict(lag_fitted, kind="response")
ppc; az.plot_ppc(lag_fitted)
```

The biggest question this helps us understand is, for each point of batting average in the previous year, how much better do we expect a player to be in the current year?

```
= lag_fitted.posterior.batting_avg_shift.mean().item()
mean_effect = az.hdi(lag_fitted.posterior.batting_avg_shift, hdi_prob=.95)
hdi
= expit(hdi.batting_avg_shift[0]).item()
lower = expit(hdi.batting_avg_shift[1]).item()
upper print(f"Mean effect: {expit(mean_effect):.4f}")
print(f"95% interval {lower:.4f} - {upper:.4f}")
```

```
Mean effect: 0.7956
95% interval 0.6944 - 0.8679
```

```
=0.95, color="goldenrod", fill_kwargs={"alpha":0.8})
az.plot_hdi(df_shift.batting_avg_shift, lag_fitted.posterior_predictive.batting_avg, hdi_prob=.68, color="forestgreen", fill_kwargs={"alpha":0.8})
az.plot_hdi(df_shift.batting_avg_shift, lag_fitted.posterior_predictive.batting_avg, hdi_prob
= lag_fitted.posterior.Intercept.values.mean()
intercept = np.linspace(df_shift.batting_avg_shift.min(), df_shift.batting_avg_shift.max(),100)
x = mean_effect * x + intercept
linear ="black")
plt.plot(x, expit(linear), c"Previous Year's Batting Average")
plt.xlabel("Batting Average"); plt.ylabel(
```

```
%load_ext watermark
%watermark -n -u -v -iv -w
```

```
Last updated: Sat May 25 2024
Python implementation: CPython
Python version : 3.11.9
IPython version : 8.24.0
pandas : 2.2.2
matplotlib: 3.8.4
numpy : 1.26.4
scipy : 1.13.1
arviz : 0.18.0
bambi : 0.13.1.dev37+g2a54df76.d20240525
Watermark: 2.4.3
```