import arviz as az
import bambi as bmb
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.lines import Line2D
from matplotlib.patches import Patch
Hierarchical Logistic regression with Binomial family
This notebook shows how to build a hierarchical logistic regression model with the Binomial family in Bambi.
This example is based on the Hierarchical baseball article in Bayesian Analysis Recipes, a collection of articles by Eric Ma on how to do Bayesian data analysis with PyMC3.
Problem description
Extracted from the original work:
Baseball players have many metrics measured for them. Let’s say we are on a baseball team, and would like to quantify player performance, one metric being their batting average (defined by how many times a batter hit a pitched ball, divided by the number of times they were up for batting (“at bat”)). How would you go about this task?
Load libraries and data
random_seed = 1234
We first need some measurements of batting data. Today we’re going to use data from the Baseball Databank. It is a compilation of historical baseball data in a convenient, tidy format, distributed under Open Data terms.
This repository contains several datasets in the form of .csv files. This example uses the Batting.csv file, which can be loaded directly with Bambi.
df = bmb.load_data("batting")

# Then clean some of the data
df["AB"] = df["AB"].replace(0, np.nan)
df = df.dropna()
df["batting_avg"] = df["H"] / df["AB"]
df = df[df["yearID"] >= 2016]
df = df.iloc[0:15]
df.head(5)
| | playerID | yearID | stint | teamID | lgID | G | AB | R | H | 2B | ... | SB | CS | BB | SO | IBB | HBP | SH | SF | GIDP | batting_avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 101348 | abadfe01 | 2016 | 1 | MIN | AL | 39 | 1.0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 |
| 101350 | abreujo02 | 2016 | 1 | CHA | AL | 159 | 624.0 | 67 | 183 | 32 | ... | 0.0 | 2.0 | 47 | 125.0 | 7.0 | 15.0 | 0.0 | 9.0 | 21.0 | 0.293269 |
| 101352 | ackledu01 | 2016 | 1 | NYA | AL | 28 | 61.0 | 6 | 9 | 0 | ... | 0.0 | 0.0 | 8 | 9.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.147541 |
| 101353 | adamecr01 | 2016 | 1 | COL | NL | 121 | 225.0 | 25 | 49 | 7 | ... | 2.0 | 3.0 | 24 | 47.0 | 0.0 | 4.0 | 3.0 | 0.0 | 5.0 | 0.217778 |
| 101355 | adamsma01 | 2016 | 1 | SLN | NL | 118 | 297.0 | 37 | 74 | 18 | ... | 0.0 | 1.0 | 25 | 81.0 | 1.0 | 2.0 | 0.0 | 3.0 | 5.0 | 0.249158 |
5 rows × 23 columns
From all the columns above, we’re going to use the following:

- playerID: Unique identification for the player.
- AB: Number of times the player was up for batting.
- H: Number of times the player hit the ball while batting.
- batting_avg: Simply the ratio between H and AB.
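If you want a compact view while following along, you can subset the data frame to just these columns (an optional aside; the rest of the example keeps using the full df):

```python
# Optional: look only at the columns used in this example.
df[["playerID", "AB", "H", "batting_avg"]].head()
```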
Explore our data
It’s always good to explore the data before starting to write down our models. This helps us understand the distribution of the variables and their relationships, and even anticipate problems that may occur during the sampling process.
The following graph summarizes the percentage of hits, as well as the number of times the players were up for batting and the number of times they hit the ball.
= "#2a5674"
BLUE = "#b13f64" RED
= plt.subplots(figsize=(10, 6))
_, ax
# Customize x limits.
# This adds space on the left side to indicate percentage of hits.
-120, 320)
ax.set_xlim(
# Add dots for the the number of hits and the times at bat
"H"], list(range(15)), s=140, color=RED, zorder=10)
ax.scatter(df["AB"], list(range(15)), s=140, color=BLUE, zorder=10)
ax.scatter(df[
# Also a line connecting them
list(range(15)), df["H"], df["AB"], color="#b3b3b3", lw=4)
ax.hlines(
="--", lw=1.4, color="#a3a3a3")
ax.axvline(lslist(range(15)), -110, -50, lw=6, color="#b3b3b3", capstyle="round")
ax.hlines(60 * df["batting_avg"] - 110, list(range(15)), s=28, color=RED, zorder=10)
ax.scatter(
# Add the percentage of hits
for j in range(15):
= f"{round(df['batting_avg'].iloc[j] * 100)}%"
text -12, j, text, ha="right", va="center", fontsize=14, color="#333")
ax.text(
# Customize tick positions and labels
list(range(15)))
ax.yaxis.set_ticks("playerID"])
ax.yaxis.set_ticklabels(df[range(0, 400, 100))
ax.xaxis.set_ticks(
# Create handles for the legend (just dots and labels)
= [
handles
Line2D(0], [0], label="Hits", marker="o", color="None", markeredgewidth=0,
[=RED, markersize=13
markerfacecolor
),
Line2D(0], [0], label="At Bat", marker="o", color="None", markeredgewidth=0,
[=BLUE, markersize=12
markerfacecolor
)
]
# Add legend on top-right corner
= ax.legend(
legend =handles,
handles=1,
loc=14,
fontsize=0.4,
handletextpad=True
frameon
)
# Finally add labels and a title
"Count", fontsize=14)
ax.set_xlabel("Player", fontsize=14)
ax.set_ylabel("How often do batters hit the ball?", fontsize=20); ax.set_title(
The first thing one can see is that the number of times players were up for batting varies quite a lot. Some players batted only a handful of times, while others batted hundreds of times. We can also note that the percentage of hits usually falls between 12% and 29%.
There are two players, alberma01 and abadfe01, who had only one chance to bat. The former hit the ball, while the latter missed. That’s why alberma01 has a 100% hit percentage, while abadfe01 has 0%. There’s another player, aguilje01, who has a success record of 0% because he missed the few opportunities he had to bat. These extreme situations, where the empirical estimate lies on the boundary of the parameter space, are associated with estimation problems when using a maximum-likelihood approach. They can also impact the sampling process, especially when using wide priors.
As a final note, abreujo02 was up for batting 624 times, so the blue dot representing this number falls outside the plot limits.
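We can double-check these extreme cases directly in the data (a quick look using only columns already present in df; the player IDs are the ones mentioned above):

```python
# Players whose empirical batting average sits on the boundary (0% or 100%).
extreme = ["abadfe01", "alberma01", "aguilje01"]
df.loc[df["playerID"].isin(extreme), ["playerID", "H", "AB", "batting_avg"]]
```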
Non-hierarchical model
Let’s get started with a simple cell-means logistic regression for \(p_i\), the probability of hitting the ball for player \(i\):
\[ \begin{array}{lr} \displaystyle \text{logit}(p_i) = \beta_i & \text{with } i = 0, \cdots, 14 \end{array} \]
Where
\[ \beta_i \sim \text{Normal}(0, \ \sigma_{\beta}), \]
\(\sigma_{\beta}\) is a common constant for all the players, and \(\text{logit}(p_i) = \log\left(\frac{p_i}{1 - p_i}\right)\).
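To get a feel for this scale, we can map a few values of \(\beta_i\) back to probabilities with the inverse logit. A minimal sketch, using scipy.special.expit (SciPy is not imported above, but it ships with the PyMC stack):

```python
from scipy.special import expit  # inverse logit: expit(x) = 1 / (1 + exp(-x))

# beta = 0 corresponds to p = 0.5; beta = +/-2 already spans p ~ 0.12 to 0.88.
for beta in (-2, -1, 0, 1, 2):
    print(f"beta = {beta:>2} -> p = {expit(beta):.3f}")
```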
Specifying this model is quite simple in Bambi thanks to its formula interface.
First of all, note this is a Binomial family and the response involves both the number of hits (H) and the number of times at bat (AB). We use the p(x, n) function for the response term. This just tells Bambi we want to model the proportion resulting from dividing x by n.

The right-hand side of the formula is "0 + playerID". This means the model includes a coefficient for each player ID, but does not include a global intercept.

Finally, using the Binomial family is as easy as passing family="binomial". By default, the link function for this family is link="logit", so there’s nothing to change there.
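To see what the "0 + playerID" encoding amounts to, we can mimic it with pandas. This is just an illustration of the design matrix, not how Bambi builds it internally (Bambi relies on the formulae package):

```python
import pandas as pd

# One indicator column per player and no intercept column,
# so each coefficient is that player's log-odds of hitting.
pd.get_dummies(df["playerID"]).astype(int).head()
```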
= bmb.Model("p(H, AB) ~ 0 + playerID", df, family="binomial")
model_non_hierarchical model_non_hierarchical
Formula: p(H, AB) ~ 0 + playerID
Family: binomial
Link: p = logit
Observations: 15
Priors:
target = p
Common-level effects
playerID ~ Normal(mu: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.], sigma: [1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1. 1.])
idata_non_hierarchical = model_non_hierarchical.fit(random_seed=random_seed)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [playerID]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 5 seconds.
Next we observe the posterior of the coefficient for each player. The compact=False argument means we want a separate panel for each player.

az.plot_trace(idata_non_hierarchical, compact=False, backend_kwargs={"layout": "constrained"});
So far so good! The traceplots indicate the sampler worked well.
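We can back up the visual check with numerical diagnostics, for example (standard ArviZ functionality, not specific to this example):

```python
# R-hat values near 1 and large effective sample sizes
# confirm what the trace plots suggest.
az.summary(idata_non_hierarchical, kind="diagnostics")
```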
Now, let’s set this posterior aside for later use and fit the hierarchical version.
Hierarchical model
This model incorporates a group-specific intercept for each player:
\[ \begin{array}{lr} \displaystyle \text{logit}(p_i) = \alpha + \gamma_i & \text{with } i = 0, \cdots, 14 \end{array} \]
where
\[ \begin{array}{c} \alpha \sim \text{Normal}(0, \ \sigma_{\alpha}) \\ \gamma_i \sim \text{Normal}(0, \ \sigma_{\gamma}) \\ \sigma_{\gamma} \sim \text{HalfNormal}(\tau_{\gamma}) \end{array} \]
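One way to internalize this generative story is to simulate it with NumPy. A rough sketch, where the hyperparameter values (\(\sigma_{\alpha} = 1\), \(\tau_{\gamma} = 1\)) are placeholders chosen for illustration, not the ones Bambi assigns:

```python
from scipy.special import expit

rng = np.random.default_rng(1234)

sigma_alpha, tau_gamma = 1.0, 1.0            # placeholder hyperparameters
alpha = rng.normal(0, sigma_alpha)           # common intercept
sigma_gamma = abs(rng.normal(0, tau_gamma))  # a HalfNormal draw
gamma = rng.normal(0, sigma_gamma, size=15)  # one deviation per player
p = expit(alpha + gamma)                     # implied probabilities of hitting
p.round(2)  # one simulated set of hitting probabilities for the 15 players
```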
The group-specific terms are indicated with the | operator in the formula. In this case, since there is an intercept for each player, we write 1|playerID.
= bmb.Model("p(H, AB) ~ 1 + (1|playerID)", df, family="binomial")
model_hierarchical model_hierarchical
Formula: p(H, AB) ~ 1 + (1|playerID)
Family: binomial
Link: p = logit
Observations: 15
Priors:
target = p
Common-level effects
Intercept ~ Normal(mu: 0.0, sigma: 1.5)
Group-level effects
1|playerID ~ Normal(mu: 0.0, sigma: HalfNormal(sigma: 2.5))
idata_hierarchical = model_hierarchical.fit(random_seed=random_seed)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [Intercept, 1|playerID_sigma, 1|playerID_offset]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 14 seconds.
There were 6 divergences after tuning. Increase `target_accept` or reparameterize.
Sometimes, there can be several divergences when fitting a hierarchical model. What can we do in that case?
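First, note that divergences are recorded in the sample_stats group of the InferenceData object, so we can count them directly:

```python
# Total number of divergent transitions across all chains.
int(idata_hierarchical.sample_stats["diverging"].sum())
```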
One thing we could try is increasing target_accept. But if there are many divergences, that suggests a problem with the underlying model. Let’s take a look at the prior predictive distribution to see whether our priors are too informative or too wide.

The Model instance has a method called prior_predictive() that generates samples from the prior predictive distribution. It returns an InferenceData object that contains the values of the prior predictive distribution.
idata_prior = model_hierarchical.prior_predictive()
prior = az.extract(idata_prior, group="prior_predictive")["p(H, AB)"]
Sampling: [1|playerID_offset, 1|playerID_sigma, Intercept, p(H, AB)]
If we inspect the DataArray, we see there are 500 draws (sample) for each of the 15 players (__obs__).
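For instance (sizes is a standard xarray attribute showing the dimension lengths):

```python
# Expect something like {'__obs__': 15, 'sample': 500}.
prior.sizes
```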
Let’s plot these distributions together with the observed proportion of hits for every player here.
# We define this function because this plot is going to be repeated below.
def plot_prior_predictive(df, prior):
    AB = df["AB"].values
    H = df["H"].values

    fig, axes = plt.subplots(5, 3, figsize=(10, 6), sharex="col")

    for idx, ax in enumerate(axes.ravel()):
        pps = prior.sel({"__obs__": idx})
        ab = AB[idx]
        h = H[idx]
        hist = ax.hist(pps / ab, bins=25, color="#a3a3a3")
        ax.axvline(h / ab, color=RED, lw=2)
        ax.set_yticks([])
        ax.tick_params(labelsize=12)

    fig.subplots_adjust(left=0.025, right=0.975, hspace=0.05, wspace=0.05, bottom=0.125)
    fig.legend(
        handles=[Line2D([0], [0], label="Observed proportion", color=RED, linewidth=2)],
        handlelength=1.5,
        handletextpad=0.8,
        borderaxespad=0,
        frameon=True,
        fontsize=11,
        bbox_to_anchor=(0.975, 0.92),
        loc="right"
    )
    fig.text(0.5, 0.05, "Prior probability of hitting", fontsize=15, ha="center", va="baseline")
plot_prior_predictive(df, prior)
Indeed, the priors are too wide! Let’s use tighter priors and see the result.
priors = {
    "Intercept": bmb.Prior("Normal", mu=0, sigma=1),
    "1|playerID": bmb.Prior("Normal", mu=0, sigma=bmb.Prior("HalfNormal", sigma=1))
}
model_hierarchical = bmb.Model("p(H, AB) ~ 1 + (1|playerID)", df, family="binomial", priors=priors)
model_hierarchical
Formula: p(H, AB) ~ 1 + (1|playerID)
Family: binomial
Link: p = logit
Observations: 15
Priors:
target = p
Common-level effects
Intercept ~ Normal(mu: 0.0, sigma: 1.0)
Group-level effects
1|playerID ~ Normal(mu: 0.0, sigma: HalfNormal(sigma: 1.0))
Now let’s check the prior predictive distribution for these new priors.
model_hierarchical.build()
idata_prior = model_hierarchical.prior_predictive()
prior = az.extract(idata_prior, group="prior_predictive")["p(H, AB)"]
plot_prior_predictive(df, prior)
Sampling: [1|playerID_offset, 1|playerID_sigma, Intercept, p(H, AB)]
Definitely, it looks much better. Now the priors tend to have a symmetric shape with a mode at 0.5, with substantial probability over the whole domain.
idata_hierarchical = model_hierarchical.fit(random_seed=random_seed)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [Intercept, 1|playerID_sigma, 1|playerID_offset]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 13 seconds.
There were 3 divergences after tuning. Increase `target_accept` or reparameterize.
Let’s try increasing target_accept and the number of tune samples.
idata_hierarchical = model_hierarchical.fit(tune=2000, draws=2000, target_accept=0.95, random_seed=random_seed)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [Intercept, 1|playerID_sigma, 1|playerID_offset]
Sampling 4 chains for 2_000 tune and 2_000 draw iterations (8_000 + 8_000 draws total) took 24 seconds.
= ["Intercept", "1|playerID", "1|playerID_sigma"]
var_names =var_names, compact=False, backend_kwargs={"layout": "constrained"}); az.plot_trace(idata_hierarchical, var_names
Let’s jump to the next section, where we plot and compare the probability of hitting for the players under both models.
Compare predictions
Now we’re going to plot the distribution of the probability of hitting for each player, using both models.
But before doing that, we need to obtain the posterior on that scale. We could manually take the posterior of the coefficients, compute the linear predictor, and transform it to the probability scale. But that’s a lot of work!
Fortunately, Bambi models have a method called .predict() that we can use to predict on the probability scale. By default, it modifies in place the InferenceData object we pass to it. The posterior samples can then be found in the variable p.
model_non_hierarchical.predict(idata_non_hierarchical)
model_hierarchical.predict(idata_hierarchical)
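The draws of p now live in the posterior group of each InferenceData object. As a quick numeric companion to the plot below, we can compute the posterior mean probability per player (a small sketch using xarray reductions):

```python
# Posterior mean probability of hitting per player, for each model.
p_non_hier = idata_non_hierarchical.posterior["p"].mean(dim=("chain", "draw"))
p_hier = idata_hierarchical.posterior["p"].mean(dim=("chain", "draw"))
```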
Let’s create a forest plot using the posteriors obtained with both models so we can compare them easily.
_, ax = plt.subplots(figsize=(8, 8))

# Add vertical line for the global probability of hitting
ax.axvline(x=(df["H"] / df["AB"]).mean(), ls="--", color="black", alpha=0.5)

# Create forestplot with ArviZ, only for the mean.
az.plot_forest(
    [idata_non_hierarchical, idata_hierarchical],
    var_names="p",
    combined=True,
    colors=["#666666", RED],
    linewidth=2.6,
    markersize=8,
    ax=ax
)

# Create custom y axis tick labels
ylabels = [f"H: {round(h)}, AB: {round(ab)}" for h, ab in zip(df["H"].values, df["AB"].values)]
ylabels = list(reversed(ylabels))

# Put the labels for the y axis in the middle of the original location of the tick marks.
ax.set_yticklabels(ylabels, ha="right")

# Create legend
handles = [
    Patch(label="Non-hierarchical", facecolor="#666666"),
    Patch(label="Hierarchical", facecolor=RED),
    Line2D([0], [0], label="Mean probability", ls="--", color="black", alpha=0.5)
]

legend = ax.legend(handles=handles, loc=4, fontsize=14, frameon=True, framealpha=0.8);
One of the first things one can see is that not only the center of each distribution varies, but also its dispersion. Very wide posteriors are associated with players who batted only once or a few times, while tighter posteriors correspond to players who batted many times.

Players with extreme empirical proportions have similarly extreme posteriors under the non-hierarchical model. Under the hierarchical model, however, these distributions are shrunk towards the global mean, and extreme values become very unlikely.
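We can also quantify this shrinkage with the posterior means computed earlier (p_non_hier and p_hier from the sketch above):

```python
# Extreme empirical proportions stay extreme in the non-hierarchical model
# but move toward the global mean in the hierarchical one.
for j in range(15):
    print(
        f"{df['playerID'].iloc[j]}: "
        f"empirical={df['batting_avg'].iloc[j]:.3f}, "
        f"non-hierarchical={float(p_non_hier[j]):.3f}, "
        f"hierarchical={float(p_hier[j]):.3f}"
    )
```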
And finally, paraphrasing Eric, there’s nothing inherently right or wrong about shrinkage and hierarchical models. Whether this is reasonable or not depends on our prior knowledge about the problem. And to me, after having seen the hit rates of the other players, it is much more reasonable to shrink extreme posteriors based on very few data points towards the global mean rather than just let them concentrate around 0 or 1.
%load_ext watermark
%watermark -n -u -v -iv -w
Last updated: Thu Aug 15 2024
Python implementation: CPython
Python version : 3.11.9
IPython version : 8.24.0
bambi : 0.14.1.dev12+g64e57423.d20240730
numpy : 1.26.4
matplotlib: 3.8.4
arviz : 0.18.0
Watermark: 2.4.3
Footnotes
- By default, the .predict() method obtains the posterior for the mean of the likelihood distribution. This mean would be \(np\) for the Binomial family. However, since \(n\) varies from observation to observation, it returns the value of \(p\), as if it were a Bernoulli family.