# Categorical Regression#

In this example, we will use the categorical family to model outcomes with more than two categories. The examples in this notebook were constructed by Tomás Capretto, and assembled into this example by Tyler James Burch (@tjburch on GitHub).

[1]:

import arviz as az
import bambi as bmb
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

from matplotlib.lines import Line2D

[2]:

SEED = 1234
az.style.use("arviz-darkgrid")


When modeling binary outcomes with Bambi, the Bernoulli family is used. The multivariate generalization of the Bernoulli family is the Categorical family, and with it, we can model an arbitrary number of outcome categories.

## Example with toy data#

To start, we will create a toy dataset with three classes.

[3]:

rng = np.random.default_rng(SEED)
x = np.hstack([rng.normal(m, s, size=50) for m, s in zip([-2.5, 0, 2.5], [1.2, 0.5, 1.2])])
y = np.array(["A"] * 50 + ["B"] * 50 + ["C"] * 50)

colors = ["C0"] * 50 + ["C1"] * 50 + ["C2"] * 50
plt.scatter(x, np.random.uniform(size=150), color=colors)
plt.xlabel("x")
plt.ylabel("y");


Here we have 3 classes, generated from three normal distributions: $$N(-2.5, 1.2)$$, $$N(0, 0.5)$$, and $$N(2.5, 1.2)$$. Creating a model to fit these distributions,

[4]:

data = pd.DataFrame({"y": y, "x": x})
model = bmb.Model("y ~ x", data, family="categorical")
idata = model.fit()

/home/osvaldo/anaconda3/lib/python3.9/site-packages/aesara/tensor/nnet/basic.py:1116: FutureWarning: Softmax now accepts an axis argument. For backwards-compatibility it defaults to -1 when not specified, but in the future the default will be None.
To suppress this warning specify axis explicitly.
warnings.warn(
Auto-assigning NUTS sampler...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [Intercept, x]

100.00% [8000/8000 00:03<00:00 Sampling 4 chains, 0 divergences]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 4 seconds.


Note that we pass the family="categorical" argument to Bambi’s Model method in order to call the categorical family. Here, the response variable are strings (“A”, “B”, “C”), however they can also be pd.Categorical objects.

Next we will use posterior predictions to visualize the mean class probability across the $$x$$ spectrum.

[5]:

x_new = np.linspace(-5, 5, num=200)
model.predict(idata, data=pd.DataFrame({"x": x_new}))
p = idata.posterior["y_mean"].sel(draw=slice(0, None, 10))

[6]:

x_new = np.linspace(-5, 5, num=200)
model.predict(idata, data=pd.DataFrame({"x": x_new}))
p = idata.posterior["y_mean"].sel(draw=slice(0, None, 10))

for j, g in enumerate("ABC"):
plt.plot(x_new, p.sel({"y_mean_dim":g}).stack(samples=("chain", "draw")), color=f"C{j}", alpha=0.2)

plt.xlabel("x")
plt.ylabel("y");


Here, we can notice that the probability phases between classes from left to right. At all points across $$x$$, sum of the class probabilities is 1, since in our generative model, it must be one of these three outcomes.

## The iris dataset#

Next, we will look at the classic “iris” dataset, which contains samples from 3 different species of iris plants. Using properties of the plant, we will try to model its species.

[7]:

iris = sns.load_dataset("iris")

[7]:

sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa

The dataset includes four different properties of the plants: it’s sepal length, sepal width, petal length, and petal width. There are 3 different class possibilities: setosa, versicolor, and virginica.

[8]:

sns.pairplot(iris, hue="species");

/home/osvaldo/anaconda3/lib/python3.9/site-packages/seaborn/axisgrid.py:88: UserWarning: This figure was using constrained_layout, but that is incompatible with subplots_adjust and/or tight_layout; disabling constrained_layout.
self._figure.tight_layout(*args, **kwargs)


We can see the three species have several distinct characteristics, which our linear model can capture to distinguish between them.

[9]:

model = bmb.Model(
"species ~ sepal_length + sepal_width + petal_length + petal_width",
iris,
family="categorical",
)
idata = model.fit()
az.summary(idata)

/home/osvaldo/anaconda3/lib/python3.9/site-packages/aesara/tensor/nnet/basic.py:1116: FutureWarning: Softmax now accepts an axis argument. For backwards-compatibility it defaults to -1 when not specified, but in the future the default will be None.
To suppress this warning specify axis explicitly.
warnings.warn(
Auto-assigning NUTS sampler...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [Intercept, sepal_length, sepal_width, petal_length, petal_width]

100.00% [8000/8000 00:15<00:00 Sampling 4 chains, 0 divergences]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 15 seconds.

[9]:

mean sd hdi_3% hdi_97% mcse_mean mcse_sd ess_bulk ess_tail r_hat
Intercept[versicolor] -6.766 7.751 -21.300 7.183 0.175 0.124 1953.0 2268.0 1.0
Intercept[virginica] -22.362 9.368 -40.455 -5.504 0.185 0.131 2558.0 2649.0 1.0
sepal_length[versicolor] 3.086 1.667 -0.146 6.067 0.047 0.033 1266.0 1994.0 1.0
sepal_length[virginica] 2.281 1.723 -1.142 5.403 0.048 0.034 1277.0 1888.0 1.0
sepal_width[versicolor] -4.694 1.952 -8.115 -0.783 0.049 0.035 1581.0 2104.0 1.0
sepal_width[virginica] -6.650 2.324 -11.190 -2.558 0.058 0.041 1601.0 1817.0 1.0
petal_length[versicolor] 1.066 0.909 -0.617 2.726 0.022 0.016 1697.0 2085.0 1.0
petal_length[virginica] 4.017 1.028 2.057 5.918 0.022 0.016 2178.0 2363.0 1.0
petal_width[versicolor] 1.942 2.106 -2.140 5.742 0.052 0.037 1651.0 1808.0 1.0
petal_width[virginica] 9.057 2.341 4.864 13.731 0.059 0.042 1559.0 1616.0 1.0
[10]:

az.plot_trace(idata);


We can see that this has fit quite nicely. You’ll notice there are $$n-1$$ parameters to fit, where $$n$$ is the number of categories. In the minimal binary case, recall there’s only one parameter set, since it models probability $$p$$ of being in a class, and probability $$1-p$$ of being in the other class. Using the categorical distribution, this extends, so we have $$p_1$$ for class 1, $$p_2$$ for class 2, and $$1-(p_1+p_2)$$ for the final class.

## Using numerical and categorical predictors#

Next we will look at an example from chapter 8 of Alan Agresti’s Categorical Data Analysis, looking at the primary food choice for 64 alligators caught in Lake George, Florida. We will use their length (a continuous variable) and sex (a categorical variable) as predictors to model their food choice.

First, reproducing the dataset,

[11]:

length = [
1.3, 1.32, 1.32, 1.4, 1.42, 1.42, 1.47, 1.47, 1.5, 1.52, 1.63, 1.65, 1.65, 1.65, 1.65,
1.68, 1.7, 1.73, 1.78, 1.78, 1.8, 1.85, 1.93, 1.93, 1.98, 2.03, 2.03, 2.31, 2.36, 2.46,
3.25, 3.28, 3.33, 3.56, 3.58, 3.66, 3.68, 3.71, 3.89, 1.24, 1.3, 1.45, 1.45, 1.55, 1.6,
1.6, 1.65, 1.78, 1.78, 1.8, 1.88, 2.16, 2.26, 2.31, 2.36, 2.39, 2.41, 2.44, 2.56, 2.67,
2.72, 2.79, 2.84
]
choice = [
"I", "F", "F", "F", "I", "F", "I", "F", "I", "I", "I", "O", "O", "I", "F", "F",
"I", "O", "F", "O", "F", "F", "I", "F", "I", "F", "F", "F", "F", "F", "O", "O",
"F", "F", "F", "F", "O", "F", "F", "I", "I", "I", "O", "I", "I", "I", "F", "I",
"O", "I", "I", "F", "F", "F", "F", "F", "F", "F", "O", "F", "I", "F", "F"
]

sex = ["Male"] * 32 + ["Female"] * 31
data = pd.DataFrame({"choice": choice, "length": length, "sex": sex})
data["choice"]  = pd.Categorical(
data["choice"].map({"I": "Invertebrates", "F": "Fish", "O": "Other"}),
["Other", "Invertebrates", "Fish"],
ordered=True
)

[11]:

choice length sex
0 Invertebrates 1.30 Male
1 Fish 1.32 Male
2 Fish 1.32 Male

Next, constructing the model,

[12]:

model = bmb.Model("choice ~ length + sex", data, family="categorical")
idata = model.fit()

/home/osvaldo/anaconda3/lib/python3.9/site-packages/aesara/tensor/nnet/basic.py:1116: FutureWarning: Softmax now accepts an axis argument. For backwards-compatibility it defaults to -1 when not specified, but in the future the default will be None.
To suppress this warning specify axis explicitly.
warnings.warn(
Auto-assigning NUTS sampler...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [Intercept, length, sex]

100.00% [8000/8000 00:04<00:00 Sampling 4 chains, 0 divergences]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 5 seconds.


We can then look at how the food choices vary by length for both male and female alligators.

[13]:

new_length = np.linspace(1, 4)
new_data = pd.DataFrame({"length": np.tile(new_length, 2), "sex": ["Male"] * 50 + ["Female"] * 50})
model.predict(idata, data=new_data)
p = idata.posterior["choice_mean"]

fig, axes = plt.subplots(1, 2, figsize=(12, 5))
choices = ["Other", "Invertebrates", "Fish"]

for j, choice in enumerate(choices):
males = p.sel({"choice_mean_dim":choice, "choice_obs":slice(0, 49)})
females = p.sel({"choice_mean_dim":choice, "choice_obs":slice(50, 100)})
axes[0].plot(new_length, males.mean(("chain", "draw")), color=f"C{j}", lw=2)
axes[1].plot(new_length, females.mean(("chain", "draw")), color=f"C{j}", lw=2)
az.plot_hdi(new_length, males, color=f"C{j}", ax=axes[0])
az.plot_hdi(new_length, females, color=f"C{j}", ax=axes[1])

axes[0].set_title("Male")
axes[1].set_title("Female")

handles = [Line2D([], [], color=f"C{j}", label=choice) for j, choice in enumerate(choices)]

fig.legend(
handles,
choices,
loc="center right",
ncol=3,
bbox_to_anchor=(0.99, 0.95),
bbox_transform=fig.transFigure
);

/tmp/ipykernel_28477/2509692565.py:21: UserWarning: This figure was using constrained_layout, but that is incompatible with subplots_adjust and/or tight_layout; disabling constrained_layout.


Here we can see that the larger male and female alligators are, the less of a taste they have for invertebrates, and far prefer fish. Additionally, males seem to have a higher propensity to consume “other” foods compared to females at any size. Of note, the posterior means predicted by Bambi contain information about all $$n$$ categories (despite having only $$n-1$$ coefficients), so we can directly construct this plot, rather than manually calculating $$1-(p_1+p_2)$$ for the third class.

Last, we can make a posterior predictive plot,

[14]:

model.predict(idata, kind="pps")

ax = az.plot_ppc(idata)
ax.set_xticks([0.5, 1.5, 2.5])
ax.set_xticklabels(model.response.levels)
ax.set_xlabel("Choice");
ax.set_ylabel("Probability");


which depicts posterior predicted probability for each possible food choice for an alligator, which reinforces fish being the most likely food choice, followed by invertebrates.

### References#

Agresti, A. (2013) Categorical Data Analysis. 3rd Edition, John Wiley & Sons Inc., Hoboken.

[15]:

%load_ext watermark
%watermark -n -u -v -iv -w

Last updated: Mon Jun 20 2022

Python implementation: CPython
Python version       : 3.9.7
IPython version      : 8.3.0

seaborn   : 0.11.2
pandas    : 1.4.2
bambi     : 0.9.0
arviz     : 0.13.0.dev0
numpy     : 1.21.5
matplotlib: 3.5.1
sys       : 3.9.7 (default, Sep 16 2021, 13:09:58)
[GCC 7.5.0]

Watermark: 2.3.0