Ordinal Regression

import arviz as az
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D
import numpy as np
import pandas as pd
import warnings

import bambi as bmb

warnings.filterwarnings("ignore", category=FutureWarning)

In some scenarios, the response variable is discrete, like a count, and ordered. Common examples of such data come from questionnaires where the respondent is asked to rate a product, service, or experience on a scale. This scale is often referred to as a Likert scale. For example, a five-level Likert scale could be:

  1. Strongly disagree
  2. Disagree
  3. Neither agree nor disagree
  4. Agree
  5. Strongly agree

The result is a set of ordered categories where each category has an associated numeric value (1-5). However, you can’t compute a meaningful difference between the categories. Ordinal responses can also be counts for which meaningful differences do exist. For example, a restaurant can be rated on a scale of 1-5 stars, where 1 is the worst and 5 is the best. You can compute the difference between 1 and 2 stars, but such a rating is still often treated as ordinal in applied settings.

Ordinal data presents three challenges when modelling:

  1. Unlike a count, the differences between values are not necessarily equidistant or meaningful. For example, there is no guarantee that the difference between “Strongly disagree” and “Disagree” equals the difference between “Disagree” and “Neither agree nor disagree”. Or, in the case of the restaurant rating, it may be much harder for a restaurant to go from 4 to 5 stars than from 2 to 3 stars.
  2. The distribution of ordinal responses may be non-normal as the response is not continuous, particularly if larger response levels are infrequently chosen compared to lower ones.
  3. The variances of the unobserved variables that underlie the observed ordered categories may differ between categories, time points, etc.

Thus, treating ordered categories as continuous is not appropriate. To this end, Bambi supports two classes of ordinal regression models: (1) cumulative, and (2) sequential. Below, we demonstrate how to fit these two models with Bambi to address the challenges of ordered-category response data.

Cumulative model

A cumulative model assumes that the observed ordinal variable \(Y\) originates from the “categorization” of a latent continuous variable \(Z\). To model the categorization process, the model assumes that there are \(K\) thresholds (or cutpoints) \(\tau_k\) that partition \(Z\) into \(K+1\) observable, ordered categories of \(Y\). The subscript \(k\) in \(\tau_k\) is an index that associates that threshold to a particular category \(k\). For example, if the response has three categories such as “disagree”, “neither agree nor disagree”, and “agree”, then there are two thresholds \(\tau_1\) and \(\tau_2\) that partition \(Z\) into \(K+1 = 3\) categories. Additionally, if we assume \(Z\) to have a certain distribution (e.g., Normal) with a cumulative distribution function \(F\), the probability of \(Y\) being equal to category \(k\) is

\[P(Y = k) = F(\tau_k) - F(\tau_{k-1})\]

where \(F(\tau)\) is a cumulative probability. For example, suppose we are interested in the probability of each category stated above, and have two thresholds \(\tau_1 = -1, \tau_2 = 1\) for the three categories. Additionally, if we assume \(Z\) to be normally distributed with \(\sigma = 1\) and a cumulative distribution function \(\Phi\) then

\[P(Y = 1) = \Phi(\tau_1) = \Phi(-1)\]

\[P(Y = 2) = \Phi(\tau_2) - \Phi(\tau_1) = \Phi(1) - \Phi(-1)\]

\[P(Y = 3) = 1 - \Phi(\tau_2) = 1 - \Phi(1)\]
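As a quick numerical check of this example, the three probabilities can be evaluated directly. This is a minimal sketch using scipy (not imported in the setup cell above) and the threshold values stated above; it is not part of the Bambi workflow.

from scipy.stats import norm

# thresholds from the example above, assuming a standard normal latent variable Z
tau_1, tau_2 = -1.0, 1.0

p_1 = norm.cdf(tau_1)                    # P(Y = 1) = Phi(-1), approximately 0.159
p_2 = norm.cdf(tau_2) - norm.cdf(tau_1)  # P(Y = 2) = Phi(1) - Phi(-1), approximately 0.683
p_3 = 1 - norm.cdf(tau_2)                # P(Y = 3) = 1 - Phi(1), approximately 0.159

print(p_1, p_2, p_3, p_1 + p_2 + p_3)    # the three probabilities sum to 1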

But how do we set the values of the thresholds? By default, Bambi uses a Normal prior for the thresholds with a grid of evenly spaced \(\mu\) values that depends on the number of response levels. Additionally, since the thresholds need to be ordered, Bambi applies a transformation to the values such that the order is preserved. Furthermore, the model specification for ordinal regression typically transforms the cumulative probabilities using the log-cumulative-odds (logit) transformation. Therefore, the learned parameters for the thresholds \(\tau\) are logits.

Lastly, as each \(F(\tau)\) implies a cumulative probability for each category, the largest response level always has a cumulative probability of 1. Thus, we effectively do not need a parameter for it due to the law of total probability. For example, for three response values, we only need two thresholds, as two thresholds partition \(Z\) into \(K + 1 = 3\) categories.

The moral intuition dataset

To illustrate a cumulative ordinal model, we will model data from a series of experiments conducted by philosophers (this example comes from Richard McElreath’s Statistical Rethinking). The experiments aim to collect empirical evidence relevant to debates about moral intuition, the forms of reasoning through which people develop judgments about the moral goodness and badness of actions.

In the dataset there are 12 columns and 9930 rows, comprising data for 331 unique individuals. The response we are interested in, response, is an integer from 1 to 7 indicating how morally permissible the participant found the action taken (or not taken) in the story. The predictors are as follows:

  • action: a factor with levels 0 and 1 where 1 indicates that the story contained “harm caused by action is morally worse than equivalent harm caused by omission”.
  • intention: a factor with levels 0 and 1 where 1 indicates that the story contained “harm intended as the means to a goal is morally worse than equivalent harm foreseen as the side effect of a goal”.
  • contact: a factor with levels 0 and 1 where 1 indicates that the story contained “using physical contact to cause harm to a victim is morally worse than causing equivalent harm to a victim without using physical contact”.
trolly = pd.read_csv("https://raw.githubusercontent.com/rmcelreath/rethinking/master/data/Trolley.csv", sep=";")
trolly = trolly[["response", "action", "intention", "contact"]]
trolly["action"] = pd.Categorical(trolly["action"], ordered=False)
trolly["intention"] = pd.Categorical(trolly["intention"], ordered=False)
trolly["contact"] = pd.Categorical(trolly["contact"], ordered=False)
trolly["response"] = pd.Categorical(trolly["response"], ordered=True)
# 7 ordered categories from 1-7
trolly.response.unique()
[4, 3, 5, 2, 1, 7, 6]
Categories (7, int64): [1 < 2 < 3 < 4 < 5 < 6 < 7]

Intercept only model

Before we fit a model with predictors, let’s attempt to recover the parameters of an ordinal model using only the thresholds, to get a feel for the cumulative family. Traditionally in Bambi, if we wanted to recover the parameters of the likelihood, we would use an intercept-only model and write the formula as response ~ 1, where 1 indicates that the intercept should be included. However, in the case of ordinal regression, the thresholds “take the place” of the intercept. Thus, we can write the formula as response ~ 0 to indicate that we do not want an intercept. To fit a cumulative ordinal model, we pass family="cumulative". To compare against the thresholds-only model, we compute the empirical log-cumulative-odds of the categories directly from the data below and generate a bar plot of the response probabilities.

pr_k = trolly.response.value_counts().sort_index().values / trolly.shape[0]
cum_pr_k = np.cumsum(pr_k)
logit_func = lambda x: np.log(x / (1 - x))
cum_logit = logit_func(cum_pr_k)
cum_logit
/tmp/ipykernel_117147/1548491577.py:3: RuntimeWarning: invalid value encountered in log
  logit_func = lambda x: np.log(x / (1 - x))
array([-1.91609116, -1.26660559, -0.718634  ,  0.24778573,  0.88986365,
        1.76938091,         nan])
plt.figure(figsize=(7, 3))
plt.bar(np.arange(1, 8), pr_k)
plt.ylabel("Probability")
plt.xlabel("Response")
plt.title("Empirical probability of each response category");

model = bmb.Model("response ~ 0", data=trolly, family="cumulative")
idata = model.fit(random_seed=1234)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [threshold]
/home/tomas/anaconda3/envs/bambi-dev/lib/python3.11/site-packages/pytensor/compile/function/types.py:970: RuntimeWarning: invalid value encountered in add
  self.vm()


Sampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 8 seconds.
We recommend running at least 4 chains for robust computation of convergence diagnostics

Below, the components of the model are displayed. Notice how the prior means for the thresholds form a grid of six values ranging from -2 to 2.

model
       Formula: response ~ 0
        Family: cumulative
          Link: p = logit
  Observations: 9930
        Priors: 
    target = p
        
        
        Auxiliary parameters
            threshold ~ Normal(mu: [-2.  -1.2 -0.4  0.4  1.2  2. ], sigma: 1.0, transform: ordered)
------
* To see a plot of the priors call the .plot_priors() method.
* To see a summary or plot of the posterior pass the object returned by .fit() to az.summary() or az.plot_trace()
az.summary(idata)
mean sd hdi_3% hdi_97% mcse_mean mcse_sd ess_bulk ess_tail r_hat
threshold[0] -1.917 0.030 -1.969 -1.859 0.001 0.0 2293.0 1520.0 1.0
threshold[1] -1.267 0.024 -1.314 -1.223 0.000 0.0 2826.0 1286.0 1.0
threshold[2] -0.718 0.021 -0.759 -0.679 0.000 0.0 2621.0 2089.0 1.0
threshold[3] 0.248 0.020 0.211 0.286 0.000 0.0 2491.0 1849.0 1.0
threshold[4] 0.891 0.022 0.850 0.931 0.000 0.0 2334.0 1882.0 1.0
threshold[5] 1.771 0.028 1.719 1.822 0.001 0.0 2352.0 1795.0 1.0

Viewing the summary dataframe, we see a total of six threshold coefficients. Why six? Remember, we get the last parameter for free: since there are seven categories, we only need six cutpoints. The index (using zero-based indexing) of each threshold indicates the category that the threshold is associated with. Comparing to the empirical log-cumulative-odds computation above, the mean of the posterior distribution for each threshold is close to the empirical value.

As the log-cumulative-odds (logit) link is used, we need to apply the inverse of the logit function to transform back to cumulative probabilities. Below, we plot the cumulative probabilities for each category.

expit_func = lambda x: 1 / (1 + np.exp(-x))
cumprobs = expit_func(idata.posterior.threshold).mean(("chain", "draw"))
cumprobs = np.append(cumprobs, 1)

plt.figure(figsize=(7, 3))
plt.plot(sorted(trolly.response.unique()), cumprobs, marker='o')
plt.ylabel("Cumulative probability")
plt.xlabel("Response category")
plt.title("Cumulative probabilities of response categories");

fig, ax = plt.subplots(figsize=(7, 3))
for i in range(6):
    outcome = expit_func(idata.posterior.threshold).sel(threshold_dim=i).to_numpy().flatten()
    ax.hist(outcome, bins=15, alpha=0.5, label=f"Category: {i}")
ax.set_xlabel("Probability")
ax.set_ylabel("Count")
ax.set_title("Cumulative Probability by Category")
ax.legend(bbox_to_anchor=(1.04, 1), loc="upper left");

We can take successive differences of the cumulative probabilities to get the posterior probability of each category. Notice how the posterior probabilities in the bar plot below are close to the empirical probabilities in the bar plot above.

# successive differences of the cumulative probabilities give the category probabilities
ddx = np.diff(cumprobs)
probs = np.insert(ddx, 0, cumprobs[0])

plt.figure(figsize=(7, 3))
plt.bar(sorted(trolly.response.unique()), probs)
plt.ylabel("Probability")
plt.xlabel("Response category")
plt.title("Posterior Probability of each response category");

Notice, in the plots above, the jump in probability from category 3 to 4. Additionally, the estimates of the coefficients are precise for each category. Now that we have an understanding of how the cumulative link function is applied to produce ordered cumulative outcomes, we will add predictors to the model.

Adding predictors

In the cumulative model described above, predictors were intentionally left out. In this section, we describe how predictors are added to cumulative ordinal models. When adding predictor variables, what we would like is that, as a predictor increases, predictions move progressively through the categories in sequence. A linear regression is formed for \(Z\) by adding a predictor term \(\eta\)

\[\eta = \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n\]

Notice how similar this looks to an ordinary linear model. However, there is no intercept or error term. This is because the intercept is replaced by the thresholds \(\tau\) and the error term \(\epsilon\) is added separately to obtain

\[Z = \eta + \epsilon\]

Putting the predictor term together with the thresholds and cumulative distribution function, we obtain the probability of \(Y\) being equal to a category \(k\) as

\[Pr(Y = k | \eta) = F(\tau_k - \eta) - F(\tau_{k-1} - \eta)\]

The same predictor term \(\eta\) is subtracted from each threshold because if we decrease the log-cumulative-odds of every outcome value \(k\) below the maximum, this shifts probability mass upwards towards higher outcome values. Thus, positive \(\beta\) values correspond to increasing \(x\), which is associated with an increase in the mean response \(Y\). The parameters to be estimated from the model are the thresholds \(\tau\) and the predictor terms \(\eta\) coefficients.
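To make the role of \(\eta\) concrete, here is a minimal sketch showing how a single linear predictor shifts the category probabilities under the logit link. The threshold and predictor values are made up for illustration; they are not estimates from any model in this notebook.

from scipy.special import expit

# hypothetical threshold logits (six thresholds, hence seven categories) and a linear predictor
tau = np.array([-1.9, -1.3, -0.7, 0.2, 0.9, 1.8])
eta = 0.5  # e.g., beta_1 * x_1 + ... + beta_n * x_n for one observation

# cumulative probabilities F(tau_k - eta); the last category implicitly has cumulative probability 1
cum_probs = np.append(expit(tau - eta), 1.0)

# category probabilities are the successive differences of the cumulative probabilities
probs = np.diff(cum_probs, prepend=0.0)
print(probs.round(3), probs.sum())  # seven probabilities that sum to 1

Increasing eta shifts probability mass towards higher categories, which is exactly the behavior described above.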

To add predictors for ordinal models in Bambi, we continue to use the formula interface.

model = bmb.Model(
    "response ~ 0 + action + intention + contact + action:intention + contact:intention", 
    data=trolly, 
    family="cumulative"
)
idata = model.fit(random_seed=1234)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [threshold, action, intention, contact, action:intention, contact:intention]
/home/tomas/anaconda3/envs/bambi-dev/lib/python3.11/site-packages/pytensor/compile/function/types.py:970: RuntimeWarning: invalid value encountered in accumulate
  self.vm()


Sampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 93 seconds.
We recommend running at least 4 chains for robust computation of convergence diagnostics

In the summary dataframe below, we only select the predictor variables as the cutpoints are not of interest at the moment.

az.summary(
    idata, 
    var_names=["action", "intention", "contact", 
               "action:intention", "contact:intention"]
)
mean sd hdi_3% hdi_97% mcse_mean mcse_sd ess_bulk ess_tail r_hat
action[1] -0.463 0.054 -0.564 -0.360 0.001 0.001 1418.0 1691.0 1.0
intention[1] -0.274 0.058 -0.380 -0.161 0.002 0.001 1264.0 1446.0 1.0
contact[1] -0.325 0.068 -0.445 -0.189 0.002 0.001 1203.0 1310.0 1.0
action:intention[1, 1] -0.455 0.081 -0.607 -0.303 0.002 0.001 1505.0 1625.0 1.0
contact:intention[1, 1] -1.279 0.099 -1.473 -1.109 0.003 0.002 1326.0 1456.0 1.0

The posterior distributions of the slopes are all negative, indicating that each of these story features reduces the rating (the acceptability of the story). Below, a forest plot is used to make this insight clearer.

az.plot_forest(
    idata,
    combined=True,
    var_names=["action", "intention", "contact", 
               "action:intention", "contact:intention"],
    figsize=(7, 3),
    textsize=11
);

Again, we can plot the cumulative probability of each category. Compared to the same plot above, notice how most of the category probabilities have been shifted to the left. Additionally, there is more uncertainty for categories 3, 4, and 5.

fig, ax = plt.subplots(figsize=(7, 3))
for i in range(6):
    outcome = expit_func(idata.posterior.threshold).sel(threshold_dim=i).to_numpy().flatten()
    ax.hist(outcome, bins=15, alpha=0.5, label=f"Category: {i}")
ax.set_xlabel("Probability")
ax.set_ylabel("Count")
ax.set_title("Cumulative Probability by Category")
ax.legend(bbox_to_anchor=(1.04, 1), loc="upper left");

Posterior predictive distribution

To get a sense of how well the ordinal model fits the data, we can plot samples from the posterior predictive distribution. To plot the samples, a utility function is defined below to assist in the plotting of discrete values.

def adjust_lightness(color, amount=0.5):
    import matplotlib.colors as mc
    import colorsys
    try:
        c = mc.cnames[color]
    except KeyError:
        c = color
    c = colorsys.rgb_to_hls(*mc.to_rgb(c))
    return colorsys.hls_to_rgb(c[0], c[1] * amount, c[2])

def plot_ppc_discrete(idata, bins, ax):
    
    def add_discrete_bands(x, lower, upper, ax, **kwargs):
        for i, (l, u) in enumerate(zip(lower, upper)):
            s = slice(i, i + 2)
            ax.fill_between(x[s], [l, l], [u, u], **kwargs)

    var_name = list(idata.observed_data.data_vars)[0]
    y_obs = idata.observed_data[var_name].to_numpy()
    
    counts_list = []
    for draw_values in az.extract(idata, "posterior_predictive")[var_name].to_numpy().T:
        counts, _ = np.histogram(draw_values, bins=bins)
        counts_list.append(counts)
    counts_arr = np.stack(counts_list)

    qts_90 = np.quantile(counts_arr, (0.05, 0.95), axis=0)
    qts_70 = np.quantile(counts_arr, (0.15, 0.85), axis=0)
    qts_50 = np.quantile(counts_arr, (0.25, 0.75), axis=0)
    qts_30 = np.quantile(counts_arr, (0.35, 0.65), axis=0)
    median = np.quantile(counts_arr, 0.5, axis=0)

    colors = [adjust_lightness("C0", x) for x in [1.8, 1.6, 1.4, 1.2, 0.9]]

    add_discrete_bands(bins, qts_90[0], qts_90[1], ax=ax, color=colors[0])
    add_discrete_bands(bins, qts_70[0], qts_70[1], ax=ax, color=colors[1])
    add_discrete_bands(bins, qts_50[0], qts_50[1], ax=ax, color=colors[2])
    add_discrete_bands(bins, qts_30[0], qts_30[1], ax=ax, color=colors[3])

    
    ax.step(bins[:-1], median, color=colors[4], lw=2, where="post")
    ax.hist(y_obs, bins=bins, histtype="step", lw=2, color="black", align="mid")
    handles = [
        Line2D([], [], label="Observed data", color="black", lw=2),
        Line2D([], [], label="Posterior predictive median", color=colors[4], lw=2)
    ]
    ax.legend(handles=handles)
    return ax
idata_pps = model.predict(idata=idata, kind="response", inplace=False)

bins = np.arange(7)
fig, ax = plt.subplots(figsize=(7, 3))
ax = plot_ppc_discrete(idata_pps, bins, ax)
ax.set_xlabel("Response category")
ax.set_ylabel("Count")
ax.set_title("Cumulative model - Posterior Predictive Distribution");

Sequential model

For some ordinal variables, the assumption of a single underlying continuous variable (as in cumulative models) may not be appropriate. If the response can be understood as being the result of a sequential process, such that a higher response category is possible only after all lower categories are achieved, then a sequential model may be more appropriate than a cumulative model.

Sequential models assume that for every category \(k\) there is a latent continuous variable \(Z\) that determines the transition between categories \(k\) and \(k+1\). Now, a threshold \(\tau\) belongs to each latent process. If there are 3 categories, then there are 3 latent processes. If \(Z_k\) is greater than the threshold \(\tau_k\), the sequential process continues, otherwise it stops at category \(k\). As with the cumulative model, we assume a distribution for \(Z_k\) with a cumulative distribution function \(F\).

As an example, let’s suppose we are interested in modeling the probability that a boxer’s bout ends in round 3. This implies that the boxer survived round 1 (\(Z_1 > \tau_1\)) and round 2 (\(Z_2 > \tau_2\)), and that the process stopped at round 3 (\(Z_3 \leq \tau_3\)). This can be written as

\[Pr(Y = 3) = (1 - P(Z_1 \leq \tau_1)) * (1 - P(Z_2 \leq \tau_2)) * P(Z_3 \leq \tau_3)\]

As in the cumulative model above, if we assume the latent variables \(Z_k\) to be normally distributed with the thresholds \(\tau_1 = -1, \tau_2 = 0, \tau_3 = 1\) and cumulative distribution function \(\Phi\) then

\[Pr(Y = 3) = (1 - \Phi(\tau_1)) * (1 - \Phi(\tau_2)) * \Phi(\tau_3)\]

To add predictors to this sequential model, we follow the same specification in the Adding Predictors section above. Thus, the sequential model with predictor terms becomes

\[P(Y = k) = F(\tau_k - \eta) * \prod_{j=1}^{k-1}{(1 - F(\tau_j - \eta))}\]

Thus, the probability that \(Y\) is equal to category \(k\) is equal to the probability that it did not fall in one of the former categories \(1: k-1\) multiplied by the probability that the sequential process stopped at \(k\) rather than continuing past it.
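A minimal numerical sketch of this formula follows. The threshold and predictor values are hypothetical, the logistic CDF is used as \(F\) (matching the logit link used below), and, as in the fitted model below, there are \(K - 1\) thresholds so the process always stops at the last category.

from scipy.special import expit

# hypothetical threshold logits (K - 1 = 3 thresholds, K = 4 categories) and predictor term
tau = np.array([-1.0, 0.0, 1.0])
eta = 0.3

# probability of stopping at each category; the process always stops at the last category
stop = np.append(expit(tau - eta), 1.0)

# probability of reaching category k at all, i.e. prod_{j<k} (1 - F(tau_j - eta))
reach = np.cumprod(np.append(1.0, 1.0 - stop[:-1]))

probs = reach * stop                # P(Y = k) = F(tau_k - eta) * prod_{j<k} (1 - F(tau_j - eta))
print(probs.round(3), probs.sum())  # the four probabilities sum to 1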

Human resources attrition dataset

To illustrate a sequential model with a stopping ratio link function, we will use data from the IBM human resources employee attrition and performance dataset. The original dataset contains 1470 rows and 35 columns. Here, our goal is to model the number of years an employee has been at the company (YearsAtCompany) using their total working years (TotalWorkingYears) as a predictor. This data lends itself to a sequential model because the response is a sequential process: in order to have been at the company for 10 years, it is necessarily true that the employee was at the company for 9 years. Additionally, TotalWorkingYears is chosen as a predictor because it is positively correlated with YearsAtCompany.

attrition = pd.read_csv("data/hr_employee_attrition.tsv.txt", sep="\t")
attrition = attrition[attrition["Attrition"] == "No"]
attrition["YearsAtCompany"] = pd.Categorical(attrition["YearsAtCompany"], ordered=True)
attrition[["YearsAtCompany", "Age"]].head()
YearsAtCompany Age
1 10 49
3 8 33
4 2 27
5 7 32
6 1 59

Below, the empirical probabilities of the response categories are computed. Employees are most likely to stay at the company between 1 and 10 years.

pr_k = attrition.YearsAtCompany.value_counts().sort_index().values / attrition.shape[0]

plt.figure(figsize=(7, 3))
plt.bar(np.arange(0, 36), pr_k)
plt.xlabel("Response category")
plt.ylabel("Probability")
plt.title("Empirical probability of each response category");

Default prior of thresholds

Before we fit the sequential model, it’s worth mentioning that the default priors for the thresholds in a sequential model differ from those in the cumulative model. In the cumulative model, the default prior for the thresholds is a Normal distribution with a grid of evenly spaced \(\mu\) values, and an ordered transformation is applied to ensure the ordering of the values. In the sequential model, however, the ordering of the thresholds does not matter. Thus, the default prior for the thresholds is a Normal distribution with a zero \(\mu\) vector of length \(k - 1\), where \(k\) is the number of response levels. Refer to the getting started docs if you need a refresher on priors in Bambi.
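If you want something other than this default, the threshold prior can in principle be overridden like any other prior in Bambi. The snippet below is only a sketch and is not run here: it assumes the auxiliary parameter can be targeted by its name, threshold, through the priors argument (the name matches the model summaries in this notebook, but check the priors documentation before relying on it), and it uses the model formula introduced in the next cell.

# a sketch of overriding the default threshold prior for the sequential model;
# assumes "threshold" is the name of the auxiliary parameter, as shown in the
# model summaries in this notebook
custom_priors = {
    "threshold": bmb.Prior("Normal", mu=0.0, sigma=2.0),
    "TotalWorkingYears": bmb.Prior("Normal", mu=0.0, sigma=1.0),
}

sequence_model_custom = bmb.Model(
    "YearsAtCompany ~ 0 + TotalWorkingYears",
    data=attrition,
    family="sratio",
    priors=custom_priors,
)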

Subsequently, fitting a sequential model is similar to fitting a cumulative model. The only difference is that we pass family="sratio" to the bambi.Model constructor.

sequence_model = bmb.Model(
    "YearsAtCompany ~ 0 + TotalWorkingYears", 
    data=attrition, 
    family="sratio"
)
sequence_idata = sequence_model.fit(random_seed=1234)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [threshold, TotalWorkingYears]


Sampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 370 seconds.
We recommend running at least 4 chains for robust computation of convergence diagnostics
sequence_model
       Formula: YearsAtCompany ~ 0 + TotalWorkingYears
        Family: sratio
          Link: p = logit
  Observations: 1233
        Priors: 
    target = p
        Common-level effects
            TotalWorkingYears ~ Normal(mu: 0.0, sigma: 0.3223)
        
        Auxiliary parameters
            threshold ~ Normal(mu: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
             0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.], sigma: 1.0)
------
* To see a plot of the priors call the .plot_priors() method.
* To see a summary or plot of the posterior pass the object returned by .fit() to az.summary() or az.plot_trace()
az.summary(sequence_idata)
mean sd hdi_3% hdi_97% mcse_mean mcse_sd ess_bulk ess_tail r_hat
TotalWorkingYears 0.127 0.006 0.117 0.138 0.000 0.000 458.0 719.0 1.00
threshold[0] -2.518 0.188 -2.849 -2.154 0.004 0.003 2139.0 893.0 1.00
threshold[1] -1.051 0.106 -1.253 -0.861 0.003 0.002 1668.0 1416.0 1.00
threshold[2] -1.014 0.117 -1.237 -0.806 0.003 0.002 1480.0 1458.0 1.00
threshold[3] -0.757 0.113 -0.973 -0.554 0.003 0.002 1200.0 1390.0 1.00
threshold[4] -0.759 0.122 -0.978 -0.521 0.003 0.002 1394.0 1346.0 1.00
threshold[5] 0.250 0.108 0.060 0.458 0.003 0.002 1069.0 1283.0 1.00
threshold[6] -0.505 0.148 -0.790 -0.228 0.004 0.003 1436.0 1496.0 1.00
threshold[7] -0.085 0.138 -0.338 0.171 0.004 0.003 1373.0 1364.0 1.00
threshold[8] 0.028 0.143 -0.228 0.305 0.005 0.003 980.0 1503.0 1.00
threshold[9] 0.388 0.151 0.101 0.662 0.004 0.003 1306.0 1373.0 1.00
threshold[10] 1.269 0.153 0.948 1.523 0.005 0.004 801.0 1164.0 1.00
threshold[11] 0.440 0.218 0.061 0.871 0.006 0.004 1224.0 1354.0 1.00
threshold[12] -0.105 0.297 -0.643 0.454 0.007 0.006 1733.0 1697.0 1.00
threshold[13] 0.517 0.242 0.084 0.992 0.007 0.005 1361.0 1272.0 1.00
threshold[14] 0.390 0.280 -0.151 0.910 0.007 0.005 1730.0 1358.0 1.00
threshold[15] 0.797 0.264 0.289 1.273 0.007 0.005 1632.0 1534.0 1.00
threshold[16] 0.426 0.335 -0.193 1.066 0.008 0.006 1787.0 1330.0 1.00
threshold[17] 0.249 0.372 -0.444 0.944 0.010 0.007 1390.0 1357.0 1.00
threshold[18] 0.794 0.335 0.180 1.423 0.008 0.006 1591.0 1638.0 1.00
threshold[19] 0.793 0.370 0.110 1.499 0.009 0.007 1689.0 1201.0 1.00
threshold[20] 2.201 0.272 1.723 2.725 0.008 0.006 1133.0 1333.0 1.00
threshold[21] 1.842 0.349 1.177 2.449 0.009 0.007 1388.0 1316.0 1.00
threshold[22] 2.444 0.342 1.762 3.046 0.009 0.006 1544.0 1429.0 1.00
threshold[23] 0.030 0.724 -1.373 1.291 0.013 0.016 3182.0 1571.0 1.00
threshold[24] 1.587 0.538 0.516 2.535 0.013 0.009 1902.0 1429.0 1.00
threshold[25] 1.507 0.597 0.292 2.507 0.011 0.008 2841.0 1557.0 1.00
threshold[26] 1.735 0.632 0.582 2.871 0.012 0.008 3066.0 1220.0 1.01
threshold[27] 1.047 0.749 -0.325 2.439 0.014 0.010 2780.0 1363.0 1.00
threshold[28] 1.156 0.755 -0.300 2.467 0.014 0.010 2965.0 1382.0 1.00
threshold[29] 0.543 0.850 -1.078 2.144 0.016 0.016 2806.0 1360.0 1.00
threshold[30] 1.268 0.788 -0.124 2.844 0.014 0.012 3180.0 1638.0 1.00
threshold[31] 1.373 0.781 -0.118 2.799 0.014 0.011 2961.0 1629.0 1.00
threshold[32] 2.653 0.732 1.315 4.084 0.015 0.011 2384.0 1292.0 1.00
threshold[33] 0.861 0.942 -0.994 2.554 0.014 0.016 4345.0 1567.0 1.00
threshold[34] 1.757 0.892 0.128 3.399 0.018 0.013 2441.0 1202.0 1.00

The coefficients are still on the logit scale, so we need to apply the inverse of the logit function to transform back to probabilities. Below, we plot these probabilities for each category.

probs = expit_func(sequence_idata.posterior.threshold).mean(("chain", "draw"))
probs = np.append(probs, 1)

plt.figure(figsize=(7, 3))
plt.plot(sorted(attrition.YearsAtCompany.unique()), probs, marker='o')
plt.ylabel("Probability")
plt.xlabel("Response category");

This plot can seem confusing at first. Remember, the sequential model is a product of probabilities, i.e., the probability that \(Y\) is equal to category \(k\) is equal to the probability that it did not stop at one of the former categories \(1 : k-1\) multiplied by the probability that the sequential process stopped at \(k\). Thus, the probability of category 5 is the probability that the sequential process did not stop at 0, 1, 2, 3, or 4 multiplied by the probability that it stopped at 5. This also explains why the probability of category 36 is 1: there is no category after 36, so once the process reaches the last category it must stop there. This is the reason for the “cumulative-like” shape of the plot. But if the values were truly cumulative probabilities, they could not decrease as \(k\) increases.

Posterior predictive samples

Again, using the posterior predictive samples, we can visualize the model fit against the observed data. In the case of the sequential model, the model does an alright job of capturing the observed frequencies of the categories. For pedagogical purposes, this fit is sufficient.

idata_pps = sequence_model.predict(idata=sequence_idata, kind="response", inplace=False)

bins = np.arange(35)
fig, ax = plt.subplots(figsize=(7, 3))
ax = plot_ppc_discrete(idata_pps, bins, ax)
ax.set_xlabel("Response category")
ax.set_ylabel("Count")
ax.set_title("Sequential model - Posterior Predictive Distribution");

Summary

This notebook demonstrated how to fit cumulative and sequential ordinal regression models using Bambi. Cumulative models focus on modeling the cumulative probabilities of an ordinal outcome variable taking on values up to and including a certain category, whereas a sequential model focuses on modeling the probability that an ordinal outcome variable stops at a particular category, rather than continuing to higher categories. To achieve this, both models assume that the response variable originates from a categorization of a latent continuous variable \(Z\). However, the cumulative model assumes that there are \(K\) thresholds \(\tau_k\) that partition \(Z\) into \(K+1\) observable, ordered categories of \(Y\). The sequential model assumes that for every category \(k\) there is a latent continuous variable \(Z\) that determines the transition between categories \(k\) and \(k+1\); thus, a threshold \(\tau\) belongs to each latent process.

Cumulative models can be used in situations where the outcome variable is on a Likert scale and you are interested in understanding the impact of predictors on the probability of reaching or exceeding specific categories. Sequential models are particularly useful when you are interested in understanding the predictors that influence the decision to stop at a specific response level. They are well suited for analyzing data where categories represent stages and the focus is on the transitions between these stages.

%load_ext watermark
%watermark -n -u -v -iv -w
Last updated: Sat May 25 2024

Python implementation: CPython
Python version       : 3.11.9
IPython version      : 8.24.0

bambi     : 0.13.1.dev37+g2a54df76.d20240525
arviz     : 0.18.0
numpy     : 1.26.4
matplotlib: 3.8.4
pandas    : 2.2.2

Watermark: 2.4.3