import arviz as az
import bambi as bmb
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
Robust Linear Regression
This example has been lifted from the PyMC Docs and adapted for Bambi by Tyler James Burch (@tjburch on GitHub).
Many toy datasets circumvent problems that practitioners run into with real data. Specifically, the assumption of normality can be easily violated by outliers, which can cause havoc in traditional linear regression. One way to navigate this is through robust linear regression, outlined in this example.
First load modules and set the RNG for reproducibility.
"arviz-darkgrid")
az.style.use(1111) np.random.seed(
Next, generate pseudodata. The bulk of the data will be linear with normally distributed noise, but several outliers will be interjected as well.
size = 100
true_intercept = 1
true_slope = 2

x = np.linspace(0, 1, size)
# y = a + b*x
true_regression_line = true_intercept + true_slope * x
# add noise
y = true_regression_line + np.random.normal(scale=0.5, size=size)

# Add outliers
x_out = np.append(x, [0.1, 0.15, 0.2])
y_out = np.append(y, [8, 6, 9])

data = pd.DataFrame({
    "x": x_out,
    "y": y_out
})
Plot this data. The three data points in the top left are the interjected data.
fig = plt.figure(figsize=(7, 7))
ax = fig.add_subplot(111, xlabel="x", ylabel="y", title="Generated data and underlying model")
ax.plot(x_out, y_out, "x", label="sampled data")
ax.plot(x, true_regression_line, label="true regression line", lw=2.0)
plt.legend(loc=0);
To highlight the problem, first fit a standard linear regression, which assumes normally distributed errors.
# Note, "gaussian" is the default argument for family. Added to be explicit.
= bmb.Model("y ~ x", data, family="gaussian")
gauss_model = gauss_model.fit(draws=2000, idata_kwargs={"log_likelihood": True})
gauss_fitted ="pps") gauss_model.predict(gauss_fitted, kind
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [sigma, Intercept, x]
Sampling 2 chains for 1_000 tune and 2_000 draw iterations (2_000 + 4_000 draws total) took 3 seconds.
We recommend running at least 4 chains for robust computation of convergence diagnostics
/home/tomas/Desktop/OSS/bambinos/bambi/bambi/models.py:846: FutureWarning: 'pps' has been replaced by 'response' and is not going to work in the future
warnings.warn(
az.summary(gauss_fitted)
 | mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
---|---|---|---|---|---|---|---|---|---|
Intercept | 1.532 | 0.230 | 1.103 | 1.972 | 0.003 | 0.002 | 5262.0 | 2988.0 | 1.0 |
sigma | 1.186 | 0.085 | 1.023 | 1.345 | 0.001 | 0.001 | 5676.0 | 2953.0 | 1.0 |
x | 1.203 | 0.399 | 0.423 | 1.913 | 0.005 | 0.004 | 5501.0 | 3207.0 | 1.0 |
mu[0] | 1.532 | 0.230 | 1.103 | 1.972 | 0.003 | 0.002 | 5262.0 | 2988.0 | 1.0 |
mu[1] | 1.544 | 0.226 | 1.124 | 1.982 | 0.003 | 0.002 | 5263.0 | 2988.0 | 1.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
mu[98] | 2.722 | 0.230 | 2.317 | 3.178 | 0.003 | 0.002 | 6078.0 | 2990.0 | 1.0 |
mu[99] | 2.735 | 0.233 | 2.310 | 3.185 | 0.003 | 0.002 | 6071.0 | 2979.0 | 1.0 |
mu[100] | 1.652 | 0.196 | 1.290 | 2.029 | 0.003 | 0.002 | 5287.0 | 3085.0 | 1.0 |
mu[101] | 1.712 | 0.181 | 1.383 | 2.059 | 0.002 | 0.002 | 5309.0 | 3236.0 | 1.0 |
mu[102] | 1.772 | 0.166 | 1.467 | 2.090 | 0.002 | 0.002 | 5357.0 | 3234.0 | 1.0 |
106 rows × 9 columns
Remember, the true intercept was 1 and the true slope was 2. The recovered intercept is much higher and the slope is much lower, so the influence of the outliers is apparent.
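As an aside, if you only want the regression parameters and not every mu[i] row, az.summary accepts a var_names argument; an optional one-liner:
az.summary(gauss_fitted, var_names=["Intercept", "x", "sigma"])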
Visually, looking at the recovered regression line and posterior predictive HDI highlights the problem further.
plt.figure(figsize=(7, 5))
# Plot Data
plt.plot(x_out, y_out, "x", label="data")
# Plot recovered linear regression
x_range = np.linspace(min(x_out), max(x_out), 2000)
y_pred = gauss_fitted.posterior.x.mean().item() * x_range + gauss_fitted.posterior.Intercept.mean().item()
plt.plot(x_range, y_pred,
    color="black", linestyle="--",
    label="Recovered regression line"
)
# Plot HDIs
for interval in [0.38, 0.68]:
    az.plot_hdi(x_out, gauss_fitted.posterior_predictive.y, hdi_prob=interval, color="firebrick")
# Plot true regression line
plt.plot(x, true_regression_line, label="True regression line", lw=2.0, color="black")
plt.legend(loc=0);
The recovered regression line, as well as the \(0.5\sigma\) and \(1\sigma\) bands, is shown.
Clearly there is skew in the fit. At lower \(x\) values, the regression line sits far above the true line. This is a result of the outliers, which pull the model toward higher values in that regime.
Additionally, the uncertainty bands are too wide (remember, the \(1\sigma\) band ought to cover 68% of the data, while here it covers most of the points). Because a normal distribution places little probability mass in its tails, the outliers have a large effect, causing the uncertainty bands to be oversized.
Clearly, assuming the data are distributed normally is causing problems here. Bayesian robust linear regression forgoes the normality assumption by instead using a Student T distribution to describe the distribution of the data. The Student T distribution has thicker tails, and by allocating more probability mass to the tails, outliers have a weaker effect on the fit.
Comparing the two distributions,
normal_data = np.random.normal(loc=0, scale=1, size=100_000)
t_data = np.random.standard_t(df=1, size=100_000)

bins = np.arange(-8, 8, 0.15)

plt.hist(normal_data,
    bins=bins, density=True,
    alpha=0.6,
    label="Normal"
)
plt.hist(t_data,
    bins=bins, density=True,
    alpha=0.6,
    label="Student T"
)
plt.xlabel("x")
plt.ylabel("Probability density")
plt.xlim(-8, 8)
plt.legend();
As we can see, the tails of the Student T distribution are much heavier, which means values far from the mean are more likely than under the normal distribution.
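To put a number on this, here is a quick tail-probability check (a small sketch using scipy.stats, which is not imported above):
from scipy import stats

# Probability of landing more than 4 units from the center
print(2 * stats.norm.sf(4))      # Normal: about 6e-5
print(2 * stats.t.sf(4, df=1))   # Student T with nu=1: about 0.16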
The T distribution is specified by a number of degrees of freedom (\(\nu\)). In numpy.random.standard_t this is the parameter df; in the PyMC T distribution, it's nu. It is constrained to real numbers greater than 0. As the degrees of freedom increase, the probability in the tails of the Student T distribution decreases. In the limit \(\nu \rightarrow + \infty\), the Student T distribution is a normal distribution. Below, the T distribution is plotted for various \(\nu\).
bins = np.arange(-8, 8, 0.15)
for ndof in [0.1, 1, 10]:
    t_data = np.random.standard_t(df=ndof, size=100_000)
    plt.hist(t_data,
        bins=bins, density=True,
        label=f"$\\nu = {ndof}$",
        histtype="step"
    )
plt.hist(normal_data,
    bins=bins, density=True,
    histtype="step",
    label="Normal"
)
plt.xlabel("x")
plt.ylabel("Probability density")
plt.xlim(-6, 6)
plt.legend();
In Bambi, the way to specify a regression with Student T distributed data is by passing "t" to the family parameter of a Model.
= bmb.Model("y ~ x", data, family="t")
t_model = t_model.fit(draws=2000, idata_kwargs={"log_likelihood": True})
t_fitted ="pps") t_model.predict(t_fitted, kind
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [nu, sigma, Intercept, x]
Sampling 2 chains for 1_000 tune and 2_000 draw iterations (2_000 + 4_000 draws total) took 4 seconds.
We recommend running at least 4 chains for robust computation of convergence diagnostics
/home/tomas/Desktop/OSS/bambinos/bambi/bambi/models.py:846: FutureWarning: 'pps' has been replaced by 'response' and is not going to work in the future
warnings.warn(
az.summary(t_fitted)
 | mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
---|---|---|---|---|---|---|---|---|---|
Intercept | 0.993 | 0.109 | 0.793 | 1.193 | 0.002 | 0.001 | 5305.0 | 3502.0 | 1.0 |
nu | 2.610 | 0.610 | 1.533 | 3.721 | 0.009 | 0.007 | 4246.0 | 3662.0 | 1.0 |
sigma | 0.406 | 0.045 | 0.326 | 0.489 | 0.001 | 0.001 | 3669.0 | 3243.0 | 1.0 |
x | 1.900 | 0.189 | 1.533 | 2.242 | 0.002 | 0.002 | 6136.0 | 3533.0 | 1.0 |
mu[0] | 0.993 | 0.109 | 0.793 | 1.193 | 0.002 | 0.001 | 5305.0 | 3502.0 | 1.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
mu[98] | 2.875 | 0.105 | 2.683 | 3.077 | 0.001 | 0.001 | 6187.0 | 3392.0 | 1.0 |
mu[99] | 2.894 | 0.107 | 2.700 | 3.100 | 0.001 | 0.001 | 6199.0 | 3308.0 | 1.0 |
mu[100] | 1.183 | 0.093 | 1.018 | 1.359 | 0.001 | 0.001 | 5178.0 | 3469.0 | 1.0 |
mu[101] | 1.278 | 0.085 | 1.126 | 1.440 | 0.001 | 0.001 | 5108.0 | 3472.0 | 1.0 |
mu[102] | 1.373 | 0.078 | 1.226 | 1.516 | 0.001 | 0.001 | 5016.0 | 3384.0 | 1.0 |
107 rows × 9 columns
Note the new parameter in the model, nu. This is the aforementioned degrees of freedom. If this value were very high, we would expect the data to be well described by a normal distribution. However, its HDI spans from roughly 1.5 to 3.7, meaning the tails are much heavier than those of a normal distribution. As a result of the heavier tails, sigma has also dropped substantially relative to the normal model, meaning the oversized uncertainty bands from above have shrunk.
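To see the full posteriors of these two parameters rather than just the summary table, one option is ArviZ's plot_posterior:
az.plot_posterior(t_fitted, var_names=["nu", "sigma"]);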
Comparing the extracted values of the two models,
def get_slope_intercept(mod):
    return (
        mod.posterior.x.mean().item(),
        mod.posterior.Intercept.mean().item()
    )

gauss_slope, gauss_int = get_slope_intercept(gauss_fitted)
t_slope, t_int = get_slope_intercept(t_fitted)

pd.DataFrame({
    "Model": ["True", "Normal", "T"],
    "Slope": [2, gauss_slope, t_slope],
    "Intercept": [1, gauss_int, t_int]
}).set_index("Model").T.round(decimals=2)
Model | True | Normal | T |
---|---|---|---|
Slope | 2.0 | 1.20 | 1.90 |
Intercept | 1.0 | 1.53 | 0.99 |
Here we can see that the mean recovered values of both the slope and intercept are far closer to the true values for the robust regression model than for the normally distributed one.
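An optional visual version of the same comparison is a forest plot of both posteriors side by side (the model names below are just labels we chose):
az.plot_forest(
    [gauss_fitted, t_fitted],
    model_names=["Normal", "Student T"],
    var_names=["Intercept", "x"],
    combined=True
);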
Visually comparing the robust regression line,
plt.figure(figsize=(7, 5))
# Plot Data
plt.plot(x_out, y_out, "x", label="data")
# Plot recovered robust linear regression
x_range = np.linspace(min(x_out), max(x_out), 2000)
y_pred = t_fitted.posterior.x.mean().item() * x_range + t_fitted.posterior.Intercept.mean().item()
plt.plot(x_range, y_pred,
    color="black", linestyle="--",
    label="Recovered regression line"
)
# Plot HDIs
for interval in [0.05, 0.38, 0.68]:
    az.plot_hdi(x_out, t_fitted.posterior_predictive.y, hdi_prob=interval, color="firebrick")
# Plot true regression line
plt.plot(x, true_regression_line, label="true regression line", lw=2.0, color="black")
plt.legend(loc=0);
This is much better. The true and recovered regression lines are much closer, and the uncertainty bands are appropriately sized. The effect of the outliers is not entirely gone: the recovered line still differs slightly from the true line, but the effect is far smaller. This is a result of the Student T likelihood ascribing a higher probability to outliers than the normal distribution does. Additionally, this inference is based on sampling methods, so small differences are expected (especially given a relatively small number of samples).
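As an extra check, not part of the original example, a posterior predictive check can be drawn directly from the fitted model, since the posterior predictive samples were already generated above:
az.plot_ppc(t_fitted, num_pp_samples=100);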
Lastly, another way to evaluate the models is to compare them using leave-one-out cross-validation (LOO), which provides an estimate of accuracy on out-of-sample predictions.
models = {
    "gaussian": gauss_fitted,
    "Student T": t_fitted
}
df_compare = az.compare(models)
df_compare
/home/tomas/anaconda3/envs/bambi-dev/lib/python3.11/site-packages/arviz/stats/stats.py:789: UserWarning: Estimated shape parameter of Pareto distribution is greater than 0.7 for one or more samples. You should consider using a more robust model, this is because importance sampling is less likely to work well if the marginal posterior and LOO posterior are very different. This is more likely to happen with a non-robust model and highly influential observations.
warnings.warn(
 | rank | elpd_loo | p_loo | elpd_diff | weight | se | dse | warning | scale |
---|---|---|---|---|---|---|---|---|---|
Student T | 0 | -101.645151 | 5.484467 | 0.000000 | 1.0 | 14.932301 | 0.000000 | False | log |
gaussian | 1 | -172.368370 | 14.605692 | 70.723218 | 0.0 | 29.836317 | 18.037873 | True | log |
az.plot_compare(df_compare, insample_dev=False);
Here it is quite obvious that the Student T model is much better, since it has a clearly higher elpd_loo.
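Relatedly, the Pareto shape warning raised by az.compare for the Gaussian model can be inspected observation by observation; a sketch using pointwise LOO and az.plot_khat (we would expect the high k values to correspond to the injected outliers):
gauss_loo = az.loo(gauss_fitted, pointwise=True)
az.plot_khat(gauss_loo);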
%load_ext watermark
%watermark -n -u -v -iv -w
Last updated: Sat May 25 2024
Python implementation: CPython
Python version : 3.11.9
IPython version : 8.24.0
pandas : 2.2.2
matplotlib: 3.8.4
bambi : 0.13.1.dev37+g2a54df76.d20240525
numpy : 1.26.4
arviz : 0.18.0
Watermark: 2.4.3