import warnings
import arviz as az
import bambi as bmb
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc as pm
="ignore", category=FutureWarning) warnings.simplefilter(action
Multilevel Regression and Post-stratification
Regression and Effect Modification
What are we even doing when we fit a regression model? It's a question that arises when first learning the tools of the trade, and again when debugging the strange results of your thousandth logistic regression model.
This notebook is intended to showcase how regression can be seen as a method for automating the calculation of stratum-specific conditional effects. Additionally, we'll see how we can enrich regression models with a post-stratification adjustment when we have knowledge of the appropriate stratum-specific weights. This technique of multilevel regression and post-stratification (MrP) is often used in the context of national surveys, where we know the population weights appropriate to different demographic groups. It can be used in a wide variety of areas ranging from political polling to online market research. We will demonstrate how to fit and assess these models using Bambi.
Risk Stratification
First consider this example of heart transplant patients adapted from Hernan and Robins' excellent book Causal Inference: What If. Here we have a number of patients (anonymised with the names of Greek gods). The data records the outcomes of a heart transplant program for those who were part of the program and those who were not. We also see the risk level assigned to each patient.
What we want to show here is that a regression model fit to this data automatically accounts for the weighting appropriate to the different risk strata. The data is coded with 0-1 indicators for status: Risk_Strata is 1 for higher risk and 0 for lower risk, Outcome is whether or not the patient died from the procedure, and Treatment is whether or not the patient received treatment.
df = pd.DataFrame(
    {
        "name": [
            "Rheia", "Kronos", "Demeter", "Hades", "Hestia",
            "Poseidon", "Hera", "Zeus", "Artemis", "Apollo",
            "Leto", "Ares", "Athena", "Hephaestus", "Aphrodite",
            "Cyclope", "Persephone", "Hermes", "Hebe", "Dionysus",
        ],
        "Risk_Strata": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        "Treatment": [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        "Outcome": [0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0],
    }
)
df["Treatment_x_Risk_Strata"] = df.Treatment * df.Risk_Strata
df
 | name | Risk_Strata | Treatment | Outcome | Treatment_x_Risk_Strata |
---|---|---|---|---|---|
0 | Rheia | 0 | 0 | 0 | 0 |
1 | Kronos | 0 | 0 | 1 | 0 |
2 | Demeter | 0 | 0 | 0 | 0 |
3 | Hades | 0 | 0 | 0 | 0 |
4 | Hestia | 0 | 1 | 0 | 0 |
5 | Poseidon | 0 | 1 | 0 | 0 |
6 | Hera | 0 | 1 | 0 | 0 |
7 | Zeus | 0 | 1 | 1 | 0 |
8 | Artemis | 1 | 0 | 1 | 0 |
9 | Apollo | 1 | 0 | 1 | 0 |
10 | Leto | 1 | 0 | 0 | 0 |
11 | Ares | 1 | 1 | 1 | 1 |
12 | Athena | 1 | 1 | 1 | 1 |
13 | Hephaestus | 1 | 1 | 1 | 1 |
14 | Aphrodite | 1 | 1 | 1 | 1 |
15 | Cyclope | 1 | 1 | 1 | 1 |
16 | Persephone | 1 | 1 | 1 | 1 |
17 | Hermes | 1 | 1 | 0 | 1 |
18 | Hebe | 1 | 1 | 0 | 1 |
19 | Dionysus | 1 | 1 | 0 | 1 |
If the treatment assignment procedure involved complete randomisation then we might expect a reasonable balance of strata effects across the treated and non-treated. In this sample we see (perhaps counter-intuitively) that the treated group has a higher rate of death than the non-treated group.
= df.groupby("Treatment")[["Outcome"]].mean().rename({"Outcome": "Share"}, axis=1)
simple_average simple_average
 | Share |
---|---|
Treatment | |
0 | 0.428571 |
1 | 0.538462 |
This suggests an alarming causal effect whereby the treatment seems to increase the risk of death in the population.
causal_risk_ratio = simple_average.iloc[1]["Share"] / simple_average.iloc[0]["Share"]
print("Causal Risk Ratio:", causal_risk_ratio)
Causal Risk Ratio: 1.2564102564102564
On inspection, we know this finding is driven by the imbalance of the risk strata across the treatment groups.
"Risk_Strata")[["Treatment"]].count().assign(
df.groupby(=lambda x: x["Treatment"] / len(df)
proportion )
 | Treatment | proportion |
---|---|---|
Risk_Strata | ||
0 | 8 | 0.4 |
1 | 12 | 0.6 |
We can correct for this by weighting the results by the share each group represents across the Risk_Strata. In other words, when we correct for the population size at the different levels of risk, we get a better estimate of the effect. First we see what the expected outcome is for each stratum.
outcomes_controlled = (
    df.groupby(["Risk_Strata", "Treatment"])[["Outcome"]]
    .mean()
    .reset_index()
    .pivot(index="Treatment", columns=["Risk_Strata"], values="Outcome")
)

outcomes_controlled
Risk_Strata | 0 | 1 |
---|---|---|
Treatment | ||
0 | 0.25 | 0.666667 |
1 | 0.25 | 0.666667 |
Note how the expected outcomes are equal across the stratified groups. We can now combine these estimates with the population weights (derived earlier) in each segment to get our weighted average.
weighted_avg = outcomes_controlled.assign(formula="0.4*0.25 + 0.6*0.66").assign(
    weighted_average=lambda x: x[0] * (df[df["Risk_Strata"] == 0].shape[0] / len(df))
    + x[1] * (df[df["Risk_Strata"] == 1].shape[0] / len(df))
)

weighted_avg
Risk_Strata | 0 | 1 | formula | weighted_average |
---|---|---|---|---|
Treatment | ||||
0 | 0.25 | 0.666667 | 0.4*0.25 + 0.6*0.66 | 0.5 |
1 | 0.25 | 0.666667 | 0.4*0.25 + 0.6*0.66 | 0.5 |
From these we can derive a more sensible treatment effect.
causal_risk_ratio = (
    weighted_avg.iloc[1]["weighted_average"] / weighted_avg.iloc[0]["weighted_average"]
)
print("Causal Risk Ratio:", causal_risk_ratio)
Causal Risk Ratio: 1.0
Regression as Stratification
So far, so good. But so what?
The point here is that the above series of steps can be difficult to accomplish with more complex sets of groups and risk profiles. So it’s useful to understand that regression can be used to automatically account for the variation in outcome effects across the different strata of our population. More prosaically, the example shows that it really matters what variables you put in your model.
= bmb.Model("Outcome ~ 1 + Treatment", df)
reg = reg.fit()
results
= bmb.Model("Outcome ~ 1 + Treatment + Risk_Strata + Treatment_x_Risk_Strata", df)
reg_strata = reg_strata.fit() results_strata
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [sigma, Intercept, Treatment]
Sampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 2 seconds.
We recommend running at least 4 chains for robust computation of convergence diagnostics
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [sigma, Intercept, Treatment, Risk_Strata, Treatment_x_Risk_Strata]
Sampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 4 seconds.
We recommend running at least 4 chains for robust computation of convergence diagnostics
We can now inspect the treatment effect and the implied causal risk ratio in each model. We quickly see that controlling for the right variables in our regression model automatically adjusts the treatment effect downwards towards 0.
az.summary(results)
 | mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
---|---|---|---|---|---|---|---|---|---|
Intercept | 0.433 | 0.214 | 0.036 | 0.830 | 0.004 | 0.003 | 2394.0 | 1504.0 | 1.0 |
Treatment | 0.109 | 0.275 | -0.379 | 0.641 | 0.006 | 0.007 | 2084.0 | 1071.0 | 1.0 |
sigma | 0.545 | 0.093 | 0.391 | 0.722 | 0.002 | 0.002 | 1831.0 | 1501.0 | 1.0 |
az.summary(results_strata)
 | mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
---|---|---|---|---|---|---|---|---|---|
Intercept | 0.252 | 0.266 | -0.274 | 0.713 | 0.008 | 0.005 | 1212.0 | 1628.0 | 1.00 |
Risk_Strata | 0.408 | 0.404 | -0.302 | 1.165 | 0.014 | 0.010 | 788.0 | 1160.0 | 1.01 |
Treatment | -0.004 | 0.366 | -0.706 | 0.645 | 0.012 | 0.008 | 1016.0 | 1278.0 | 1.00 |
Treatment_x_Risk_Strata | 0.014 | 0.493 | -0.967 | 0.882 | 0.018 | 0.013 | 748.0 | 1009.0 | 1.00 |
sigma | 0.531 | 0.094 | 0.361 | 0.693 | 0.003 | 0.002 | 1413.0 | 1201.0 | 1.00 |
ax = az.plot_forest(
    [results, results_strata],
    model_names=["naive_model", "stratified_model"],
    var_names=["Treatment"],
    kind="ridgeplot",
    ridgeplot_alpha=0.4,
    combined=True,
    figsize=(10, 6),
)
ax[0].axvline(0, color="black", linestyle="--")
ax[0].set_title("Treatment Effects under Stratification/Non-stratification");
We can even see this in the predicted outcomes for the model. This is an important step. The regression model automatically adjusts for the risk profile within the appropriate strata in the data “seen” by the model.
= df[["Risk_Strata"]].assign(Treatment=1).assign(Treatment_x_Risk_Strata=1)
new_df = reg_strata.predict(results_strata, kind="pps", data=new_df, inplace=False)
new_preds print("Expected Outcome in the Treated")
"posterior_predictive"]["Outcome"].mean().item() new_preds[
Expected Outcome in the Treated
0.5073302224271066
= df[["Risk_Strata"]].assign(Treatment=0).assign(Treatment_x_Risk_Strata=0)
new_df = reg_strata.predict(results_strata, kind="pps", data=new_df, inplace=False)
new_preds print("Expected Outcome in the Untreated")
"posterior_predictive"]["Outcome"].mean().item() new_preds[
Expected Outcome in the Untreated
0.4947595268000831
We can see these results more clearly using bambi's model interpretation functions, which show the predictions within specific strata.
fig, axs = plt.subplots(1, 2, figsize=(20, 6))
axs = axs.flatten()
bmb.interpret.plot_predictions(reg, results, conditional=["Treatment"], ax=axs[0])
bmb.interpret.plot_predictions(reg_strata, results_strata, conditional=["Treatment"], ax=axs[1])
axs[0].set_title("Non Stratified Regression \n Model Predictions")
axs[1].set_title("Stratified Regression \n Model Predictions");
Default computed for conditional variable: Treatment
Default computed for conditional variable: Treatment
Default computed for unspecified variable: Risk_Strata, Treatment_x_Risk_Strata
Hernan and Robins expand on these foundational observations and elaborate the implications for causal inference and the bias of confounding variables. We won't go into those details, as we instead want to draw out the connection with controlling for the risk of non-representative sampling. The usefulness of "representativeness" as an idea is disputed in the statistical literature due to the vagueness of the term. To say a sample is representative is usually akin to saying that it was generated from a high-quality probability sampling design. This design is specified to avoid the creep of bias due to selection effects contaminating the results.
We’ve seen how regression can automate stratification across the levels of covariates in the model conditional on the sample data. But what if the prevalence of the risk-profile in your data does not reflect the prevalance of risk in the wider population? Then the regression model will automatically adjust to the prevalence in the sample, but it is not adjusting to the correct weights.
The Need for Post-Stratification
In the context of national survey design there is always a concern that the sample respondents may be more or less representative of the population across different key demographics; e.g. it's unlikely we would put much faith in a survey's accuracy on a question about the lived experience of women if it had 90% male respondents. Given that we can know beforehand that certain demographic splits are not reflective of the census data, we can use this information to appropriately re-weight the regressions fit to non-representative survey data.
We’ll demonstrate the idea of multi-level regression and post-stratification adjustment by replicating some of the steps discussed in Martin, Philips and Gelmen’s “Multilevel Regression and Poststratification Case Studies”.
They cite data from the Cooperative Congressional Election Study (Schaffner, Ansolabehere, and Luks (2018)), a US nationwide survey designed by a consortium of 60 research teams and administered by YouGov. The outcome of interest is a binary question: Should employers decline coverage of abortions in insurance plans?
= pd.read_csv("data/mr_p_cces18_common_vv.csv.gz", low_memory=False)
cces_all_df cces_all_df.head()
 | caseid | commonweight | commonpostweight | vvweight | vvweight_post | tookpost | CCEStake | birthyr | gender | educ | ... | CL_party | CL_2018gvm | CL_2018pep | CL_2018pvm | starttime | endtime | starttime_post | endtime_post | DMA | dmaname |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 123464282 | 0.940543 | 0.7936 | 0.740858 | 0.641412 | 2 | 1 | 1964 | 2 | 4 | ... | 11.0 | 1.0 | NaN | NaN | 04oct2018 02:47:10 | 09oct2018 04:16:31 | 11nov2018 00:41:13 | 11nov2018 01:21:53 | 512.0 | BALTIMORE |
1 | 170169205 | 0.769724 | 0.7388 | 0.425236 | 0.415134 | 2 | 1 | 1971 | 2 | 2 | ... | 13.0 | NaN | 6.0 | 2.0 | 02oct2018 06:55:22 | 02oct2018 07:32:51 | 12nov2018 00:49:50 | 12nov2018 01:08:43 | 531.0 | "TRI-CITIES |
2 | 175996005 | 1.491642 | 1.3105 | 1.700094 | 1.603264 | 2 | 1 | 1958 | 2 | 3 | ... | 13.0 | 5.0 | NaN | NaN | 07oct2018 00:48:23 | 07oct2018 01:38:41 | 12nov2018 21:49:41 | 12nov2018 22:19:28 | 564.0 | CHARLESTON-HUNTINGTON |
3 | 176818556 | 5.104709 | 4.6304 | 5.946729 | 5.658840 | 2 | 1 | 1946 | 2 | 6 | ... | 4.0 | 3.0 | NaN | 3.0 | 11oct2018 15:20:26 | 11oct2018 16:18:42 | 11nov2018 13:24:16 | 11nov2018 14:00:14 | 803.0 | LOS ANGELES |
4 | 202120533 | 0.466526 | 0.3745 | 0.412451 | 0.422327 | 2 | 1 | 1972 | 2 | 2 | ... | 3.0 | 5.0 | NaN | NaN | 08oct2018 02:31:28 | 08oct2018 03:03:48 | 15nov2018 01:04:16 | 15nov2018 01:57:21 | 529.0 | LOUISVILLE |
5 rows × 526 columns
Cleaning Census Data
To prepare the census data for modelling we need to break the demographic data into appropriate strata. We will break out these groupings along broad categories familiar to audiences of election coverage. Even these steps amount to a significant choice, where we use our knowledge of pertinent demographics to decide upon the key strata we wish to represent in our model, as we seek to better predict and understand the voting outcome.
states = [
    "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "FL", "GA",
    "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD",
    "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ",
    "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC",
    "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY",
]
numbers = list(range(1, 56, 1))

lkup_states = dict(zip(numbers, states))
lkup_states
ethnicity = [
    "White",
    "Black",
    "Hispanic",
    "Asian",
    "Native American",
    "Mixed",
    "Other",
    "Middle Eastern",
]
numbers = list(range(1, 9, 1))
lkup_ethnicity = dict(zip(numbers, ethnicity))
lkup_ethnicity
= ["No HS", "HS", "Some college", "Associates", "4-Year College", "Post-grad"]
edu = list(range(1, 7, 1))
numbers = dict(zip(numbers, edu))
lkup_edu
def clean_df(df):
    ## 0 Oppose and 1 Support
    df["abortion"] = np.abs(df["CC18_321d"] - 2)
    df["state"] = df["inputstate"].map(lkup_states)
    ## dichotomous (coded as -0.5 Female, +0.5 Male)
    df["male"] = np.abs(df["gender"] - 2) - 0.5
    df["eth"] = df["race"].map(lkup_ethnicity)
    df["eth"] = np.where(
        df["eth"].isin(["Asian", "Other", "Middle Eastern", "Mixed", "Native American"]),
        "Other",
        df["eth"],
    )
    df["age"] = 2018 - df["birthyr"]
    df["age"] = pd.cut(
        df["age"].astype(int),
        [0, 29, 39, 49, 59, 69, 120],
        labels=["18-29", "30-39", "40-49", "50-59", "60-69", "70+"],
        ordered=True,
    )
    df["edu"] = df["educ"].map(lkup_edu)
    df["edu"] = np.where(df["edu"].isin(["Some college", "Associates"]), "Some college", df["edu"])

    df = df[["abortion", "state", "eth", "male", "age", "edu", "caseid"]]
    return df.dropna()
= pd.read_csv("data/mr_p_statelevel_predictors.csv")
statelevel_predictors_df
= clean_df(cces_all_df)
cces_all_df cces_all_df.head()
 | abortion | state | eth | male | age | edu | caseid |
---|---|---|---|---|---|---|---|
0 | 1.0 | MS | Other | -0.5 | 50-59 | Some college | 123464282 |
1 | 1.0 | WA | White | -0.5 | 40-49 | HS | 170169205 |
2 | 1.0 | RI | White | -0.5 | 60-69 | Some college | 175996005 |
3 | 0.0 | CO | Other | -0.5 | 70+ | Post-grad | 176818556 |
4 | 1.0 | MA | White | -0.5 | 40-49 | HS | 202120533 |
We will now show how estimates drawn from sample data (biased for whatever reasons of chance and circumstance) can be improved by using a post-stratification adjustment based on known facts about the size of the population in each stratum considered in the model. This additional step is simply another modelling choice - another way to invest our model with information. In this manner the technique fits naturally within the Bayesian perspective.
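Written out, the post-stratified estimate of a quantity of interest is simply the population-weighted average of the stratum-level estimates, where \(N_j\) is the known population count of stratum \(j\) and \(\hat{\theta}_j\) is the model's estimate within that stratum:

\[ \hat{\theta}^{PS} = \frac{\sum_{j=1}^{J} N_{j}\hat{\theta}_{j}}{\sum_{j=1}^{J} N_{j}} \]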
Biased Sample
Consider a deliberately biased sample, biased away from the census data. In this manner we show how to better recover population-level estimates by incorporating details about the census population size across each of the strata.
cces_df = cces_all_df.merge(statelevel_predictors_df, left_on="state", right_on="state", how="left")
cces_df["weight"] = (
    5 * cces_df["repvote"]
    + (cces_df["age"] == "18-29") * 0.5
    + (cces_df["age"] == "30-39") * 1
    + (cces_df["age"] == "40-49") * 2
    + (cces_df["age"] == "50-59") * 4
    + (cces_df["age"] == "60-69") * 6
    + (cces_df["age"] == "70+") * 8
    + (cces_df["male"] == 1) * 20
    + (cces_df["eth"] == "White") * 1.05
)

cces_df = cces_df.sample(5000, weights="weight", random_state=1000)
cces_df.head()
 | abortion | state | eth | male | age | edu | caseid | repvote | region | weight |
---|---|---|---|---|---|---|---|---|---|---|
35171 | 0.0 | KY | White | -0.5 | 60-69 | HS | 415208636 | 0.656706 | South | 10.333531 |
5167 | 0.0 | NM | White | 0.5 | 60-69 | No HS | 412278020 | 0.453492 | West | 9.317460 |
52365 | 0.0 | OK | Hispanic | -0.5 | 30-39 | 4-Year College | 419467449 | 0.693047 | South | 4.465237 |
23762 | 1.0 | WV | White | -0.5 | 50-59 | Post-grad | 413757903 | 0.721611 | South | 8.658053 |
48197 | 0.0 | RI | White | 0.5 | 50-59 | 4-Year College | 417619385 | 0.416893 | Northeast | 7.134465 |
Visualise the Bias
Now we plot the expected shares within each demographic bucket across both the biased sample and the census data.
= """
mosaic ABCD
EEEE
"""
= plt.figure(layout="constrained", figsize=(20, 10))
fig = fig.subplot_mosaic(mosaic)
ax_dict
def plot_var(var, ax):
= (
a =False)[["abortion"]]
cces_df.groupby(var, observed
.mean()"abortion": "share"}, axis=1)
.rename({
.reset_index()
)= (
b =False)[["abortion"]]
cces_all_df.groupby(var, observed
.mean()"abortion": "share_census"}, axis=1)
.rename({
.reset_index()
)= a.merge(b).sort_values("share")
a
ax_dict[ax].vlines(a[var], a.share, a.share_census)="blue", label="Sample")
ax_dict[ax].scatter(a[var], a.share, color="red", label="Census")
ax_dict[ax].scatter(a[var], a.share_census, color"Proportion")
ax_dict[ax].set_ylabel(
"age", "A")
plot_var("edu", "B")
plot_var("male", "C")
plot_var("eth", "D")
plot_var("state", "E")
plot_var(
"E"].legend()
ax_dict[
"C"].set_xticklabels([])
ax_dict["C"].set_xlabel("Female / Male")
ax_dict["Comparison of Proportions: Survey Sample V Census", fontsize=20); plt.suptitle(
We can see here how markedly the proportions differ between the census report and our biased sample in how they represent the preferences within each stratum. We now try to quantify the overall differences between the biased sample and the census report by calculating the expected proportions in each dataset along with their standard errors.
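For a binary outcome, the standard error of a sample proportion \(\hat{p}\) over \(n\) respondents follows the usual Bernoulli formula, which is what the helper function below computes:

\[ \text{SE}(\hat{p}) = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \]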
def get_se_bernoulli(p, n):
return np.sqrt(p * (1 - p) / n)
sample_cces_estimate = {
    "mean": np.mean(cces_df["abortion"].astype(float)),
    "se": get_se_bernoulli(np.mean(cces_df["abortion"].astype(float)), len(cces_df)),
}
sample_cces_estimate

sample_cces_all_estimate = {
    "mean": np.mean(cces_all_df["abortion"].astype(float)),
    "se": get_se_bernoulli(np.mean(cces_all_df["abortion"].astype(float)), len(cces_all_df)),
}
sample_cces_all_estimate

summary = pd.DataFrame([sample_cces_all_estimate, sample_cces_estimate])
summary["data"] = ["Full Data", "Biased Data"]
summary
 | mean | se | data |
---|---|---|---|
0 | 0.434051 | 0.002113 | Full Data |
1 | 0.465000 | 0.007054 | Biased Data |
A 3 percentage-point difference in a national survey is a substantial error when the difference is due to preventable bias.
Modelling the Data
To facilitate regression-based stratification we first need a regression model. In our case we will ultimately fit a multilevel regression model with intercept terms for each of the groups in our demographic strata. In this way we try to account for the appropriate set of variables (as in the example above) to better specify the effect modification due to membership of a particular demographic stratum.
We will fit the model with bambi using the binomial family on the biased sample data. But first we aggregate up by demographic strata and count the occurrences within each stratum.
model_df = (
    cces_df.groupby(["state", "eth", "male", "age", "edu"], observed=False)
    .agg({"caseid": "nunique", "abortion": "sum"})
    .reset_index()
    .sort_values("abortion", ascending=False)
    .rename({"caseid": "n"}, axis=1)
    .merge(statelevel_predictors_df, left_on="state", right_on="state", how="left")
)
model_df["abortion"] = model_df["abortion"].astype(int)
model_df["n"] = model_df["n"].astype(int)
model_df.head()
 | state | eth | male | age | edu | n | abortion | repvote | region |
---|---|---|---|---|---|---|---|---|---|
0 | ID | White | -0.5 | 70+ | HS | 32 | 18 | 0.683102 | West |
1 | ID | White | 0.5 | 70+ | 4-Year College | 20 | 16 | 0.683102 | West |
2 | WV | White | 0.5 | 70+ | Some college | 17 | 13 | 0.721611 | South |
3 | WV | White | 0.5 | 70+ | 4-Year College | 15 | 12 | 0.721611 | South |
4 | ID | White | 0.5 | 70+ | Post-grad | 17 | 11 | 0.683102 | West |
Our model_df now has one row per stratum across all the demographic cuts.
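As a quick sanity check, a minimal sketch (the 46 is the number of distinct states that survive in our biased sample; the other level counts follow from the recoding above): because the groupby above used observed=False, the row count should equal the product of the level counts.

# 46 observed states x 4 ethnicity groups x 2 sexes x 6 age bands x 5 education levels
n_strata = 46 * 4 * 2 * 6 * 5
print(n_strata, len(model_df) == n_strata)  # 11040 True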
Fit Base Model to Biased Sample
Here we use some of bambi’s latest functionality to assess the interaction effects between the variables.
= """ p(abortion, n) ~ C(state) + C(eth) + C(edu) + male + repvote"""
formula
= bmb.Model(formula, model_df, family="binomial")
base_model
= base_model.fit(
result =100,
random_seed=0.95,
target_accept={"log_likelihood": True},
idata_kwargs )
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [Intercept, C(state), C(eth), C(edu), male, repvote]
Sampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 1325 seconds.
We recommend running at least 4 chains for robust computation of convergence diagnostics
The rhat statistic is larger than 1.01 for some parameters. This indicates problems during sampling. See https://arxiv.org/abs/1903.08008 for details
The effective sample size per chain is smaller than 100 for some parameters. A higher number is needed for reliable rhat and ess computation. See https://arxiv.org/abs/1903.08008 for details
We plot the predicted outcomes within each group using the plot_predictions function.
= """
mosaic AABB
CCCC
"""
= plt.figure(layout="constrained", figsize=(20, 7))
fig = fig.subplot_mosaic(mosaic)
axs
"eth", ax=axs["A"])
bmb.interpret.plot_predictions(base_model, result, "edu", ax=axs["B"])
bmb.interpret.plot_predictions(base_model, result, "state", ax=axs["C"])
bmb.interpret.plot_predictions(base_model, result, "Plot Prediction per Class", fontsize=20); plt.suptitle(
Default computed for conditional variable: eth
Default computed for unspecified variable: edu, male, repvote, state
Default computed for conditional variable: edu
Default computed for unspecified variable: eth, male, repvote, state
Default computed for conditional variable: state
Default computed for unspecified variable: edu, eth, male, repvote
More interestingly, we can use the comparison functionality to examine differences in eth conditional on age and edu. We can see that the differences between ethnicities are fairly stable across all age groups, shifted slightly within the Post-grad level of education.
fig, ax = bmb.interpret.plot_comparisons(
    model=base_model,
    idata=result,
    contrast={"eth": ["Black", "White"]},
    conditional=["age", "edu"],
    comparison_type="diff",
    subplot_kwargs={"main": "age", "group": "edu"},
    fig_kwargs={"figsize": (12, 5), "sharey": True},
    legend=True,
)
ax[0].set_title("Comparison of Difference in Ethnicity \n within Age and Educational Strata");
Default computed for conditional variable: age, edu
Default computed for unspecified variable: male, repvote, state
We can pull these specific estimates out into a table for closer inspection, to see that the differences in response expected between the extremes of educational attainment are moderated by state and race.
bmb.interpret.comparisons(
    model=base_model,
    idata=result,
    contrast={"edu": ["Post-grad", "No HS"]},
    conditional={"eth": ["Black", "White"], "state": ["NY", "CA", "ID", "VA"]},
    comparison_type="diff",
)
Default computed for unspecified variable: male, repvote
 | term | estimate_type | value | eth | state | male | repvote | estimate | lower_3.0% | upper_97.0% |
---|---|---|---|---|---|---|---|---|---|---|
0 | edu | diff | (Post-grad, No HS) | Black | CA | 0.0 | 0.530191 | 0.080885 | 0.000010 | 0.192397 |
1 | edu | diff | (Post-grad, No HS) | Black | ID | 0.0 | 0.530191 | 0.084249 | 0.000027 | 0.189721 |
2 | edu | diff | (Post-grad, No HS) | Black | NY | 0.0 | 0.530191 | 0.096051 | 0.000142 | 0.197739 |
3 | edu | diff | (Post-grad, No HS) | Black | VA | 0.0 | 0.530191 | 0.126502 | 0.017389 | 0.215645 |
4 | edu | diff | (Post-grad, No HS) | White | CA | 0.0 | 0.530191 | 0.079627 | 0.000004 | 0.190620 |
5 | edu | diff | (Post-grad, No HS) | White | ID | 0.0 | 0.530191 | 0.094141 | 0.000065 | 0.199827 |
6 | edu | diff | (Post-grad, No HS) | White | NY | 0.0 | 0.530191 | 0.094476 | 0.000149 | 0.199248 |
7 | edu | diff | (Post-grad, No HS) | White | VA | 0.0 | 0.530191 | 0.100414 | 0.005637 | 0.197609 |
With this in mind we want to fit our final model to incorporate the variation we see here across the different levels of our stratified data.
Fit Final Model to Biased Sample
We can specify these features of our model using a hierarchical structure as follows:
\[ Pr(y_i = 1) = \text{logit}^{-1}( \alpha_{\rm s[i]}^{\rm state} + \alpha_{\rm a[i]}^{\rm age} + \alpha_{\rm r[i]}^{\rm eth} + \alpha_{\rm e[i]}^{\rm edu} + \beta^{\rm male} \cdot {\rm Male}_{\rm i} + \alpha_{\rm g[i], r[i]}^{\rm male.eth} + \alpha_{\rm e[i], a[i]}^{\rm edu.age} + \alpha_{\rm e[i], r[i]}^{\rm edu.eth} ) \]
Here we have used the fact that we can add components to the \(\alpha\) intercept terms and interaction effects to express the stratum-specific variation in the outcomes that we've seen in our exploratory work. Using the bambi formula syntax, we have:
= "p(abortion, n) ~ male + repvote + (1 | state) + (1 | eth) + (1 | edu) + (1 | male:eth) + (1 | edu:age) + (1 | edu:eth)"
formula
= bmb.Model(formula, model_df, family="binomial")
model_hierarchical
= model_hierarchical.fit(
result =100,
random_seed=0.99,
target_accept#inference_method="nuts_numpyro",
#num_chains=4,
={"log_likelihood": True},
idata_kwargs )
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [Intercept, male, repvote, 1|state_sigma, 1|state_offset, 1|eth_sigma, 1|eth_offset, 1|edu_sigma, 1|edu_offset, 1|male:eth_sigma, 1|male:eth_offset, 1|edu:age_sigma, 1|edu:age_offset, 1|edu:eth_sigma, 1|edu:eth_offset]
/home/tomas/anaconda3/envs/bambi-dev/lib/python3.11/multiprocessing/popen_fork.py:66: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.
self.pid = os.fork()
Sampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 4434 seconds.
Chain 1 reached the maximum tree depth. Increase `max_treedepth`, increase `target_accept` or reparameterize.
We recommend running at least 4 chains for robust computation of convergence diagnostics
The rhat statistic is larger than 1.01 for some parameters. This indicates problems during sampling. See https://arxiv.org/abs/1903.08008 for details
The effective sample size per chain is smaller than 100 for some parameters. A higher number is needed for reliable rhat and ess computation. See https://arxiv.org/abs/1903.08008 for details
result
(Output: an arviz InferenceData object with posterior, log_likelihood, sample_stats, and observed_data groups; 2 chains of 1,000 draws each over 11,040 observed strata, with group-level terms for edu, eth, state and the edu:age, edu:eth, male:eth interactions.)
=["Intercept", "male", "1|edu", "1|eth", "repvote"]) az.summary(result, var_names
 | mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
---|---|---|---|---|---|---|---|---|---|
Intercept | 0.167 | 0.433 | -0.218 | 1.107 | 0.187 | 0.140 | 5.0 | 94.0 | 2.02 |
male | 0.499 | 0.330 | 0.010 | 0.799 | 0.215 | 0.188 | 3.0 | 20.0 | 2.19 |
1|edu[4-Year College] | 0.921 | 0.978 | -0.256 | 1.890 | 0.684 | 0.582 | 3.0 | 11.0 | 2.09 |
1|edu[HS] | 2.123 | 2.067 | -0.105 | 4.185 | 1.453 | 1.231 | 3.0 | 14.0 | 2.17 |
1|edu[No HS] | -1.513 | 1.686 | -3.191 | 0.423 | 1.183 | 1.004 | 3.0 | 11.0 | 2.11 |
1|edu[Post-grad] | -2.459 | 2.273 | -4.725 | 0.020 | 1.597 | 1.354 | 3.0 | 13.0 | 2.05 |
1|edu[Some college] | -0.528 | 0.566 | -1.079 | 0.207 | 0.390 | 0.335 | 3.0 | 50.0 | 1.97 |
1|eth[Black] | -5.782 | 5.378 | -11.687 | 0.006 | 3.783 | 3.201 | 3.0 | 11.0 | 2.23 |
1|eth[Hispanic] | 6.118 | 6.054 | -0.285 | 12.765 | 4.260 | 3.603 | 3.0 | 11.0 | 2.23 |
1|eth[Other] | -7.896 | 7.996 | -16.677 | 0.473 | 5.628 | 4.760 | 3.0 | 11.0 | 2.23 |
1|eth[White] | -1.228 | 1.422 | -2.758 | 0.510 | 0.986 | 0.843 | 3.0 | 11.0 | 2.23 |
repvote | -0.928 | 0.474 | -1.949 | -0.429 | 0.228 | 0.174 | 5.0 | 113.0 | 2.23 |
The terms in the model formula allow for specific intercept terms across the demographic splits of eth, edu, and state. These represent stratum-specific adjustments of the intercept term in the model. Similarly we invoke intercepts for the interaction terms of edu:age, male:eth and edu:eth. Each of these cohorts represents a share of the data in our sample.
model_hierarchical.graph()
We then predict the outcomes implied by the biased sample. These predictions are to be adjusted by what we take to be the share of each demographic cohort in the population. We can plot the posterior predictive distribution against the observed data from our biased sample to see that we have a generally good fit to the distribution.
="response")
model_hierarchical.predict(result, kind= az.plot_ppc(result, figsize=(8, 5), kind="cumulative", observed_rug=True)
ax "Posterior Predictive Checks \n On Biased Sample"); ax.set_title(
Apply the Post-stratification Weighting
We now use the fitted model to predict the voting shares on the data where we use the genuine state numbers per strata. To do so we load data from the national census and augment our data set so as to be able to apply the appropriate weights.
= pd.read_csv("data/mr_p_poststrat_df.csv")
poststrat_df
= poststrat_df.merge(
new_data ="state", right_on="state", how="left"
statelevel_predictors_df, left_on
)"educ": "edu"}, axis=1, inplace=True)
new_data.rename({= model_df.merge(
new_data
new_data,="left",
how=["state", "eth", "male", "age", "edu"],
left_on=["state", "eth", "male", "age", "edu"],
right_on"n_y": "n", "repvote_y": "repvote"}, axis=1)[
).rename({"state", "eth", "male", "age", "edu", "n", "repvote"]
[
]
= new_data.merge(
new_data "state").agg({"n": "sum"}).reset_index().rename({"n": "state_total"}, axis=1)
new_data.groupby(
)"state_percent"] = new_data["n"] / new_data["state_total"]
new_data[ new_data.head()
 | state | eth | male | age | edu | n | repvote | state_total | state_percent |
---|---|---|---|---|---|---|---|---|---|
0 | ID | White | -0.5 | 70+ | HS | 31503 | 0.683102 | 1193885 | 0.026387 |
1 | ID | White | 0.5 | 70+ | 4-Year College | 11809 | 0.683102 | 1193885 | 0.009891 |
2 | WV | White | 0.5 | 70+ | Some college | 17089 | 0.721611 | 1441882 | 0.011852 |
3 | WV | White | 0.5 | 70+ | 4-Year College | 7396 | 0.721611 | 1441882 | 0.005129 |
4 | ID | White | 0.5 | 70+ | Post-grad | 9873 | 0.683102 | 1193885 | 0.008270 |
This dataset has exactly the same structure and length as the input data to the fitted model. We have simply switched the observed counts across the demographic strata with the counts that reflect their proportion in the national survey. Additionally we have calculated the state totals and the share of each stratum within the state. This will be important later, when we use this state_percent variable to calculate an adjusted MrP estimate of the predictions at a state level. We now use this dataset with our fitted model to generate the posterior predictive distribution.
result_adjust = model_hierarchical.predict(result, data=new_data, inplace=False, kind="response")
result_adjust
(Output: the InferenceData as above, with the posterior now carrying the predicted probabilities p for each of the 11,040 post-stratification rows and a new posterior_predictive group of predicted counts under the census weights.)
Make Adjustments by State
We need to adjust each state-specific stratum by the weight appropriate for each state in order to post-stratify the estimates. To do so we extract the indices for each stratum in our data on a state-by-state basis. Then we weight the predicted estimate by the appropriate percentage within the state and sum them to recover a state-level estimate.
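In equation form, the loop below computes, for each state \(s\), a weighted sum in which each stratum's posterior expectation \(\hat{\theta}_j\) is weighted by that stratum's census share of the state (the state_percent column, i.e. \(N_j / N_s\)):

\[ \hat{\theta}_{s}^{MrP} = \sum_{j \in s} \frac{N_{j}}{N_{s}} \hat{\theta}_{j} \]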
estimates = []
abortion_posterior_base = az.extract(result, num_samples=2000)["p"]
abortion_posterior_mrp = az.extract(result_adjust, num_samples=2000)["p"]

for s in new_data["state"].unique():
    idx = new_data.index[new_data["state"] == s].tolist()
    predicted_mrp = (
        (abortion_posterior_mrp[idx].mean(dim="sample") * new_data.iloc[idx]["state_percent"])
        .sum()
        .item()
    )
    predicted_mrp_lb = (
        (
            abortion_posterior_mrp[idx].quantile(0.025, dim="sample")
            * new_data.iloc[idx]["state_percent"]
        )
        .sum()
        .item()
    )
    predicted_mrp_ub = (
        (
            abortion_posterior_mrp[idx].quantile(0.975, dim="sample")
            * new_data.iloc[idx]["state_percent"]
        )
        .sum()
        .item()
    )
    predicted = abortion_posterior_base[idx].mean().item()
    base_lb = abortion_posterior_base[idx].quantile(0.025).item()
    base_ub = abortion_posterior_base[idx].quantile(0.975).item()

    estimates.append(
        [s, predicted, base_lb, base_ub, predicted_mrp, predicted_mrp_ub, predicted_mrp_lb]
    )
state_predicted = pd.DataFrame(
    estimates,
    columns=["state", "base_expected", "base_lb", "base_ub", "mrp_adjusted", "mrp_ub", "mrp_lb"],
)

state_predicted = (
    state_predicted.merge(cces_all_df.groupby("state")[["abortion"]].mean().reset_index())
    .sort_values("mrp_adjusted")
    .rename({"abortion": "census_share"}, axis=1)
)
state_predicted.head()
 | state | base_expected | base_lb | base_ub | mrp_adjusted | mrp_ub | mrp_lb | census_share |
---|---|---|---|---|---|---|---|---|
36 | WY | 0.338274 | 2.440599e-22 | 1.0 | 0.378751 | 0.688924 | 0.110863 | 0.367188 |
9 | OK | 0.357949 | 3.722056e-19 | 1.0 | 0.413191 | 0.689851 | 0.139452 | 0.321553 |
41 | HI | 0.424572 | 5.972032e-22 | 1.0 | 0.415080 | 0.775743 | 0.124029 | 0.251969 |
37 | NJ | 0.405110 | 3.618629e-20 | 1.0 | 0.430793 | 0.719506 | 0.160311 | 0.426923 |
26 | NV | 0.406223 | 3.766937e-22 | 1.0 | 0.434659 | 0.756333 | 0.152533 | 0.581609 |
This was the crucial step, and we'll need to unpack it a little. We have taken (state by state) each demographic stratum and reweighted the expected posterior predictive value by the share that stratum represents in the national census within that state. We have then aggregated these scores within the state to generate a state-specific value. This value can now be compared to the expected value derived from our biased data and, more interestingly, the value reported in the national census.
Plot the Effect of Adjustment
These adjusted estimates can be plotted against the shares recorded at the state level in the census. The adjustments provide a far better reflection of the national picture than those derived from the model fitted to the biased sample.
fig, axs = plt.subplots(2, 1, figsize=(17, 10))
axs = axs.flatten()
ax = axs[0]
ax1 = axs[1]

ax.scatter(
    state_predicted["state"], state_predicted["base_expected"], color="red", label="Biased Sample"
)
ax.scatter(
    state_predicted["state"],
    state_predicted["mrp_adjusted"],
    color="slateblue",
    label="Mr P Adjusted",
)
ax.scatter(
    state_predicted["state"],
    state_predicted["census_share"],
    color="darkgreen",
    label="Census Aggregates",
)
ax.legend()
ax.vlines(
    state_predicted["state"],
    state_predicted["mrp_adjusted"],
    state_predicted["census_share"],
    color="black",
    linestyles="--",
)
ax1.scatter(
    state_predicted["state"], state_predicted["base_expected"], color="red", label="Biased Sample"
)
ax1.scatter(
    state_predicted["state"],
    state_predicted["mrp_adjusted"],
    color="slateblue",
    label="Mr P Adjusted",
)
ax1.legend()
ax1.vlines(
    state_predicted["state"], state_predicted["base_ub"], state_predicted["base_lb"], color="red"
)
ax1.vlines(
    state_predicted["state"],
    state_predicted["mrp_ub"],
    state_predicted["mrp_lb"],
    color="slateblue",
)
ax.set_xlabel("State")
ax.set_ylabel("Proportion")
ax1.set_title(
    "Comparison of Uncertainty in Biased Predictions and Post-stratified Adjustment", fontsize=15
)
ax.set_title("Comparison of Post-stratified Adjustment and Census Report", fontsize=15)
ax1.set_ylabel("Proportion");
In the top plot here we see the state specific MrP estimates for the proportion voting yes, compared to the estimate inferred from the biased sample and estimates from the national census. We can see how the MrP estimates are much closer to those drawn from the national census.
In the lower plot we've shown the estimates from the MrP model and the estimates drawn from the biased sample, but here we've shown the uncertainty in the estimation at the state level. Clearly, the MrP adjustment also shrinks the uncertainty in our estimate of vote share.
MrP is in this sense a corrective procedure for the avoidance of bias in sample data, where we have strong evidence for adjusting the weight accorded to any stratum of data in our population.
Conclusion
In this notebook we have seen how to use bambi to concisely and quickly apply the technique of multilevel regression and post-stratification. We've seen how this technique is a natural and compelling extension to regression modelling in general, one that incorporates prior knowledge in an interesting and flexible manner.
The problems of representation in data are serious. Policy gets made and changed on the basis of anticipated policy effects. Without the ability to control and adjust for non-representative samples, politicians and policy makers risk prioritising initiatives for a vocal majority among those represented in the sample. Whether a given sample is "good" or "bad" can never be known at the time, so some care needs to be taken when choosing how to adjust your model of the data.
Predictions made from sample data are consequential. It's no exaggeration to say that the fates of entire nations can hang on decisions made from poorly understood sampling procedures. Multilevel regression and post-stratification is an apt tool for making the adjustments required and guiding decision makers in crucial policy choices, but it should be used carefully.