{
"cells": [
{
"cell_type": "markdown",
"id": "editorial-support",
"metadata": {},
"source": [
"# Getting started"
]
},
{
"cell_type": "markdown",
"id": "collectible-gather",
"metadata": {},
"source": [
"formulae requires a working Python interpreter (3.7+) and the libraries NumPy, SciPy and pandas with versions specified in the [requirements.txt](https://github.com/bambinos/formulae/blob/master/requirements.txt) file.\n",
"\n",
"Assuming a standard Python environment is installed on your machine (including pip), the latest release of formulae can be installed in one line using pip:\n",
"\n",
"`pip install formulae`\n",
"\n",
"Alternatively, if you want the development version of the package you can install from GitHub:\n",
"\n",
"`pip install git+https://github.com/bambinos/formulae.git`"
]
},
{
"cell_type": "markdown",
"id": "emerging-april",
"metadata": {},
"source": [
"## User guide\n",
"\n",
"The main function is `design_matrices()`. It takes a model formula and a pandas DataFrame and returns an object of class `DesignMatrices` that contains information about the response, the common effects, and the group specific effects that can be accessed with the attributes `.response`, `.common`, and `.group` respectively. \n",
"\n",
"\n",
"Model formulas are much like the very famous model formulas in the R programming language. The extension to group-specific effects is similar to the extension provided by the [lme4](https://CRAN.R-project.org/package=lme4) package, also in R."
]
},
{
"cell_type": "markdown",
"id": "fatty-dividend",
"metadata": {},
"source": [
"## Setup"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "checked-moisture",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"from formulae import design_matrices"
]
},
{
"cell_type": "markdown",
"id": "functioning-petroleum",
"metadata": {},
"source": [
"Let's simulate some data to use throughout examples here. The number of observations isn't too important. We keep it low to understand what is going on more easily. However, we do include both numeric and categoric variables to see how formulae handles these types in different scenarios."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "scientific-armenia",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" y1 | \n",
" y2 | \n",
" x | \n",
" z | \n",
" g | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0.200521 | \n",
" C | \n",
" -0.351447 | \n",
" 3 | \n",
" Group 2 | \n",
"
\n",
" \n",
" 1 | \n",
" -0.421306 | \n",
" B | \n",
" -0.431129 | \n",
" 3 | \n",
" Group 1 | \n",
"
\n",
" \n",
" 2 | \n",
" -0.490128 | \n",
" B | \n",
" -0.412738 | \n",
" 2 | \n",
" Group 2 | \n",
"
\n",
" \n",
" 3 | \n",
" -0.929507 | \n",
" B | \n",
" 0.574337 | \n",
" 3 | \n",
" Group 2 | \n",
"
\n",
" \n",
" 4 | \n",
" 1.515838 | \n",
" A | \n",
" 1.561390 | \n",
" 3 | \n",
" Group 1 | \n",
"
\n",
" \n",
" 5 | \n",
" -1.180238 | \n",
" A | \n",
" -0.357137 | \n",
" 1 | \n",
" Group 2 | \n",
"
\n",
" \n",
" 6 | \n",
" -1.839936 | \n",
" C | \n",
" -0.453278 | \n",
" 3 | \n",
" Group 3 | \n",
"
\n",
" \n",
" 7 | \n",
" -0.330336 | \n",
" B | \n",
" -0.039111 | \n",
" 3 | \n",
" Group 2 | \n",
"
\n",
" \n",
" 8 | \n",
" -0.797019 | \n",
" A | \n",
" 1.882748 | \n",
" 2 | \n",
" Group 2 | \n",
"
\n",
" \n",
" 9 | \n",
" -0.030036 | \n",
" A | \n",
" 0.549615 | \n",
" 3 | \n",
" Group 3 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" y1 y2 x z g\n",
"0 0.200521 C -0.351447 3 Group 2\n",
"1 -0.421306 B -0.431129 3 Group 1\n",
"2 -0.490128 B -0.412738 2 Group 2\n",
"3 -0.929507 B 0.574337 3 Group 2\n",
"4 1.515838 A 1.561390 3 Group 1\n",
"5 -1.180238 A -0.357137 1 Group 2\n",
"6 -1.839936 C -0.453278 3 Group 3\n",
"7 -0.330336 B -0.039111 3 Group 2\n",
"8 -0.797019 A 1.882748 2 Group 2\n",
"9 -0.030036 A 0.549615 3 Group 3"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"rng = np.random.default_rng(7355608)\n",
"SIZE = 10\n",
"data = pd.DataFrame(\n",
" {\n",
" \"y1\": rng.normal(size=SIZE),\n",
" \"y2\": rng.choice([\"A\", \"B\", \"C\"], size=SIZE),\n",
" \"x\": rng.normal(size=SIZE),\n",
" \"z\": rng.choice([1, 2, 3], size=SIZE),\n",
" \"g\": rng.choice([\"Group 1\", \"Group 2\", \"Group 3\"], size=SIZE),\n",
" }\n",
")\n",
"data"
]
},
{
"cell_type": "markdown",
"id": "bronze-diving",
"metadata": {},
"source": [
"## Creating and accessing design matrices"
]
},
{
"cell_type": "markdown",
"id": "geological-reserve",
"metadata": {},
"source": [
"Let's create a `DesignMatrices` object with a numeric response and a numeric predictor. The first argument we pass is the model formula, and the second is the data frame where we take variables from."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "conservative-vision",
"metadata": {},
"outputs": [],
"source": [
"dm = design_matrices(\"y1 ~ x\", data)"
]
},
{
"cell_type": "markdown",
"id": "requested-auction",
"metadata": {},
"source": [
"Under `dm.common` we find an object of class `CommonEffectsMatrix` which stores the design matrix for the common part of the model, as well as other information about the terms in the matrix."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "accepted-electronics",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"CommonEffectsMatrix with shape (10, 2)\n",
"Terms: \n",
" Intercept \n",
" kind: intercept\n",
" column: 0\n",
" x \n",
" kind: numeric\n",
" column: 1\n",
"\n",
"To access the actual design matrix do 'np.array(this_obj)'"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dm.common"
]
},
{
"cell_type": "markdown",
"id": "banned-timothy",
"metadata": {},
"source": [
"Use `.design_matrix` to obtain the design matrix for the common part of the model."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "healthy-substitute",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 1. , -0.351447 ],\n",
" [ 1. , -0.43112895],\n",
" [ 1. , -0.41273793],\n",
" [ 1. , 0.57433652],\n",
" [ 1. , 1.56139001],\n",
" [ 1. , -0.35713698],\n",
" [ 1. , -0.45327849],\n",
" [ 1. , -0.03911089],\n",
" [ 1. , 1.88274763],\n",
" [ 1. , 0.54961543]])"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dm.common.design_matrix"
]
},
{
"cell_type": "markdown",
"id": "90ef5349",
"metadata": {},
"source": [
"Or `np.array(dm.common)`"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "bf9039c9",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 1. , -0.351447 ],\n",
" [ 1. , -0.43112895],\n",
" [ 1. , -0.41273793],\n",
" [ 1. , 0.57433652],\n",
" [ 1. , 1.56139001],\n",
" [ 1. , -0.35713698],\n",
" [ 1. , -0.45327849],\n",
" [ 1. , -0.03911089],\n",
" [ 1. , 1.88274763],\n",
" [ 1. , 0.54961543]])"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.array(dm.common)"
]
},
{
"cell_type": "markdown",
"id": "professional-pride",
"metadata": {},
"source": [
"Or call the `.as_dataframe()` method to return a data frame."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "egyptian-plate",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Intercept | \n",
" x | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1.0 | \n",
" -0.351447 | \n",
"
\n",
" \n",
" 1 | \n",
" 1.0 | \n",
" -0.431129 | \n",
"
\n",
" \n",
" 2 | \n",
" 1.0 | \n",
" -0.412738 | \n",
"
\n",
" \n",
" 3 | \n",
" 1.0 | \n",
" 0.574337 | \n",
"
\n",
" \n",
" 4 | \n",
" 1.0 | \n",
" 1.561390 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Intercept x\n",
"0 1.0 -0.351447\n",
"1 1.0 -0.431129\n",
"2 1.0 -0.412738\n",
"3 1.0 0.574337\n",
"4 1.0 1.561390"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dm.common.as_dataframe().head()"
]
},
{
"cell_type": "markdown",
"id": "f41c55db",
"metadata": {},
"source": [
"Note the response term can be omitted and we still obtain the design matrix for the common predictors"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "3fb94076",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Intercept | \n",
" x | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1.0 | \n",
" -0.351447 | \n",
"
\n",
" \n",
" 1 | \n",
" 1.0 | \n",
" -0.431129 | \n",
"
\n",
" \n",
" 2 | \n",
" 1.0 | \n",
" -0.412738 | \n",
"
\n",
" \n",
" 3 | \n",
" 1.0 | \n",
" 0.574337 | \n",
"
\n",
" \n",
" 4 | \n",
" 1.0 | \n",
" 1.561390 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Intercept x\n",
"0 1.0 -0.351447\n",
"1 1.0 -0.431129\n",
"2 1.0 -0.412738\n",
"3 1.0 0.574337\n",
"4 1.0 1.561390"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dm = design_matrices(\"x\", data)\n",
"dm.common.as_dataframe().head()"
]
},
{
"cell_type": "markdown",
"id": "temporal-wisconsin",
"metadata": {},
"source": [
"## Categorical predictors\n",
"\n",
"Variables with categorical type in the data frame are recognized and handled as such. If there's a numerical variable that should be interpreted as categorical and you don't want to manipulate your data manually, you can wrap the name of the variable with `C()`, like `C(variable)`, and formulae will convert it to the categorical type.\n",
"\n",
"Categorical predictors are encoded using reference level encoding (i.e. the first level is taken as baseline). This way, the resulting design matrix is of full-rank."
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "scheduled-repair",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Intercept | \n",
" g[Group 2] | \n",
" g[Group 3] | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1.0 | \n",
" 1.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 1 | \n",
" 1.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 2 | \n",
" 1.0 | \n",
" 1.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 3 | \n",
" 1.0 | \n",
" 1.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 4 | \n",
" 1.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Intercept g[Group 2] g[Group 3]\n",
"0 1.0 1.0 0.0\n",
"1 1.0 0.0 0.0\n",
"2 1.0 1.0 0.0\n",
"3 1.0 1.0 0.0\n",
"4 1.0 0.0 0.0"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dm = design_matrices(\"g\", data)\n",
"dm.common.as_dataframe().head()"
]
},
{
"cell_type": "markdown",
"id": "passing-shopper",
"metadata": {},
"source": [
"But if we don't have an intercept term, there's no need to use reference encoding for the categorical variables to avoid linear dependencies between the columns and formulae uses cell-means encoding."
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "exciting-ultimate",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" g[Group 1] | \n",
" g[Group 2] | \n",
" g[Group 3] | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
"
\n",
" \n",
" 1 | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 2 | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
"
\n",
" \n",
" 3 | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
"
\n",
" \n",
" 4 | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" g[Group 1] g[Group 2] g[Group 3]\n",
"0 0 1 0\n",
"1 1 0 0\n",
"2 0 1 0\n",
"3 0 1 0\n",
"4 1 0 0"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dm = design_matrices(\"0 + g\", data)\n",
"dm.common.as_dataframe().head()"
]
},
{
"cell_type": "markdown",
"id": "after-label",
"metadata": {},
"source": [
"Suppose that `z` actually represents a certain categorization represented by numbers. We can use `\"y1 ~ C(z)\"` and `z` will be internally interpreted as categorical instead of numeric."
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "final-lucas",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Intercept | \n",
" C(z)[2] | \n",
" C(z)[3] | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1.0 | \n",
" 0.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 1 | \n",
" 1.0 | \n",
" 0.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 2 | \n",
" 1.0 | \n",
" 1.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 3 | \n",
" 1.0 | \n",
" 0.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 4 | \n",
" 1.0 | \n",
" 0.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Intercept C(z)[2] C(z)[3]\n",
"0 1.0 0.0 1.0\n",
"1 1.0 0.0 1.0\n",
"2 1.0 1.0 0.0\n",
"3 1.0 0.0 1.0\n",
"4 1.0 0.0 1.0"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dm = design_matrices(\"C(z)\", data)\n",
"dm.common.as_dataframe().head()"
]
},
{
"cell_type": "markdown",
"id": "sought-cover",
"metadata": {},
"source": [
"By default, the `C()` wrapper takes the first level as the reference level (after being sorted with `sorted()`), but we can override this by passing a list with a custom ordering. "
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "sustained-orleans",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Intercept | \n",
" C(z, levels = lvl)[3] | \n",
" C(z, levels = lvl)[1] | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1.0 | \n",
" 1.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 1 | \n",
" 1.0 | \n",
" 1.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 2 | \n",
" 1.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 3 | \n",
" 1.0 | \n",
" 1.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 4 | \n",
" 1.0 | \n",
" 1.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Intercept C(z, levels = lvl)[3] C(z, levels = lvl)[1]\n",
"0 1.0 1.0 0.0\n",
"1 1.0 1.0 0.0\n",
"2 1.0 0.0 0.0\n",
"3 1.0 1.0 0.0\n",
"4 1.0 1.0 0.0"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# 2 is taken as baseline, and then it comes 3 and 1.\n",
"lvl = [2, 3, 1]\n",
"dm = design_matrices(\"C(z, levels=lvl)\", data)\n",
"dm.common.as_dataframe().head()"
]
},
{
"cell_type": "markdown",
"id": "03178207",
"metadata": {},
"source": [
"Finally, if we want to convert one variable into a binary variable, we can use the internal `binary()` function:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "2689c46e",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Intercept | \n",
" binary(z, 2) | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
"
\n",
" \n",
" 1 | \n",
" 1 | \n",
" 0 | \n",
"
\n",
" \n",
" 2 | \n",
" 1 | \n",
" 1 | \n",
"
\n",
" \n",
" 3 | \n",
" 1 | \n",
" 0 | \n",
"
\n",
" \n",
" 4 | \n",
" 1 | \n",
" 0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Intercept binary(z, 2)\n",
"0 1 0\n",
"1 1 0\n",
"2 1 1\n",
"3 1 0\n",
"4 1 0"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dm = design_matrices(\"binary(z, 2)\", data)\n",
"dm.common.as_dataframe().head()"
]
},
{
"cell_type": "markdown",
"id": "aa87b287",
"metadata": {},
"source": [
"which also works for categorical variables"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "4dc247ed",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Intercept | \n",
" binary(g, Group 1) | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
"
\n",
" \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
"
\n",
" \n",
" 2 | \n",
" 1 | \n",
" 0 | \n",
"
\n",
" \n",
" 3 | \n",
" 1 | \n",
" 0 | \n",
"
\n",
" \n",
" 4 | \n",
" 1 | \n",
" 1 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Intercept binary(g, Group 1)\n",
"0 1 0\n",
"1 1 1\n",
"2 1 0\n",
"3 1 0\n",
"4 1 1"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dm = design_matrices(\"binary(g, 'Group 1')\", data)\n",
"dm.common.as_dataframe().head()"
]
},
{
"cell_type": "markdown",
"id": "perfect-roller",
"metadata": {},
"source": [
"## Adding interactions\n",
"\n",
"The interaction operator is `:`."
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "convertible-tactics",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Intercept | \n",
" x:z | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1.0 | \n",
" -1.054341 | \n",
"
\n",
" \n",
" 1 | \n",
" 1.0 | \n",
" -1.293387 | \n",
"
\n",
" \n",
" 2 | \n",
" 1.0 | \n",
" -0.825476 | \n",
"
\n",
" \n",
" 3 | \n",
" 1.0 | \n",
" 1.723010 | \n",
"
\n",
" \n",
" 4 | \n",
" 1.0 | \n",
" 4.684170 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Intercept x:z\n",
"0 1.0 -1.054341\n",
"1 1.0 -1.293387\n",
"2 1.0 -0.825476\n",
"3 1.0 1.723010\n",
"4 1.0 4.684170"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dm = design_matrices(\"x:z\", data)\n",
"dm.common.as_dataframe().head()"
]
},
{
"cell_type": "markdown",
"id": "happy-lobby",
"metadata": {},
"source": [
"The `*` operator is known as the full interaction operator because `x*z` is equivalent to `x + z + x:z`"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "impressive-attention",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Intercept | \n",
" x | \n",
" z | \n",
" x:z | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1.0 | \n",
" -0.351447 | \n",
" 3.0 | \n",
" -1.054341 | \n",
"
\n",
" \n",
" 1 | \n",
" 1.0 | \n",
" -0.431129 | \n",
" 3.0 | \n",
" -1.293387 | \n",
"
\n",
" \n",
" 2 | \n",
" 1.0 | \n",
" -0.412738 | \n",
" 2.0 | \n",
" -0.825476 | \n",
"
\n",
" \n",
" 3 | \n",
" 1.0 | \n",
" 0.574337 | \n",
" 3.0 | \n",
" 1.723010 | \n",
"
\n",
" \n",
" 4 | \n",
" 1.0 | \n",
" 1.561390 | \n",
" 3.0 | \n",
" 4.684170 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Intercept x z x:z\n",
"0 1.0 -0.351447 3.0 -1.054341\n",
"1 1.0 -0.431129 3.0 -1.293387\n",
"2 1.0 -0.412738 2.0 -0.825476\n",
"3 1.0 0.574337 3.0 1.723010\n",
"4 1.0 1.561390 3.0 4.684170"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dm = design_matrices(\"x*z\", data)\n",
"dm.common.as_dataframe().head()"
]
},
{
"cell_type": "markdown",
"id": "greater-scale",
"metadata": {},
"source": [
"And both interaction operators can be used with mixtures of numerical and categorical variables."
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "liberal-bandwidth",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" x | \n",
" g[Group 1] | \n",
" g[Group 2] | \n",
" g[Group 3] | \n",
" x:g[Group 2] | \n",
" x:g[Group 3] | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" -0.351447 | \n",
" 0.0 | \n",
" 1.0 | \n",
" 0.0 | \n",
" -0.351447 | \n",
" -0.0 | \n",
"
\n",
" \n",
" 1 | \n",
" -0.431129 | \n",
" 1.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" -0.000000 | \n",
" -0.0 | \n",
"
\n",
" \n",
" 2 | \n",
" -0.412738 | \n",
" 0.0 | \n",
" 1.0 | \n",
" 0.0 | \n",
" -0.412738 | \n",
" -0.0 | \n",
"
\n",
" \n",
" 3 | \n",
" 0.574337 | \n",
" 0.0 | \n",
" 1.0 | \n",
" 0.0 | \n",
" 0.574337 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 4 | \n",
" 1.561390 | \n",
" 1.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.000000 | \n",
" 0.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" x g[Group 1] g[Group 2] g[Group 3] x:g[Group 2] x:g[Group 3]\n",
"0 -0.351447 0.0 1.0 0.0 -0.351447 -0.0\n",
"1 -0.431129 1.0 0.0 0.0 -0.000000 -0.0\n",
"2 -0.412738 0.0 1.0 0.0 -0.412738 -0.0\n",
"3 0.574337 0.0 1.0 0.0 0.574337 0.0\n",
"4 1.561390 1.0 0.0 0.0 0.000000 0.0"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dm = design_matrices(\"0 + x*g\", data)\n",
"dm.common.as_dataframe().head()"
]
},
{
"cell_type": "markdown",
"id": "filled-wichita",
"metadata": {},
"source": [
"## Function calls\n",
"\n",
"A term does not need to be just a variable in a data frame. It can also be the result of a function call. Formulae allows you to built-in functions as well as functions available from the top-level environment where `design_matrices` is being called.\n",
"\n",
"Since we've loaded the NumPy library as `np`, we can access its namespace as we would do in regular Python code."
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "angry-consistency",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Intercept | \n",
" np.exp(x) | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1.0 | \n",
" 0.703669 | \n",
"
\n",
" \n",
" 1 | \n",
" 1.0 | \n",
" 0.649775 | \n",
"
\n",
" \n",
" 2 | \n",
" 1.0 | \n",
" 0.661836 | \n",
"
\n",
" \n",
" 3 | \n",
" 1.0 | \n",
" 1.775952 | \n",
"
\n",
" \n",
" 4 | \n",
" 1.0 | \n",
" 4.765441 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Intercept np.exp(x)\n",
"0 1.0 0.703669\n",
"1 1.0 0.649775\n",
"2 1.0 0.661836\n",
"3 1.0 1.775952\n",
"4 1.0 4.765441"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dm = design_matrices(\"np.exp(x)\", data)\n",
"dm.common.as_dataframe().head()"
]
},
{
"cell_type": "markdown",
"id": "educated-kazakhstan",
"metadata": {},
"source": [
"We can also use our own custom functions"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "voluntary-compatibility",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Intercept | \n",
" add(x, 10) | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1.0 | \n",
" 9.648553 | \n",
"
\n",
" \n",
" 1 | \n",
" 1.0 | \n",
" 9.568871 | \n",
"
\n",
" \n",
" 2 | \n",
" 1.0 | \n",
" 9.587262 | \n",
"
\n",
" \n",
" 3 | \n",
" 1.0 | \n",
" 10.574337 | \n",
"
\n",
" \n",
" 4 | \n",
" 1.0 | \n",
" 11.561390 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Intercept add(x, 10)\n",
"0 1.0 9.648553\n",
"1 1.0 9.568871\n",
"2 1.0 9.587262\n",
"3 1.0 10.574337\n",
"4 1.0 11.561390"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def add(x, how_much):\n",
" return x + how_much\n",
"\n",
"dm = design_matrices(\"add(x, 10)\", data)\n",
"dm.common.as_dataframe().head()"
]
},
{
"cell_type": "markdown",
"id": "checked-corrections",
"metadata": {},
"source": [
"Or even nested function calls!"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "ready-application",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Intercept | \n",
" add(np.exp(x), 5) | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1.0 | \n",
" 5.703669 | \n",
"
\n",
" \n",
" 1 | \n",
" 1.0 | \n",
" 5.649775 | \n",
"
\n",
" \n",
" 2 | \n",
" 1.0 | \n",
" 5.661836 | \n",
"
\n",
" \n",
" 3 | \n",
" 1.0 | \n",
" 6.775952 | \n",
"
\n",
" \n",
" 4 | \n",
" 1.0 | \n",
" 9.765441 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Intercept add(np.exp(x), 5)\n",
"0 1.0 5.703669\n",
"1 1.0 5.649775\n",
"2 1.0 5.661836\n",
"3 1.0 6.775952\n",
"4 1.0 9.765441"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dm = design_matrices(\"add(np.exp(x), 5)\", data)\n",
"dm.common.as_dataframe().head()"
]
},
{
"cell_type": "markdown",
"id": "reported-conditioning",
"metadata": {},
"source": [
"## Built-in transformations\n",
"\n",
"Formulae comes with some built-in transformations that come very handy when doing statistical modeling. These functions are\n",
"\n",
"* `center()`: Centers the variable around its mean.\n",
"* `scale()`: Centers the variable around its mean and divides it by its standard deviation.\n",
"* `standardize()`: An alias for `scale()`.\n",
"* `bs()`: Create a Basis Spline, a.k.a. B-Spline. "
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "flexible-watershed",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Intercept | \n",
" scale(x) | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1.0 | \n",
" -0.732600 | \n",
"
\n",
" \n",
" 1 | \n",
" 1.0 | \n",
" -0.829284 | \n",
"
\n",
" \n",
" 2 | \n",
" 1.0 | \n",
" -0.806969 | \n",
"
\n",
" \n",
" 3 | \n",
" 1.0 | \n",
" 0.390720 | \n",
"
\n",
" \n",
" 4 | \n",
" 1.0 | \n",
" 1.588384 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Intercept scale(x)\n",
"0 1.0 -0.732600\n",
"1 1.0 -0.829284\n",
"2 1.0 -0.806969\n",
"3 1.0 0.390720\n",
"4 1.0 1.588384"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dm = design_matrices(\"scale(x)\", data)\n",
"dm.common.as_dataframe().head()"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "lovely-arlington",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1.0\n",
"-2.2204460492503132e-17\n"
]
}
],
"source": [
"# check mean is 0 and std is 1\n",
"print(np.std(dm.common.as_dataframe()[\"scale(x)\"]))\n",
"print(np.mean(dm.common.as_dataframe()[\"scale(x)\"]))"
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "2567616c",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Intercept | \n",
" bs(x, df = 6)[0] | \n",
" bs(x, df = 6)[1] | \n",
" bs(x, df = 6)[2] | \n",
" bs(x, df = 6)[3] | \n",
" bs(x, df = 6)[4] | \n",
" bs(x, df = 6)[5] | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1.0 | \n",
" 0.281092 | \n",
" 0.655237 | \n",
" 0.063433 | \n",
" 0.000237 | \n",
" 0.000000 | \n",
" 0.000000e+00 | \n",
"
\n",
" \n",
" 1 | \n",
" 1.0 | \n",
" 0.703742 | \n",
" 0.086820 | \n",
" 0.000757 | \n",
" 0.000000 | \n",
" 0.000000 | \n",
" 0.000000e+00 | \n",
"
\n",
" \n",
" 2 | \n",
" 1.0 | \n",
" 0.737838 | \n",
" 0.240873 | \n",
" 0.004644 | \n",
" 0.000000 | \n",
" 0.000000 | \n",
" 0.000000e+00 | \n",
"
\n",
" \n",
" 3 | \n",
" 1.0 | \n",
" 0.000000 | \n",
" 0.000000 | \n",
" 0.359380 | \n",
" 0.502344 | \n",
" 0.138276 | \n",
" 1.039084e-07 | \n",
"
\n",
" \n",
" 4 | \n",
" 1.0 | \n",
" 0.000000 | \n",
" 0.000000 | \n",
" 0.005325 | \n",
" 0.092999 | \n",
" 0.470374 | \n",
" 4.313026e-01 | \n",
"
\n",
" \n",
" 5 | \n",
" 1.0 | \n",
" 0.312950 | \n",
" 0.631876 | \n",
" 0.055013 | \n",
" 0.000161 | \n",
" 0.000000 | \n",
" 0.000000e+00 | \n",
"
\n",
" \n",
" 6 | \n",
" 1.0 | \n",
" 0.000000 | \n",
" 0.000000 | \n",
" 0.000000 | \n",
" 0.000000 | \n",
" 0.000000 | \n",
" 0.000000e+00 | \n",
"
\n",
" \n",
" 7 | \n",
" 1.0 | \n",
" 0.000000 | \n",
" 0.296984 | \n",
" 0.611161 | \n",
" 0.090700 | \n",
" 0.001155 | \n",
" 0.000000e+00 | \n",
"
\n",
" \n",
" 8 | \n",
" 1.0 | \n",
" 0.000000 | \n",
" 0.000000 | \n",
" 0.000000 | \n",
" 0.000000 | \n",
" 0.000000 | \n",
" 1.000000e+00 | \n",
"
\n",
" \n",
" 9 | \n",
" 1.0 | \n",
" 0.000000 | \n",
" 0.000008 | \n",
" 0.380123 | \n",
" 0.494494 | \n",
" 0.125375 | \n",
" 0.000000e+00 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Intercept bs(x, df = 6)[0] bs(x, df = 6)[1] bs(x, df = 6)[2] \\\n",
"0 1.0 0.281092 0.655237 0.063433 \n",
"1 1.0 0.703742 0.086820 0.000757 \n",
"2 1.0 0.737838 0.240873 0.004644 \n",
"3 1.0 0.000000 0.000000 0.359380 \n",
"4 1.0 0.000000 0.000000 0.005325 \n",
"5 1.0 0.312950 0.631876 0.055013 \n",
"6 1.0 0.000000 0.000000 0.000000 \n",
"7 1.0 0.000000 0.296984 0.611161 \n",
"8 1.0 0.000000 0.000000 0.000000 \n",
"9 1.0 0.000000 0.000008 0.380123 \n",
"\n",
" bs(x, df = 6)[3] bs(x, df = 6)[4] bs(x, df = 6)[5] \n",
"0 0.000237 0.000000 0.000000e+00 \n",
"1 0.000000 0.000000 0.000000e+00 \n",
"2 0.000000 0.000000 0.000000e+00 \n",
"3 0.502344 0.138276 1.039084e-07 \n",
"4 0.092999 0.470374 4.313026e-01 \n",
"5 0.000161 0.000000 0.000000e+00 \n",
"6 0.000000 0.000000 0.000000e+00 \n",
"7 0.090700 0.001155 0.000000e+00 \n",
"8 0.000000 0.000000 1.000000e+00 \n",
"9 0.494494 0.125375 0.000000e+00 "
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dm = design_matrices(\"bs(x, df=6)\", data)\n",
"dm.common.as_dataframe()"
]
},
{
"cell_type": "markdown",
"id": "experienced-baghdad",
"metadata": {},
"source": [
"These transformations are known as **stateful transformations** because they remember the original values of parameters involved in the transformation (the mean and the standard deviation). This feature is critical when one wants to derive a design matrix based on new data, as required when evaluating the fit of a model on a new dataset."
]
},
{
"cell_type": "markdown",
"id": "genuine-engineering",
"metadata": {},
"source": [
"### Suppresing the default intercept\n",
"\n",
"By default, formulae includes an intercept in the resulting desing matrix. But this can be ommited. Just add a `0` to the model formula as in the following example. "
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "sharp-simple",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" x | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" -0.351447 | \n",
"
\n",
" \n",
" 1 | \n",
" -0.431129 | \n",
"
\n",
" \n",
" 2 | \n",
" -0.412738 | \n",
"
\n",
" \n",
" 3 | \n",
" 0.574337 | \n",
"
\n",
" \n",
" 4 | \n",
" 1.561390 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" x\n",
"0 -0.351447\n",
"1 -0.431129\n",
"2 -0.412738\n",
"3 0.574337\n",
"4 1.561390"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dm = design_matrices(\"0 + x\", data)\n",
"dm.common.as_dataframe().head()"
]
},
{
"cell_type": "markdown",
"id": "progressive-physics",
"metadata": {},
"source": [
"`\"y1 ~ -1 + x\"` and `\"y1 ~ x - 1\"` are equivalent alternatives."
]
},
{
"cell_type": "markdown",
"id": "british-mount",
"metadata": {},
"source": [
"## Operations between terms\n",
"\n",
"It is possible someone wants to use a term that is the result of adding two other terms, such as `x + z`. However, if we do `\"y ~ x + z\"`, formulae will understand we want a model with two predictor terms, `x`, and `z`. Fortunately, there are two (equivalent) ways of telling formulae that we want a term that is the result of performing operations between other terms. This can be achieved either by using the `I()` wrapper or its shorthand `{}`."
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "occupied-times",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Intercept | \n",
" I(x + z) | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1.0 | \n",
" 2.648553 | \n",
"
\n",
" \n",
" 1 | \n",
" 1.0 | \n",
" 2.568871 | \n",
"
\n",
" \n",
" 2 | \n",
" 1.0 | \n",
" 1.587262 | \n",
"
\n",
" \n",
" 3 | \n",
" 1.0 | \n",
" 3.574337 | \n",
"
\n",
" \n",
" 4 | \n",
" 1.0 | \n",
" 4.561390 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Intercept I(x + z)\n",
"0 1.0 2.648553\n",
"1 1.0 2.568871\n",
"2 1.0 1.587262\n",
"3 1.0 3.574337\n",
"4 1.0 4.561390"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dm = design_matrices(\"I(x + z)\", data)\n",
"dm.common.as_dataframe().head()"
]
},
{
"cell_type": "code",
"execution_count": 26,
"id": "seasonal-dollar",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Intercept | \n",
" I(x / z) | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1.0 | \n",
" -0.117149 | \n",
"
\n",
" \n",
" 1 | \n",
" 1.0 | \n",
" -0.143710 | \n",
"
\n",
" \n",
" 2 | \n",
" 1.0 | \n",
" -0.206369 | \n",
"
\n",
" \n",
" 3 | \n",
" 1.0 | \n",
" 0.191446 | \n",
"
\n",
" \n",
" 4 | \n",
" 1.0 | \n",
" 0.520463 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Intercept I(x / z)\n",
"0 1.0 -0.117149\n",
"1 1.0 -0.143710\n",
"2 1.0 -0.206369\n",
"3 1.0 0.191446\n",
"4 1.0 0.520463"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dm = design_matrices(\"{x / z}\", data)\n",
"dm.common.as_dataframe().head()"
]
},
{
"cell_type": "markdown",
"id": "changing-latter",
"metadata": {},
"source": [
"As you may have noticed, the output shows `I()` instead of `{}`. That's because `{}` is translated into `I()` internally."
]
},
{
"cell_type": "markdown",
"id": "exciting-azerbaijan",
"metadata": {},
"source": [
"## Categorical responses\n",
"\n",
"Formulae can work with categorical responses with 2 or more levels. When the variable contains only two levels, formulae flags it as binary.\n",
"\n",
"Recall `y2` is a categorical variable with levels `A`, `B` and `C`. In this case, formulae set the `binary` flag to `False` since this variable contains more than two categories."
]
},
{
"cell_type": "code",
"execution_count": 27,
"id": "blocked-configuration",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"ResponseVector \n",
" name: y2\n",
" kind: categoric\n",
" length: 10\n",
" binary: False\n",
" levels: ['A', 'B', 'C']\n",
"\n",
"To access the actual design vector do 'np.array(this_obj)'"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dm = design_matrices(\"y2 ~ x\", data)\n",
"dm.response"
]
},
{
"cell_type": "markdown",
"id": "9c604a55",
"metadata": {},
"source": [
"The `.design_vector` array is a dummy based design matrix where each column represents a level of the response"
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "julian-viking",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[0, 0, 1],\n",
" [0, 1, 0],\n",
" [0, 1, 0],\n",
" [0, 1, 0],\n",
" [1, 0, 0],\n",
" [1, 0, 0],\n",
" [0, 0, 1],\n",
" [0, 1, 0],\n",
" [1, 0, 0],\n",
" [1, 0, 0]])"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dm.response.design_vector"
]
},
{
"cell_type": "code",
"execution_count": 29,
"id": "24395f20",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[0, 0, 1],\n",
" [0, 1, 0],\n",
" [0, 1, 0],\n",
" [0, 1, 0],\n",
" [1, 0, 0],\n",
" [1, 0, 0],\n",
" [0, 0, 1],\n",
" [0, 1, 0],\n",
" [1, 0, 0],\n",
" [1, 0, 0]])"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.array(dm.response)"
]
},
{
"cell_type": "markdown",
"id": "amateur-contest",
"metadata": {},
"source": [
"It is also possible to override the default behavior and indicate the reference level with some syntax sugar. If you say `response[level]`, formulae will interpret `response` as a binary variable and set the `'level'` category as reference. For example"
]
},
{
"cell_type": "code",
"execution_count": 30,
"id": "social-bahrain",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"ResponseVector \n",
" name: y2\n",
" kind: categoric\n",
" length: 10\n",
" binary: True\n",
" success: B\n",
"\n",
"To access the actual design vector do 'np.array(this_obj)'"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dm = design_matrices(\"y2[B] ~ x\", data)\n",
"dm.response"
]
},
{
"cell_type": "code",
"execution_count": 31,
"id": "rapid-deposit",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([0, 1, 1, 1, 0, 0, 0, 1, 0, 0])"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# 'y2' is equal to B in four cases\n",
"np.array(dm.response)"
]
},
{
"cell_type": "markdown",
"id": "2168f2f3",
"metadata": {},
"source": [
"If it is the case that the reference level contains spaces, we can wrap the response within quotes:"
]
},
{
"cell_type": "code",
"execution_count": 32,
"id": "36be2de7",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"ResponseVector \n",
" name: g\n",
" kind: categoric\n",
" length: 10\n",
" binary: True\n",
" success: Group 3\n",
"\n",
"To access the actual design vector do 'np.array(this_obj)'"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dm = design_matrices(\"g['Group 3'] ~ x\", data)\n",
"dm.response"
]
},
{
"cell_type": "markdown",
"id": "e11ea741",
"metadata": {},
"source": [
"As with the predictors, the `binary()` function can be used on the LHS as well"
]
},
{
"cell_type": "code",
"execution_count": 33,
"id": "8cd32b33",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"ResponseVector \n",
" name: binary(g, Group 3)\n",
" kind: numeric\n",
" length: 10\n",
"\n",
"To access the actual design vector do 'np.array(this_obj)'"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dm = design_matrices(\"binary(g, 'Group 3') ~ x\", data)\n",
"dm.response"
]
},
{
"cell_type": "markdown",
"id": "eb3beadd",
"metadata": {},
"source": [
"Note `binary()` outputs a numeric variable, so the response is recognized as numeric instead of categoric."
]
},
{
"cell_type": "markdown",
"id": "cognitive-tablet",
"metadata": {},
"source": [
"## Group-specific effects\n",
"\n",
"formulae also handles group-specific effects. They are specified using the pipe operator `|`, similar to the [lme4](https://CRAN.R-project.org/package=lme4) R package. These terms are of the form `(expr|factor)` where the `expr` represents the effect and `factor` the group. [This vignette](https://cran.r-project.org/web/packages/lme4/vignettes/lmer.pdf) from lme4 provides much better information on how this notation works. In a nutshell, group-specific intercepts are specified via `(1|group)` and group-specific slopes are specified via `(variable|group)`. The latter also adds a group-specific intercept by default. This can be overriden doing `(0 + variable|group)`.\n",
"\n",
"First, let's see a formula specifying a common predictor, `x`, and a group-specific intercept for each level of `g`."
]
},
{
"cell_type": "code",
"execution_count": 34,
"id": "indirect-hanging",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"GroupEffectsMatrix with shape (10, 3)\n",
"Terms: \n",
" 1|g \n",
" kind: intercept\n",
" groups: ['Group 1', 'Group 2', 'Group 3']\n",
" columns: 0:3\n",
"\n",
"To access the actual design matrix do 'np.array(this_obj)'"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dm = design_matrices(\"y1 ~ x + (1|g)\", data)\n",
"dm.group"
]
},
{
"cell_type": "code",
"execution_count": 35,
"id": "bronze-ethnic",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Intercept | \n",
" x | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1.0 | \n",
" -0.351447 | \n",
"
\n",
" \n",
" 1 | \n",
" 1.0 | \n",
" -0.431129 | \n",
"
\n",
" \n",
" 2 | \n",
" 1.0 | \n",
" -0.412738 | \n",
"
\n",
" \n",
" 3 | \n",
" 1.0 | \n",
" 0.574337 | \n",
"
\n",
" \n",
" 4 | \n",
" 1.0 | \n",
" 1.561390 | \n",
"
\n",
" \n",
" 5 | \n",
" 1.0 | \n",
" -0.357137 | \n",
"
\n",
" \n",
" 6 | \n",
" 1.0 | \n",
" -0.453278 | \n",
"
\n",
" \n",
" 7 | \n",
" 1.0 | \n",
" -0.039111 | \n",
"
\n",
" \n",
" 8 | \n",
" 1.0 | \n",
" 1.882748 | \n",
"
\n",
" \n",
" 9 | \n",
" 1.0 | \n",
" 0.549615 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Intercept x\n",
"0 1.0 -0.351447\n",
"1 1.0 -0.431129\n",
"2 1.0 -0.412738\n",
"3 1.0 0.574337\n",
"4 1.0 1.561390\n",
"5 1.0 -0.357137\n",
"6 1.0 -0.453278\n",
"7 1.0 -0.039111\n",
"8 1.0 1.882748\n",
"9 1.0 0.549615"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dm.common.as_dataframe()"
]
},
{
"cell_type": "code",
"execution_count": 36,
"id": "material-strength",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[0, 1, 0],\n",
" [1, 0, 0],\n",
" [0, 1, 0],\n",
" [0, 1, 0],\n",
" [1, 0, 0],\n",
" [0, 1, 0],\n",
" [0, 0, 1],\n",
" [0, 1, 0],\n",
" [0, 1, 0],\n",
" [0, 0, 1]])"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.array(dm.group)"
]
},
{
"cell_type": "markdown",
"id": "accompanied-conviction",
"metadata": {},
"source": [
"If we now use `(x|g)` we get both the group-specific slope, as well as the intercept, because it is included by default."
]
},
{
"cell_type": "code",
"execution_count": 37,
"id": "parliamentary-access",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"GroupEffectsMatrix with shape (10, 6)\n",
"Terms: \n",
" 1|g \n",
" kind: intercept\n",
" groups: ['Group 1', 'Group 2', 'Group 3']\n",
" columns: 0:3\n",
" x|g \n",
" kind: numeric\n",
" groups: ['Group 1', 'Group 2', 'Group 3']\n",
" columns: 3:6\n",
"\n",
"To access the actual design matrix do 'np.array(this_obj)'"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dm = design_matrices(\"y1 ~ x + (x|g)\", data)\n",
"dm.group"
]
},
{
"cell_type": "markdown",
"id": "independent-sitting",
"metadata": {},
"source": [
"We can access the different sub-matrices corresponding to each term as follows"
]
},
{
"cell_type": "code",
"execution_count": 38,
"id": "fifth-grass",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[0., 1., 0.],\n",
" [1., 0., 0.],\n",
" [0., 1., 0.],\n",
" [0., 1., 0.],\n",
" [1., 0., 0.],\n",
" [0., 1., 0.],\n",
" [0., 0., 1.],\n",
" [0., 1., 0.],\n",
" [0., 1., 0.],\n",
" [0., 0., 1.]])"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dm.group[\"1|g\"] # same as before"
]
},
{
"cell_type": "code",
"execution_count": 39,
"id": "impressive-repeat",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[-0. , -0.351447 , -0. ],\n",
" [-0.43112895, -0. , -0. ],\n",
" [-0. , -0.41273793, -0. ],\n",
" [ 0. , 0.57433652, 0. ],\n",
" [ 1.56139001, 0. , 0. ],\n",
" [-0. , -0.35713698, -0. ],\n",
" [-0. , -0. , -0.45327849],\n",
" [-0. , -0.03911089, -0. ],\n",
" [ 0. , 1.88274763, 0. ],\n",
" [ 0. , 0. , 0.54961543]])"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dm.group[\"x|g\"]"
]
},
{
"cell_type": "markdown",
"id": "tutorial-movement",
"metadata": {},
"source": [
"But as with the common part, this default intercept can be removed"
]
},
{
"cell_type": "code",
"execution_count": 40,
"id": "stainless-scottish",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"GroupEffectsMatrix with shape (10, 3)\n",
"Terms: \n",
" x|g \n",
" kind: numeric\n",
" groups: ['Group 1', 'Group 2', 'Group 3']\n",
" columns: 0:3\n",
"\n",
"To access the actual design matrix do 'np.array(this_obj)'"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dm = design_matrices(\"y1 ~ x + (0 + x|g)\", data)\n",
"dm.group # note there is no 1|x in the terms"
]
},
{
"cell_type": "markdown",
"id": "great-timer",
"metadata": {},
"source": [
"As a final remark, we note that the `expr` part of a group-specific can also be a categorical variable or an interaction. Similar rules apply for the `factor` part."
]
},
{
"cell_type": "markdown",
"id": "complete-lebanon",
"metadata": {},
"source": [
"## Evaluating new data"
]
},
{
"cell_type": "markdown",
"id": "departmental-federation",
"metadata": {},
"source": [
"Both `CommonEffectsMatrix` and `GroupEffectsMatrix` are provided with a method that receives a data frame, with the same structure as the original data frame, and returns a new design matrix.\n",
"\n",
"Suppose we have measured three variables, `y`, `x` and `g` on some subjects. `y` and `x` are numerics, while `g` is a categorical variable that represents some grouping. We also have a model with a common slope, and both varying intercepts and slopes, where we want to standardize the numeric predictor before entering the model. The formula would be `y ~ scale(x) + (scale(x)|g)`."
]
},
{
"cell_type": "code",
"execution_count": 41,
"id": "hairy-monster",
"metadata": {},
"outputs": [],
"source": [
"data = pd.DataFrame({\n",
" \"y\": rng.normal(size=50),\n",
" \"x\": rng.normal(loc=200, scale=20, size=50),\n",
" \"g\": rng.choice([\"A\", \"B\", \"C\"], size=50)\n",
"})"
]
},
{
"cell_type": "code",
"execution_count": 42,
"id": "bibliographic-impression",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CommonEffectsMatrix with shape (50, 2)\n",
"Terms: \n",
" Intercept \n",
" kind: intercept\n",
" column: 0\n",
" scale(x) \n",
" kind: numeric\n",
" column: 1\n",
"\n",
"To access the actual design matrix do 'np.array(this_obj)'\n"
]
}
],
"source": [
"dm = design_matrices(\"y ~ scale(x) + (scale(x)|g)\", data)\n",
"print(dm.common)"
]
},
{
"cell_type": "code",
"execution_count": 43,
"id": "hollywood-professional",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"GroupEffectsMatrix with shape (50, 6)\n",
"Terms: \n",
" 1|g \n",
" kind: intercept\n",
" groups: ['A', 'B', 'C']\n",
" columns: 0:3\n",
" scale(x)|g \n",
" kind: numeric\n",
" groups: ['A', 'B', 'C']\n",
" columns: 3:6\n",
"\n",
"To access the actual design matrix do 'np.array(this_obj)'\n"
]
}
],
"source": [
"print(dm.group)"
]
},
{
"cell_type": "markdown",
"id": "worst-pollution",
"metadata": {},
"source": [
"And now suppose we have new data for a set of new subjects. We can derive both common and group-specific design matrices for these new individuals without having to specify the model again. We can instead use the `.evaluate_new_data()` method. This method takes a data frame and returns either a new `CommonEffectsMatrix` or a new `GroupEffectsMatrix` depending on which object the method was called."
]
},
{
"cell_type": "code",
"execution_count": 44,
"id": "transsexual-luxury",
"metadata": {},
"outputs": [],
"source": [
"data2 = pd.DataFrame({\n",
" \"x\": rng.normal(loc=180, scale=20,size=10),\n",
" \"g\": rng.choice([\"A\", \"B\", \"C\"], size=10)\n",
"})"
]
},
{
"cell_type": "code",
"execution_count": 45,
"id": "assumed-header",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"CommonEffectsMatrix with shape (10, 2)\n",
"Terms: \n",
" Intercept \n",
" kind: intercept\n",
" column: 0\n",
" scale(x) \n",
" kind: numeric\n",
" column: 1\n",
"\n",
"To access the actual design matrix do 'np.array(this_obj)'"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"common_new = dm.common.evaluate_new_data(data2)\n",
"common_new"
]
},
{
"cell_type": "markdown",
"id": "transsexual-objective",
"metadata": {},
"source": [
"And if we explore the design matrix for the common terms we notice the scaling of the new data used the mean and standard deviation `x` in the first dataset."
]
},
{
"cell_type": "code",
"execution_count": 46,
"id": "activated-danish",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Intercept | \n",
" scale(x) | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1.0 | \n",
" -1.111751 | \n",
"
\n",
" \n",
" 1 | \n",
" 1.0 | \n",
" -1.113193 | \n",
"
\n",
" \n",
" 2 | \n",
" 1.0 | \n",
" -0.820436 | \n",
"
\n",
" \n",
" 3 | \n",
" 1.0 | \n",
" -2.196410 | \n",
"
\n",
" \n",
" 4 | \n",
" 1.0 | \n",
" -2.131558 | \n",
"
\n",
" \n",
" 5 | \n",
" 1.0 | \n",
" -1.244299 | \n",
"
\n",
" \n",
" 6 | \n",
" 1.0 | \n",
" -1.076445 | \n",
"
\n",
" \n",
" 7 | \n",
" 1.0 | \n",
" -0.859853 | \n",
"
\n",
" \n",
" 8 | \n",
" 1.0 | \n",
" -0.012279 | \n",
"
\n",
" \n",
" 9 | \n",
" 1.0 | \n",
" -0.351255 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Intercept scale(x)\n",
"0 1.0 -1.111751\n",
"1 1.0 -1.113193\n",
"2 1.0 -0.820436\n",
"3 1.0 -2.196410\n",
"4 1.0 -2.131558\n",
"5 1.0 -1.244299\n",
"6 1.0 -1.076445\n",
"7 1.0 -0.859853\n",
"8 1.0 -0.012279\n",
"9 1.0 -0.351255"
]
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"common_new.as_dataframe()"
]
},
{
"cell_type": "markdown",
"id": "honest-catholic",
"metadata": {},
"source": [
"Something similar can be done with the group-specific part."
]
},
{
"cell_type": "code",
"execution_count": 47,
"id": "cloudy-extreme",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"GroupEffectsMatrix with shape (10, 6)\n",
"Terms: \n",
" 1|g \n",
" kind: intercept\n",
" groups: ['A', 'B', 'C']\n",
" columns: 0:3\n",
" scale(x)|g \n",
" kind: numeric\n",
" groups: ['A', 'B', 'C']\n",
" columns: 3:6\n",
"\n",
"To access the actual design matrix do 'np.array(this_obj)'"
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"group_new = dm.group.evaluate_new_data(data2)\n",
"group_new"
]
},
{
"cell_type": "code",
"execution_count": 48,
"id": "signal-seventh",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[-0. , -1.11175056, -0. ],\n",
" [-0. , -1.11319326, -0. ],\n",
" [-0. , -0. , -0.82043629],\n",
" [-2.19640995, -0. , -0. ],\n",
" [-0. , -2.13155813, -0. ],\n",
" [-1.24429866, -0. , -0. ],\n",
" [-1.07644467, -0. , -0. ],\n",
" [-0. , -0.85985319, -0. ],\n",
" [-0.01227858, -0. , -0. ],\n",
" [-0.35125497, -0. , -0. ]])"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"group_new[\"scale(x)|g\"]"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
},
"nbformat": 4,
"nbformat_minor": 5
}