{ "cells": [ { "cell_type": "markdown", "id": "editorial-support", "metadata": {}, "source": [ "# Getting started" ] }, { "cell_type": "markdown", "id": "collectible-gather", "metadata": {}, "source": [ "formulae requires a working Python interpreter (3.7+) and the libraries NumPy, SciPy and pandas with versions specified in the [requirements.txt](https://github.com/bambinos/formulae/blob/master/requirements.txt) file.\n", "\n", "Assuming a standard Python environment is installed on your machine (including pip), the latest release of formulae can be installed in one line using pip:\n", "\n", "`pip install formulae`\n", "\n", "Alternatively, if you want the development version of the package you can install from GitHub:\n", "\n", "`pip install git+https://github.com/bambinos/formulae.git`" ] }, { "cell_type": "markdown", "id": "emerging-april", "metadata": {}, "source": [ "## User guide\n", "\n", "The main function is `design_matrices()`. It takes a model formula and a pandas DataFrame and returns an object of class `DesignMatrices` that contains information about the response, the common effects, and the group specific effects that can be accessed with the attributes `.response`, `.common`, and `.group` respectively. \n", "\n", "\n", "Model formulas are much like the very famous model formulas in the R programming language. The extension to group-specific effects is similar to the extension provided by the [lme4](https://CRAN.R-project.org/package=lme4) package, also in R." ] }, { "cell_type": "markdown", "id": "fatty-dividend", "metadata": {}, "source": [ "## Setup" ] }, { "cell_type": "code", "execution_count": 1, "id": "checked-moisture", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "\n", "from formulae import design_matrices" ] }, { "cell_type": "markdown", "id": "functioning-petroleum", "metadata": {}, "source": [ "Let's simulate some data to use throughout examples here. The number of observations isn't too important. We keep it low to understand what is going on more easily. However, we do include both numeric and categoric variables to see how formulae handles these types in different scenarios." ] }, { "cell_type": "code", "execution_count": 2, "id": "scientific-armenia", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
y1y2xzg
00.200521C-0.3514473Group 2
1-0.421306B-0.4311293Group 1
2-0.490128B-0.4127382Group 2
3-0.929507B0.5743373Group 2
41.515838A1.5613903Group 1
5-1.180238A-0.3571371Group 2
6-1.839936C-0.4532783Group 3
7-0.330336B-0.0391113Group 2
8-0.797019A1.8827482Group 2
9-0.030036A0.5496153Group 3
\n", "
" ], "text/plain": [ " y1 y2 x z g\n", "0 0.200521 C -0.351447 3 Group 2\n", "1 -0.421306 B -0.431129 3 Group 1\n", "2 -0.490128 B -0.412738 2 Group 2\n", "3 -0.929507 B 0.574337 3 Group 2\n", "4 1.515838 A 1.561390 3 Group 1\n", "5 -1.180238 A -0.357137 1 Group 2\n", "6 -1.839936 C -0.453278 3 Group 3\n", "7 -0.330336 B -0.039111 3 Group 2\n", "8 -0.797019 A 1.882748 2 Group 2\n", "9 -0.030036 A 0.549615 3 Group 3" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rng = np.random.default_rng(7355608)\n", "SIZE = 10\n", "data = pd.DataFrame(\n", " {\n", " \"y1\": rng.normal(size=SIZE),\n", " \"y2\": rng.choice([\"A\", \"B\", \"C\"], size=SIZE),\n", " \"x\": rng.normal(size=SIZE),\n", " \"z\": rng.choice([1, 2, 3], size=SIZE),\n", " \"g\": rng.choice([\"Group 1\", \"Group 2\", \"Group 3\"], size=SIZE),\n", " }\n", ")\n", "data" ] }, { "cell_type": "markdown", "id": "bronze-diving", "metadata": {}, "source": [ "## Creating and accessing design matrices" ] }, { "cell_type": "markdown", "id": "geological-reserve", "metadata": {}, "source": [ "Let's create a `DesignMatrices` object with a numeric response and a numeric predictor. The first argument we pass is the model formula, and the second is the data frame where we take variables from." ] }, { "cell_type": "code", "execution_count": 3, "id": "conservative-vision", "metadata": {}, "outputs": [], "source": [ "dm = design_matrices(\"y1 ~ x\", data)" ] }, { "cell_type": "markdown", "id": "requested-auction", "metadata": {}, "source": [ "Under `dm.common` we find an object of class `CommonEffectsMatrix` which stores the design matrix for the common part of the model, as well as other information about the terms in the matrix." ] }, { "cell_type": "code", "execution_count": 4, "id": "accepted-electronics", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "CommonEffectsMatrix with shape (10, 2)\n", "Terms: \n", " Intercept \n", " kind: intercept\n", " column: 0\n", " x \n", " kind: numeric\n", " column: 1\n", "\n", "To access the actual design matrix do 'np.array(this_obj)'" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dm.common" ] }, { "cell_type": "markdown", "id": "banned-timothy", "metadata": {}, "source": [ "Use `.design_matrix` to obtain the design matrix for the common part of the model." ] }, { "cell_type": "code", "execution_count": 5, "id": "healthy-substitute", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[ 1. , -0.351447 ],\n", " [ 1. , -0.43112895],\n", " [ 1. , -0.41273793],\n", " [ 1. , 0.57433652],\n", " [ 1. , 1.56139001],\n", " [ 1. , -0.35713698],\n", " [ 1. , -0.45327849],\n", " [ 1. , -0.03911089],\n", " [ 1. , 1.88274763],\n", " [ 1. , 0.54961543]])" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dm.common.design_matrix" ] }, { "cell_type": "markdown", "id": "90ef5349", "metadata": {}, "source": [ "Or `np.array(dm.common)`" ] }, { "cell_type": "code", "execution_count": 6, "id": "bf9039c9", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[ 1. , -0.351447 ],\n", " [ 1. , -0.43112895],\n", " [ 1. , -0.41273793],\n", " [ 1. , 0.57433652],\n", " [ 1. , 1.56139001],\n", " [ 1. , -0.35713698],\n", " [ 1. , -0.45327849],\n", " [ 1. , -0.03911089],\n", " [ 1. , 1.88274763],\n", " [ 1. , 0.54961543]])" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.array(dm.common)" ] }, { "cell_type": "markdown", "id": "professional-pride", "metadata": {}, "source": [ "Or call the `.as_dataframe()` method to return a data frame." ] }, { "cell_type": "code", "execution_count": 7, "id": "egyptian-plate", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Interceptx
01.0-0.351447
11.0-0.431129
21.0-0.412738
31.00.574337
41.01.561390
\n", "
" ], "text/plain": [ " Intercept x\n", "0 1.0 -0.351447\n", "1 1.0 -0.431129\n", "2 1.0 -0.412738\n", "3 1.0 0.574337\n", "4 1.0 1.561390" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dm.common.as_dataframe().head()" ] }, { "cell_type": "markdown", "id": "f41c55db", "metadata": {}, "source": [ "Note the response term can be omitted and we still obtain the design matrix for the common predictors" ] }, { "cell_type": "code", "execution_count": 8, "id": "3fb94076", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Interceptx
01.0-0.351447
11.0-0.431129
21.0-0.412738
31.00.574337
41.01.561390
\n", "
" ], "text/plain": [ " Intercept x\n", "0 1.0 -0.351447\n", "1 1.0 -0.431129\n", "2 1.0 -0.412738\n", "3 1.0 0.574337\n", "4 1.0 1.561390" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dm = design_matrices(\"x\", data)\n", "dm.common.as_dataframe().head()" ] }, { "cell_type": "markdown", "id": "temporal-wisconsin", "metadata": {}, "source": [ "## Categorical predictors\n", "\n", "Variables with categorical type in the data frame are recognized and handled as such. If there's a numerical variable that should be interpreted as categorical and you don't want to manipulate your data manually, you can wrap the name of the variable with `C()`, like `C(variable)`, and formulae will convert it to the categorical type.\n", "\n", "Categorical predictors are encoded using reference level encoding (i.e. the first level is taken as baseline). This way, the resulting design matrix is of full-rank." ] }, { "cell_type": "code", "execution_count": 9, "id": "scheduled-repair", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Interceptg[Group 2]g[Group 3]
01.01.00.0
11.00.00.0
21.01.00.0
31.01.00.0
41.00.00.0
\n", "
" ], "text/plain": [ " Intercept g[Group 2] g[Group 3]\n", "0 1.0 1.0 0.0\n", "1 1.0 0.0 0.0\n", "2 1.0 1.0 0.0\n", "3 1.0 1.0 0.0\n", "4 1.0 0.0 0.0" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dm = design_matrices(\"g\", data)\n", "dm.common.as_dataframe().head()" ] }, { "cell_type": "markdown", "id": "passing-shopper", "metadata": {}, "source": [ "But if we don't have an intercept term, there's no need to use reference encoding for the categorical variables to avoid linear dependencies between the columns and formulae uses cell-means encoding." ] }, { "cell_type": "code", "execution_count": 10, "id": "exciting-ultimate", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
g[Group 1]g[Group 2]g[Group 3]
0010
1100
2010
3010
4100
\n", "
" ], "text/plain": [ " g[Group 1] g[Group 2] g[Group 3]\n", "0 0 1 0\n", "1 1 0 0\n", "2 0 1 0\n", "3 0 1 0\n", "4 1 0 0" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dm = design_matrices(\"0 + g\", data)\n", "dm.common.as_dataframe().head()" ] }, { "cell_type": "markdown", "id": "after-label", "metadata": {}, "source": [ "Suppose that `z` actually represents a certain categorization represented by numbers. We can use `\"y1 ~ C(z)\"` and `z` will be internally interpreted as categorical instead of numeric." ] }, { "cell_type": "code", "execution_count": 11, "id": "final-lucas", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
InterceptC(z)[2]C(z)[3]
01.00.01.0
11.00.01.0
21.01.00.0
31.00.01.0
41.00.01.0
\n", "
" ], "text/plain": [ " Intercept C(z)[2] C(z)[3]\n", "0 1.0 0.0 1.0\n", "1 1.0 0.0 1.0\n", "2 1.0 1.0 0.0\n", "3 1.0 0.0 1.0\n", "4 1.0 0.0 1.0" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dm = design_matrices(\"C(z)\", data)\n", "dm.common.as_dataframe().head()" ] }, { "cell_type": "markdown", "id": "sought-cover", "metadata": {}, "source": [ "By default, the `C()` wrapper takes the first level as the reference level (after being sorted with `sorted()`), but we can override this by passing a list with a custom ordering. " ] }, { "cell_type": "code", "execution_count": 12, "id": "sustained-orleans", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
InterceptC(z, levels = lvl)[3]C(z, levels = lvl)[1]
01.01.00.0
11.01.00.0
21.00.00.0
31.01.00.0
41.01.00.0
\n", "
" ], "text/plain": [ " Intercept C(z, levels = lvl)[3] C(z, levels = lvl)[1]\n", "0 1.0 1.0 0.0\n", "1 1.0 1.0 0.0\n", "2 1.0 0.0 0.0\n", "3 1.0 1.0 0.0\n", "4 1.0 1.0 0.0" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 2 is taken as baseline, and then it comes 3 and 1.\n", "lvl = [2, 3, 1]\n", "dm = design_matrices(\"C(z, levels=lvl)\", data)\n", "dm.common.as_dataframe().head()" ] }, { "cell_type": "markdown", "id": "03178207", "metadata": {}, "source": [ "Finally, if we want to convert one variable into a binary variable, we can use the internal `binary()` function:" ] }, { "cell_type": "code", "execution_count": 13, "id": "2689c46e", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Interceptbinary(z, 2)
010
110
211
310
410
\n", "
" ], "text/plain": [ " Intercept binary(z, 2)\n", "0 1 0\n", "1 1 0\n", "2 1 1\n", "3 1 0\n", "4 1 0" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dm = design_matrices(\"binary(z, 2)\", data)\n", "dm.common.as_dataframe().head()" ] }, { "cell_type": "markdown", "id": "aa87b287", "metadata": {}, "source": [ "which also works for categorical variables" ] }, { "cell_type": "code", "execution_count": 14, "id": "4dc247ed", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Interceptbinary(g, Group 1)
010
111
210
310
411
\n", "
" ], "text/plain": [ " Intercept binary(g, Group 1)\n", "0 1 0\n", "1 1 1\n", "2 1 0\n", "3 1 0\n", "4 1 1" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dm = design_matrices(\"binary(g, 'Group 1')\", data)\n", "dm.common.as_dataframe().head()" ] }, { "cell_type": "markdown", "id": "perfect-roller", "metadata": {}, "source": [ "## Adding interactions\n", "\n", "The interaction operator is `:`." ] }, { "cell_type": "code", "execution_count": 15, "id": "convertible-tactics", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Interceptx:z
01.0-1.054341
11.0-1.293387
21.0-0.825476
31.01.723010
41.04.684170
\n", "
" ], "text/plain": [ " Intercept x:z\n", "0 1.0 -1.054341\n", "1 1.0 -1.293387\n", "2 1.0 -0.825476\n", "3 1.0 1.723010\n", "4 1.0 4.684170" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dm = design_matrices(\"x:z\", data)\n", "dm.common.as_dataframe().head()" ] }, { "cell_type": "markdown", "id": "happy-lobby", "metadata": {}, "source": [ "The `*` operator is known as the full interaction operator because `x*z` is equivalent to `x + z + x:z`" ] }, { "cell_type": "code", "execution_count": 16, "id": "impressive-attention", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Interceptxzx:z
01.0-0.3514473.0-1.054341
11.0-0.4311293.0-1.293387
21.0-0.4127382.0-0.825476
31.00.5743373.01.723010
41.01.5613903.04.684170
\n", "
" ], "text/plain": [ " Intercept x z x:z\n", "0 1.0 -0.351447 3.0 -1.054341\n", "1 1.0 -0.431129 3.0 -1.293387\n", "2 1.0 -0.412738 2.0 -0.825476\n", "3 1.0 0.574337 3.0 1.723010\n", "4 1.0 1.561390 3.0 4.684170" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dm = design_matrices(\"x*z\", data)\n", "dm.common.as_dataframe().head()" ] }, { "cell_type": "markdown", "id": "greater-scale", "metadata": {}, "source": [ "And both interaction operators can be used with mixtures of numerical and categorical variables." ] }, { "cell_type": "code", "execution_count": 17, "id": "liberal-bandwidth", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
xg[Group 1]g[Group 2]g[Group 3]x:g[Group 2]x:g[Group 3]
0-0.3514470.01.00.0-0.351447-0.0
1-0.4311291.00.00.0-0.000000-0.0
2-0.4127380.01.00.0-0.412738-0.0
30.5743370.01.00.00.5743370.0
41.5613901.00.00.00.0000000.0
\n", "
" ], "text/plain": [ " x g[Group 1] g[Group 2] g[Group 3] x:g[Group 2] x:g[Group 3]\n", "0 -0.351447 0.0 1.0 0.0 -0.351447 -0.0\n", "1 -0.431129 1.0 0.0 0.0 -0.000000 -0.0\n", "2 -0.412738 0.0 1.0 0.0 -0.412738 -0.0\n", "3 0.574337 0.0 1.0 0.0 0.574337 0.0\n", "4 1.561390 1.0 0.0 0.0 0.000000 0.0" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dm = design_matrices(\"0 + x*g\", data)\n", "dm.common.as_dataframe().head()" ] }, { "cell_type": "markdown", "id": "filled-wichita", "metadata": {}, "source": [ "## Function calls\n", "\n", "A term does not need to be just a variable in a data frame. It can also be the result of a function call. Formulae allows you to built-in functions as well as functions available from the top-level environment where `design_matrices` is being called.\n", "\n", "Since we've loaded the NumPy library as `np`, we can access its namespace as we would do in regular Python code." ] }, { "cell_type": "code", "execution_count": 18, "id": "angry-consistency", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Interceptnp.exp(x)
01.00.703669
11.00.649775
21.00.661836
31.01.775952
41.04.765441
\n", "
" ], "text/plain": [ " Intercept np.exp(x)\n", "0 1.0 0.703669\n", "1 1.0 0.649775\n", "2 1.0 0.661836\n", "3 1.0 1.775952\n", "4 1.0 4.765441" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dm = design_matrices(\"np.exp(x)\", data)\n", "dm.common.as_dataframe().head()" ] }, { "cell_type": "markdown", "id": "educated-kazakhstan", "metadata": {}, "source": [ "We can also use our own custom functions" ] }, { "cell_type": "code", "execution_count": 19, "id": "voluntary-compatibility", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Interceptadd(x, 10)
01.09.648553
11.09.568871
21.09.587262
31.010.574337
41.011.561390
\n", "
" ], "text/plain": [ " Intercept add(x, 10)\n", "0 1.0 9.648553\n", "1 1.0 9.568871\n", "2 1.0 9.587262\n", "3 1.0 10.574337\n", "4 1.0 11.561390" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def add(x, how_much):\n", " return x + how_much\n", "\n", "dm = design_matrices(\"add(x, 10)\", data)\n", "dm.common.as_dataframe().head()" ] }, { "cell_type": "markdown", "id": "checked-corrections", "metadata": {}, "source": [ "Or even nested function calls!" ] }, { "cell_type": "code", "execution_count": 20, "id": "ready-application", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Interceptadd(np.exp(x), 5)
01.05.703669
11.05.649775
21.05.661836
31.06.775952
41.09.765441
\n", "
" ], "text/plain": [ " Intercept add(np.exp(x), 5)\n", "0 1.0 5.703669\n", "1 1.0 5.649775\n", "2 1.0 5.661836\n", "3 1.0 6.775952\n", "4 1.0 9.765441" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dm = design_matrices(\"add(np.exp(x), 5)\", data)\n", "dm.common.as_dataframe().head()" ] }, { "cell_type": "markdown", "id": "reported-conditioning", "metadata": {}, "source": [ "## Built-in transformations\n", "\n", "Formulae comes with some built-in transformations that come very handy when doing statistical modeling. These functions are\n", "\n", "* `center()`: Centers the variable around its mean.\n", "* `scale()`: Centers the variable around its mean and divides it by its standard deviation.\n", "* `standardize()`: An alias for `scale()`.\n", "* `bs()`: Create a Basis Spline, a.k.a. B-Spline. " ] }, { "cell_type": "code", "execution_count": 21, "id": "flexible-watershed", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Interceptscale(x)
01.0-0.732600
11.0-0.829284
21.0-0.806969
31.00.390720
41.01.588384
\n", "
" ], "text/plain": [ " Intercept scale(x)\n", "0 1.0 -0.732600\n", "1 1.0 -0.829284\n", "2 1.0 -0.806969\n", "3 1.0 0.390720\n", "4 1.0 1.588384" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dm = design_matrices(\"scale(x)\", data)\n", "dm.common.as_dataframe().head()" ] }, { "cell_type": "code", "execution_count": 22, "id": "lovely-arlington", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1.0\n", "-2.2204460492503132e-17\n" ] } ], "source": [ "# check mean is 0 and std is 1\n", "print(np.std(dm.common.as_dataframe()[\"scale(x)\"]))\n", "print(np.mean(dm.common.as_dataframe()[\"scale(x)\"]))" ] }, { "cell_type": "code", "execution_count": 23, "id": "2567616c", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Interceptbs(x, df = 6)[0]bs(x, df = 6)[1]bs(x, df = 6)[2]bs(x, df = 6)[3]bs(x, df = 6)[4]bs(x, df = 6)[5]
01.00.2810920.6552370.0634330.0002370.0000000.000000e+00
11.00.7037420.0868200.0007570.0000000.0000000.000000e+00
21.00.7378380.2408730.0046440.0000000.0000000.000000e+00
31.00.0000000.0000000.3593800.5023440.1382761.039084e-07
41.00.0000000.0000000.0053250.0929990.4703744.313026e-01
51.00.3129500.6318760.0550130.0001610.0000000.000000e+00
61.00.0000000.0000000.0000000.0000000.0000000.000000e+00
71.00.0000000.2969840.6111610.0907000.0011550.000000e+00
81.00.0000000.0000000.0000000.0000000.0000001.000000e+00
91.00.0000000.0000080.3801230.4944940.1253750.000000e+00
\n", "
" ], "text/plain": [ " Intercept bs(x, df = 6)[0] bs(x, df = 6)[1] bs(x, df = 6)[2] \\\n", "0 1.0 0.281092 0.655237 0.063433 \n", "1 1.0 0.703742 0.086820 0.000757 \n", "2 1.0 0.737838 0.240873 0.004644 \n", "3 1.0 0.000000 0.000000 0.359380 \n", "4 1.0 0.000000 0.000000 0.005325 \n", "5 1.0 0.312950 0.631876 0.055013 \n", "6 1.0 0.000000 0.000000 0.000000 \n", "7 1.0 0.000000 0.296984 0.611161 \n", "8 1.0 0.000000 0.000000 0.000000 \n", "9 1.0 0.000000 0.000008 0.380123 \n", "\n", " bs(x, df = 6)[3] bs(x, df = 6)[4] bs(x, df = 6)[5] \n", "0 0.000237 0.000000 0.000000e+00 \n", "1 0.000000 0.000000 0.000000e+00 \n", "2 0.000000 0.000000 0.000000e+00 \n", "3 0.502344 0.138276 1.039084e-07 \n", "4 0.092999 0.470374 4.313026e-01 \n", "5 0.000161 0.000000 0.000000e+00 \n", "6 0.000000 0.000000 0.000000e+00 \n", "7 0.090700 0.001155 0.000000e+00 \n", "8 0.000000 0.000000 1.000000e+00 \n", "9 0.494494 0.125375 0.000000e+00 " ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dm = design_matrices(\"bs(x, df=6)\", data)\n", "dm.common.as_dataframe()" ] }, { "cell_type": "markdown", "id": "experienced-baghdad", "metadata": {}, "source": [ "These transformations are known as **stateful transformations** because they remember the original values of parameters involved in the transformation (the mean and the standard deviation). This feature is critical when one wants to derive a design matrix based on new data, as required when evaluating the fit of a model on a new dataset." ] }, { "cell_type": "markdown", "id": "genuine-engineering", "metadata": {}, "source": [ "### Suppresing the default intercept\n", "\n", "By default, formulae includes an intercept in the resulting desing matrix. But this can be ommited. Just add a `0` to the model formula as in the following example. " ] }, { "cell_type": "code", "execution_count": 24, "id": "sharp-simple", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
x
0-0.351447
1-0.431129
2-0.412738
30.574337
41.561390
\n", "
" ], "text/plain": [ " x\n", "0 -0.351447\n", "1 -0.431129\n", "2 -0.412738\n", "3 0.574337\n", "4 1.561390" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dm = design_matrices(\"0 + x\", data)\n", "dm.common.as_dataframe().head()" ] }, { "cell_type": "markdown", "id": "progressive-physics", "metadata": {}, "source": [ "`\"y1 ~ -1 + x\"` and `\"y1 ~ x - 1\"` are equivalent alternatives." ] }, { "cell_type": "markdown", "id": "british-mount", "metadata": {}, "source": [ "## Operations between terms\n", "\n", "It is possible someone wants to use a term that is the result of adding two other terms, such as `x + z`. However, if we do `\"y ~ x + z\"`, formulae will understand we want a model with two predictor terms, `x`, and `z`. Fortunately, there are two (equivalent) ways of telling formulae that we want a term that is the result of performing operations between other terms. This can be achieved either by using the `I()` wrapper or its shorthand `{}`." ] }, { "cell_type": "code", "execution_count": 25, "id": "occupied-times", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
InterceptI(x + z)
01.02.648553
11.02.568871
21.01.587262
31.03.574337
41.04.561390
\n", "
" ], "text/plain": [ " Intercept I(x + z)\n", "0 1.0 2.648553\n", "1 1.0 2.568871\n", "2 1.0 1.587262\n", "3 1.0 3.574337\n", "4 1.0 4.561390" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dm = design_matrices(\"I(x + z)\", data)\n", "dm.common.as_dataframe().head()" ] }, { "cell_type": "code", "execution_count": 26, "id": "seasonal-dollar", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
InterceptI(x / z)
01.0-0.117149
11.0-0.143710
21.0-0.206369
31.00.191446
41.00.520463
\n", "
" ], "text/plain": [ " Intercept I(x / z)\n", "0 1.0 -0.117149\n", "1 1.0 -0.143710\n", "2 1.0 -0.206369\n", "3 1.0 0.191446\n", "4 1.0 0.520463" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dm = design_matrices(\"{x / z}\", data)\n", "dm.common.as_dataframe().head()" ] }, { "cell_type": "markdown", "id": "changing-latter", "metadata": {}, "source": [ "As you may have noticed, the output shows `I()` instead of `{}`. That's because `{}` is translated into `I()` internally." ] }, { "cell_type": "markdown", "id": "exciting-azerbaijan", "metadata": {}, "source": [ "## Categorical responses\n", "\n", "Formulae can work with categorical responses with 2 or more levels. When the variable contains only two levels, formulae flags it as binary.\n", "\n", "Recall `y2` is a categorical variable with levels `A`, `B` and `C`. In this case, formulae set the `binary` flag to `False` since this variable contains more than two categories." ] }, { "cell_type": "code", "execution_count": 27, "id": "blocked-configuration", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ResponseVector \n", " name: y2\n", " kind: categoric\n", " length: 10\n", " binary: False\n", " levels: ['A', 'B', 'C']\n", "\n", "To access the actual design vector do 'np.array(this_obj)'" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dm = design_matrices(\"y2 ~ x\", data)\n", "dm.response" ] }, { "cell_type": "markdown", "id": "9c604a55", "metadata": {}, "source": [ "The `.design_vector` array is a dummy based design matrix where each column represents a level of the response" ] }, { "cell_type": "code", "execution_count": 28, "id": "julian-viking", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[0, 0, 1],\n", " [0, 1, 0],\n", " [0, 1, 0],\n", " [0, 1, 0],\n", " [1, 0, 0],\n", " [1, 0, 0],\n", " [0, 0, 1],\n", " [0, 1, 0],\n", " [1, 0, 0],\n", " [1, 0, 0]])" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dm.response.design_vector" ] }, { "cell_type": "code", "execution_count": 29, "id": "24395f20", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[0, 0, 1],\n", " [0, 1, 0],\n", " [0, 1, 0],\n", " [0, 1, 0],\n", " [1, 0, 0],\n", " [1, 0, 0],\n", " [0, 0, 1],\n", " [0, 1, 0],\n", " [1, 0, 0],\n", " [1, 0, 0]])" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.array(dm.response)" ] }, { "cell_type": "markdown", "id": "amateur-contest", "metadata": {}, "source": [ "It is also possible to override the default behavior and indicate the reference level with some syntax sugar. If you say `response[level]`, formulae will interpret `response` as a binary variable and set the `'level'` category as reference. For example" ] }, { "cell_type": "code", "execution_count": 30, "id": "social-bahrain", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ResponseVector \n", " name: y2\n", " kind: categoric\n", " length: 10\n", " binary: True\n", " success: B\n", "\n", "To access the actual design vector do 'np.array(this_obj)'" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dm = design_matrices(\"y2[B] ~ x\", data)\n", "dm.response" ] }, { "cell_type": "code", "execution_count": 31, "id": "rapid-deposit", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0, 1, 1, 1, 0, 0, 0, 1, 0, 0])" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 'y2' is equal to B in four cases\n", "np.array(dm.response)" ] }, { "cell_type": "markdown", "id": "2168f2f3", "metadata": {}, "source": [ "If it is the case that the reference level contains spaces, we can wrap the response within quotes:" ] }, { "cell_type": "code", "execution_count": 32, "id": "36be2de7", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ResponseVector \n", " name: g\n", " kind: categoric\n", " length: 10\n", " binary: True\n", " success: Group 3\n", "\n", "To access the actual design vector do 'np.array(this_obj)'" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dm = design_matrices(\"g['Group 3'] ~ x\", data)\n", "dm.response" ] }, { "cell_type": "markdown", "id": "e11ea741", "metadata": {}, "source": [ "As with the predictors, the `binary()` function can be used on the LHS as well" ] }, { "cell_type": "code", "execution_count": 33, "id": "8cd32b33", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ResponseVector \n", " name: binary(g, Group 3)\n", " kind: numeric\n", " length: 10\n", "\n", "To access the actual design vector do 'np.array(this_obj)'" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dm = design_matrices(\"binary(g, 'Group 3') ~ x\", data)\n", "dm.response" ] }, { "cell_type": "markdown", "id": "eb3beadd", "metadata": {}, "source": [ "Note `binary()` outputs a numeric variable, so the response is recognized as numeric instead of categoric." ] }, { "cell_type": "markdown", "id": "cognitive-tablet", "metadata": {}, "source": [ "## Group-specific effects\n", "\n", "formulae also handles group-specific effects. They are specified using the pipe operator `|`, similar to the [lme4](https://CRAN.R-project.org/package=lme4) R package. These terms are of the form `(expr|factor)` where the `expr` represents the effect and `factor` the group. [This vignette](https://cran.r-project.org/web/packages/lme4/vignettes/lmer.pdf) from lme4 provides much better information on how this notation works. In a nutshell, group-specific intercepts are specified via `(1|group)` and group-specific slopes are specified via `(variable|group)`. The latter also adds a group-specific intercept by default. This can be overriden doing `(0 + variable|group)`.\n", "\n", "First, let's see a formula specifying a common predictor, `x`, and a group-specific intercept for each level of `g`." ] }, { "cell_type": "code", "execution_count": 34, "id": "indirect-hanging", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "GroupEffectsMatrix with shape (10, 3)\n", "Terms: \n", " 1|g \n", " kind: intercept\n", " groups: ['Group 1', 'Group 2', 'Group 3']\n", " columns: 0:3\n", "\n", "To access the actual design matrix do 'np.array(this_obj)'" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dm = design_matrices(\"y1 ~ x + (1|g)\", data)\n", "dm.group" ] }, { "cell_type": "code", "execution_count": 35, "id": "bronze-ethnic", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Interceptx
01.0-0.351447
11.0-0.431129
21.0-0.412738
31.00.574337
41.01.561390
51.0-0.357137
61.0-0.453278
71.0-0.039111
81.01.882748
91.00.549615
\n", "
" ], "text/plain": [ " Intercept x\n", "0 1.0 -0.351447\n", "1 1.0 -0.431129\n", "2 1.0 -0.412738\n", "3 1.0 0.574337\n", "4 1.0 1.561390\n", "5 1.0 -0.357137\n", "6 1.0 -0.453278\n", "7 1.0 -0.039111\n", "8 1.0 1.882748\n", "9 1.0 0.549615" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dm.common.as_dataframe()" ] }, { "cell_type": "code", "execution_count": 36, "id": "material-strength", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[0, 1, 0],\n", " [1, 0, 0],\n", " [0, 1, 0],\n", " [0, 1, 0],\n", " [1, 0, 0],\n", " [0, 1, 0],\n", " [0, 0, 1],\n", " [0, 1, 0],\n", " [0, 1, 0],\n", " [0, 0, 1]])" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.array(dm.group)" ] }, { "cell_type": "markdown", "id": "accompanied-conviction", "metadata": {}, "source": [ "If we now use `(x|g)` we get both the group-specific slope, as well as the intercept, because it is included by default." ] }, { "cell_type": "code", "execution_count": 37, "id": "parliamentary-access", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "GroupEffectsMatrix with shape (10, 6)\n", "Terms: \n", " 1|g \n", " kind: intercept\n", " groups: ['Group 1', 'Group 2', 'Group 3']\n", " columns: 0:3\n", " x|g \n", " kind: numeric\n", " groups: ['Group 1', 'Group 2', 'Group 3']\n", " columns: 3:6\n", "\n", "To access the actual design matrix do 'np.array(this_obj)'" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dm = design_matrices(\"y1 ~ x + (x|g)\", data)\n", "dm.group" ] }, { "cell_type": "markdown", "id": "independent-sitting", "metadata": {}, "source": [ "We can access the different sub-matrices corresponding to each term as follows" ] }, { "cell_type": "code", "execution_count": 38, "id": "fifth-grass", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[0., 1., 0.],\n", " [1., 0., 0.],\n", " [0., 1., 0.],\n", " [0., 1., 0.],\n", " [1., 0., 0.],\n", " [0., 1., 0.],\n", " [0., 0., 1.],\n", " [0., 1., 0.],\n", " [0., 1., 0.],\n", " [0., 0., 1.]])" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dm.group[\"1|g\"] # same as before" ] }, { "cell_type": "code", "execution_count": 39, "id": "impressive-repeat", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[-0. , -0.351447 , -0. ],\n", " [-0.43112895, -0. , -0. ],\n", " [-0. , -0.41273793, -0. ],\n", " [ 0. , 0.57433652, 0. ],\n", " [ 1.56139001, 0. , 0. ],\n", " [-0. , -0.35713698, -0. ],\n", " [-0. , -0. , -0.45327849],\n", " [-0. , -0.03911089, -0. ],\n", " [ 0. , 1.88274763, 0. ],\n", " [ 0. , 0. , 0.54961543]])" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dm.group[\"x|g\"]" ] }, { "cell_type": "markdown", "id": "tutorial-movement", "metadata": {}, "source": [ "But as with the common part, this default intercept can be removed" ] }, { "cell_type": "code", "execution_count": 40, "id": "stainless-scottish", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "GroupEffectsMatrix with shape (10, 3)\n", "Terms: \n", " x|g \n", " kind: numeric\n", " groups: ['Group 1', 'Group 2', 'Group 3']\n", " columns: 0:3\n", "\n", "To access the actual design matrix do 'np.array(this_obj)'" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dm = design_matrices(\"y1 ~ x + (0 + x|g)\", data)\n", "dm.group # note there is no 1|x in the terms" ] }, { "cell_type": "markdown", "id": "great-timer", "metadata": {}, "source": [ "As a final remark, we note that the `expr` part of a group-specific can also be a categorical variable or an interaction. Similar rules apply for the `factor` part." ] }, { "cell_type": "markdown", "id": "complete-lebanon", "metadata": {}, "source": [ "## Evaluating new data" ] }, { "cell_type": "markdown", "id": "departmental-federation", "metadata": {}, "source": [ "Both `CommonEffectsMatrix` and `GroupEffectsMatrix` are provided with a method that receives a data frame, with the same structure as the original data frame, and returns a new design matrix.\n", "\n", "Suppose we have measured three variables, `y`, `x` and `g` on some subjects. `y` and `x` are numerics, while `g` is a categorical variable that represents some grouping. We also have a model with a common slope, and both varying intercepts and slopes, where we want to standardize the numeric predictor before entering the model. The formula would be `y ~ scale(x) + (scale(x)|g)`." ] }, { "cell_type": "code", "execution_count": 41, "id": "hairy-monster", "metadata": {}, "outputs": [], "source": [ "data = pd.DataFrame({\n", " \"y\": rng.normal(size=50),\n", " \"x\": rng.normal(loc=200, scale=20, size=50),\n", " \"g\": rng.choice([\"A\", \"B\", \"C\"], size=50)\n", "})" ] }, { "cell_type": "code", "execution_count": 42, "id": "bibliographic-impression", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CommonEffectsMatrix with shape (50, 2)\n", "Terms: \n", " Intercept \n", " kind: intercept\n", " column: 0\n", " scale(x) \n", " kind: numeric\n", " column: 1\n", "\n", "To access the actual design matrix do 'np.array(this_obj)'\n" ] } ], "source": [ "dm = design_matrices(\"y ~ scale(x) + (scale(x)|g)\", data)\n", "print(dm.common)" ] }, { "cell_type": "code", "execution_count": 43, "id": "hollywood-professional", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "GroupEffectsMatrix with shape (50, 6)\n", "Terms: \n", " 1|g \n", " kind: intercept\n", " groups: ['A', 'B', 'C']\n", " columns: 0:3\n", " scale(x)|g \n", " kind: numeric\n", " groups: ['A', 'B', 'C']\n", " columns: 3:6\n", "\n", "To access the actual design matrix do 'np.array(this_obj)'\n" ] } ], "source": [ "print(dm.group)" ] }, { "cell_type": "markdown", "id": "worst-pollution", "metadata": {}, "source": [ "And now suppose we have new data for a set of new subjects. We can derive both common and group-specific design matrices for these new individuals without having to specify the model again. We can instead use the `.evaluate_new_data()` method. This method takes a data frame and returns either a new `CommonEffectsMatrix` or a new `GroupEffectsMatrix` depending on which object the method was called." ] }, { "cell_type": "code", "execution_count": 44, "id": "transsexual-luxury", "metadata": {}, "outputs": [], "source": [ "data2 = pd.DataFrame({\n", " \"x\": rng.normal(loc=180, scale=20,size=10),\n", " \"g\": rng.choice([\"A\", \"B\", \"C\"], size=10)\n", "})" ] }, { "cell_type": "code", "execution_count": 45, "id": "assumed-header", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "CommonEffectsMatrix with shape (10, 2)\n", "Terms: \n", " Intercept \n", " kind: intercept\n", " column: 0\n", " scale(x) \n", " kind: numeric\n", " column: 1\n", "\n", "To access the actual design matrix do 'np.array(this_obj)'" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "common_new = dm.common.evaluate_new_data(data2)\n", "common_new" ] }, { "cell_type": "markdown", "id": "transsexual-objective", "metadata": {}, "source": [ "And if we explore the design matrix for the common terms we notice the scaling of the new data used the mean and standard deviation `x` in the first dataset." ] }, { "cell_type": "code", "execution_count": 46, "id": "activated-danish", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Interceptscale(x)
01.0-1.111751
11.0-1.113193
21.0-0.820436
31.0-2.196410
41.0-2.131558
51.0-1.244299
61.0-1.076445
71.0-0.859853
81.0-0.012279
91.0-0.351255
\n", "
" ], "text/plain": [ " Intercept scale(x)\n", "0 1.0 -1.111751\n", "1 1.0 -1.113193\n", "2 1.0 -0.820436\n", "3 1.0 -2.196410\n", "4 1.0 -2.131558\n", "5 1.0 -1.244299\n", "6 1.0 -1.076445\n", "7 1.0 -0.859853\n", "8 1.0 -0.012279\n", "9 1.0 -0.351255" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "common_new.as_dataframe()" ] }, { "cell_type": "markdown", "id": "honest-catholic", "metadata": {}, "source": [ "Something similar can be done with the group-specific part." ] }, { "cell_type": "code", "execution_count": 47, "id": "cloudy-extreme", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "GroupEffectsMatrix with shape (10, 6)\n", "Terms: \n", " 1|g \n", " kind: intercept\n", " groups: ['A', 'B', 'C']\n", " columns: 0:3\n", " scale(x)|g \n", " kind: numeric\n", " groups: ['A', 'B', 'C']\n", " columns: 3:6\n", "\n", "To access the actual design matrix do 'np.array(this_obj)'" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "group_new = dm.group.evaluate_new_data(data2)\n", "group_new" ] }, { "cell_type": "code", "execution_count": 48, "id": "signal-seventh", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[-0. , -1.11175056, -0. ],\n", " [-0. , -1.11319326, -0. ],\n", " [-0. , -0. , -0.82043629],\n", " [-2.19640995, -0. , -0. ],\n", " [-0. , -2.13155813, -0. ],\n", " [-1.24429866, -0. , -0. ],\n", " [-1.07644467, -0. , -0. ],\n", " [-0. , -0.85985319, -0. ],\n", " [-0.01227858, -0. , -0. ],\n", " [-0.35125497, -0. , -0. ]])" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "group_new[\"scale(x)|g\"]" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 5 }