Getting started#

formulae requires a working Python interpreter (3.7+) and the libraries NumPy, SciPy and pandas, with the versions specified in the requirements.txt file.

Assuming a standard Python environment is installed on your machine (including pip), the latest release of formulae can be installed in one line using pip:

pip install formulae

Alternatively, if you want the development version of the package, you can install it from GitHub:

pip install git+https://github.com/bambinos/formulae.git

User guide#

The main function is design_matrices(). It takes a model formula and a pandas DataFrame and returns an object of class DesignMatrices that contains information about the response, the common effects, and the group-specific effects, which can be accessed with the attributes .response, .common, and .group, respectively.

Model formulas work much like the well-known model formulas in the R programming language. The extension to group-specific effects is similar to the one provided by the lme4 package, also in R.

Setup#

[1]:
import numpy as np
import pandas as pd

from formulae import design_matrices

Let’s simulate some data to use throughout the examples. The number of observations isn’t too important; we keep it low to make it easier to see what is going on. However, we include both numeric and categorical variables to see how formulae handles these types in different scenarios.

[2]:
rng = np.random.default_rng(7355608)
SIZE = 10
data = pd.DataFrame(
    {
        "y1": rng.normal(size=SIZE),
        "y2": rng.choice(["A", "B", "C"], size=SIZE),
        "x": rng.normal(size=SIZE),
        "z": rng.choice([1, 2, 3], size=SIZE),
        "g": rng.choice(["Group 1", "Group 2", "Group 3"], size=SIZE),
    }
)
data
[2]:
y1 y2 x z g
0 0.200521 C -0.351447 3 Group 2
1 -0.421306 B -0.431129 3 Group 1
2 -0.490128 B -0.412738 2 Group 2
3 -0.929507 B 0.574337 3 Group 2
4 1.515838 A 1.561390 3 Group 1
5 -1.180238 A -0.357137 1 Group 2
6 -1.839936 C -0.453278 3 Group 3
7 -0.330336 B -0.039111 3 Group 2
8 -0.797019 A 1.882748 2 Group 2
9 -0.030036 A 0.549615 3 Group 3

Creating and accessing design matrices#

Let’s create a DesignMatrices object with a numeric response and a numeric predictor. The first argument we pass is the model formula, and the second is the data frame the variables are taken from.

[3]:
dm = design_matrices("y1 ~ x", data)

Under dm.common we find an object of class CommonEffectsMatrix which stores the design matrix for the common part of the model, as well as other information about the terms in the matrix.

[4]:
dm.common
[4]:
CommonEffectsMatrix with shape (10, 2)
Terms:
  Intercept
    kind: intercept
    column: 0
  x
    kind: numeric
    column: 1

To access the actual design matrix do 'np.array(this_obj)'

Use .design_matrix to obtain the design matrix for the common part of the model.

[5]:
dm.common.design_matrix
[5]:
array([[ 1.        , -0.351447  ],
       [ 1.        , -0.43112895],
       [ 1.        , -0.41273793],
       [ 1.        ,  0.57433652],
       [ 1.        ,  1.56139001],
       [ 1.        , -0.35713698],
       [ 1.        , -0.45327849],
       [ 1.        , -0.03911089],
       [ 1.        ,  1.88274763],
       [ 1.        ,  0.54961543]])

Or np.array(dm.common)

[6]:
np.array(dm.common)
[6]:
array([[ 1.        , -0.351447  ],
       [ 1.        , -0.43112895],
       [ 1.        , -0.41273793],
       [ 1.        ,  0.57433652],
       [ 1.        ,  1.56139001],
       [ 1.        , -0.35713698],
       [ 1.        , -0.45327849],
       [ 1.        , -0.03911089],
       [ 1.        ,  1.88274763],
       [ 1.        ,  0.54961543]])

Or call the .as_dataframe() method to return a data frame.

[7]:
dm.common.as_dataframe().head()
[7]:
Intercept x
0 1.0 -0.351447
1 1.0 -0.431129
2 1.0 -0.412738
3 1.0 0.574337
4 1.0 1.561390

Note that the response term can be omitted and we still obtain the design matrix for the common predictors.

[8]:
dm = design_matrices("x", data)
dm.common.as_dataframe().head()
[8]:
Intercept x
0 1.0 -0.351447
1 1.0 -0.431129
2 1.0 -0.412738
3 1.0 0.574337
4 1.0 1.561390

Categorical predictors#

Variables with a categorical type in the data frame are recognized and handled as such. If there’s a numerical variable that should be interpreted as categorical and you don’t want to manipulate your data manually, you can wrap the variable name with C(), as in C(variable), and formulae will convert it to the categorical type.

Categorical predictors are encoded using reference-level encoding (i.e. the first level is taken as the baseline). This way, the resulting design matrix is of full rank.

[9]:
dm = design_matrices("g", data)
dm.common.as_dataframe().head()
[9]:
Intercept g[Group 2] g[Group 3]
0 1.0 1.0 0.0
1 1.0 0.0 0.0
2 1.0 1.0 0.0
3 1.0 1.0 0.0
4 1.0 0.0 0.0

But if there is no intercept term, reference encoding is not needed to avoid linear dependencies between the columns, and formulae uses cell-means encoding for the categorical variables instead.

[10]:
dm = design_matrices("0 + g", data)
dm.common.as_dataframe().head()
[10]:
g[Group 1] g[Group 2] g[Group 3]
0 0 1 0
1 1 0 0
2 0 1 0
3 0 1 0
4 1 0 0

Suppose that z actually represents a categorization encoded with numbers. We can use "y1 ~ C(z)" and z will be interpreted internally as categorical instead of numeric.

[11]:
dm = design_matrices("C(z)", data)
dm.common.as_dataframe().head()
[11]:
Intercept C(z)[2] C(z)[3]
0 1.0 0.0 1.0
1 1.0 0.0 1.0
2 1.0 1.0 0.0
3 1.0 0.0 1.0
4 1.0 0.0 1.0

By default, the C() wrapper takes the first level as the reference level (after the levels are sorted with sorted()), but we can override this by passing a list with a custom ordering.

[12]:
# 2 is taken as the baseline, followed by 3 and 1.
lvl = [2, 3, 1]
dm = design_matrices("C(z, levels=lvl)", data)
dm.common.as_dataframe().head()
[12]:
Intercept C(z, levels = lvl)[3] C(z, levels = lvl)[1]
0 1.0 1.0 0.0
1 1.0 1.0 0.0
2 1.0 0.0 0.0
3 1.0 1.0 0.0
4 1.0 1.0 0.0

Finally, if we want to convert a variable into a binary variable, we can use the built-in binary() function:

[13]:
dm = design_matrices("binary(z, 2)", data)
dm.common.as_dataframe().head()
[13]:
Intercept binary(z, 2)
0 1 0
1 1 0
2 1 1
3 1 0
4 1 0

which also works for categorical variables:

[14]:
dm = design_matrices("binary(g, 'Group 1')", data)
dm.common.as_dataframe().head()
[14]:
Intercept binary(g, Group 1)
0 1 0
1 1 1
2 1 0
3 1 0
4 1 1

Adding interactions#

The interaction operator is :.

[15]:
dm = design_matrices("x:z", data)
dm.common.as_dataframe().head()
[15]:
Intercept x:z
0 1.0 -1.054341
1 1.0 -1.293387
2 1.0 -0.825476
3 1.0 1.723010
4 1.0 4.684170

The * operator is known as the full interaction operator because x*z is equivalent to x + z + x:z.

[16]:
dm = design_matrices("x*z", data)
dm.common.as_dataframe().head()
[16]:
Intercept x z x:z
0 1.0 -0.351447 3.0 -1.054341
1 1.0 -0.431129 3.0 -1.293387
2 1.0 -0.412738 2.0 -0.825476
3 1.0 0.574337 3.0 1.723010
4 1.0 1.561390 3.0 4.684170

And both interaction operators can be used with mixtures of numerical and categorical variables.

[17]:
dm = design_matrices("0 + x*g", data)
dm.common.as_dataframe().head()
[17]:
x g[Group 1] g[Group 2] g[Group 3] x:g[Group 2] x:g[Group 3]
0 -0.351447 0.0 1.0 0.0 -0.351447 -0.0
1 -0.431129 1.0 0.0 0.0 -0.000000 -0.0
2 -0.412738 0.0 1.0 0.0 -0.412738 -0.0
3 0.574337 0.0 1.0 0.0 0.574337 0.0
4 1.561390 1.0 0.0 0.0 0.000000 0.0

Function calls#

A term does not need to be just a variable in a data frame. It can also be the result of a function call. formulae allows you to use built-in functions as well as functions available in the environment from which design_matrices is called.

Since we’ve loaded the NumPy library as np, we can access its namespace as we would do in regular Python code.

[18]:
dm = design_matrices("np.exp(x)", data)
dm.common.as_dataframe().head()
[18]:
Intercept np.exp(x)
0 1.0 0.703669
1 1.0 0.649775
2 1.0 0.661836
3 1.0 1.775952
4 1.0 4.765441

We can also use our own custom functions

[19]:
def add(x, how_much):
    return x + how_much

dm = design_matrices("add(x, 10)", data)
dm.common.as_dataframe().head()
[19]:
Intercept add(x, 10)
0 1.0 9.648553
1 1.0 9.568871
2 1.0 9.587262
3 1.0 10.574337
4 1.0 11.561390

Or even nested function calls!

[20]:
dm = design_matrices("add(np.exp(x), 5)", data)
dm.common.as_dataframe().head()
[20]:
Intercept add(np.exp(x), 5)
0 1.0 5.703669
1 1.0 5.649775
2 1.0 5.661836
3 1.0 6.775952
4 1.0 9.765441

Built-in transformations#

formulae comes with some built-in transformations that come in very handy when doing statistical modeling. These functions are:

  • center(): Centers the variable around its mean.

  • scale(): Centers the variable around its mean and divides it by its standard deviation.

  • standardize(): An alias for scale().

  • bs(): Create a Basis Spline, a.k.a. B-Spline.

[21]:
dm = design_matrices("scale(x)", data)
dm.common.as_dataframe().head()
[21]:
Intercept scale(x)
0 1.0 -0.732600
1 1.0 -0.829284
2 1.0 -0.806969
3 1.0 0.390720
4 1.0 1.588384
[22]:
# check mean is 0 and std is 1
print(np.std(dm.common.as_dataframe()["scale(x)"]))
print(np.mean(dm.common.as_dataframe()["scale(x)"]))
1.0
-2.2204460492503132e-17
[23]:
dm = design_matrices("bs(x, df=6)", data)
dm.common.as_dataframe()
[23]:
Intercept bs(x, df = 6)[0] bs(x, df = 6)[1] bs(x, df = 6)[2] bs(x, df = 6)[3] bs(x, df = 6)[4] bs(x, df = 6)[5]
0 1.0 0.281092 0.655237 0.063433 0.000237 0.000000 0.000000e+00
1 1.0 0.703742 0.086820 0.000757 0.000000 0.000000 0.000000e+00
2 1.0 0.737838 0.240873 0.004644 0.000000 0.000000 0.000000e+00
3 1.0 0.000000 0.000000 0.359380 0.502344 0.138276 1.039084e-07
4 1.0 0.000000 0.000000 0.005325 0.092999 0.470374 4.313026e-01
5 1.0 0.312950 0.631876 0.055013 0.000161 0.000000 0.000000e+00
6 1.0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000e+00
7 1.0 0.000000 0.296984 0.611161 0.090700 0.001155 0.000000e+00
8 1.0 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000e+00
9 1.0 0.000000 0.000008 0.380123 0.494494 0.125375 0.000000e+00

These transformations are known as stateful transformations because they remember the values of the parameters involved in the original transformation (here, the mean and the standard deviation). This feature is critical when one wants to derive a design matrix from new data, as required when evaluating a fitted model on a new dataset.

Suppressing the default intercept#

By default, formulae includes an intercept in the resulting design matrix, but this can be omitted. Just add a 0 to the model formula, as in the following example.

[24]:
dm = design_matrices("0 + x", data)
dm.common.as_dataframe().head()
[24]:
x
0 -0.351447
1 -0.431129
2 -0.412738
3 0.574337
4 1.561390

"y1 ~ -1 + x" and "y1 ~ x - 1" are equivalent alternatives.

Operations between terms#

It is possible someone wants to use a term that is the result of adding two other terms, such as x + z. However, if we write "y ~ x + z", formulae will understand we want a model with two predictor terms, x and z. Fortunately, there are two (equivalent) ways of telling formulae that we want a term that is the result of performing operations between other terms: the I() wrapper and its shorthand {}.

[25]:
dm = design_matrices("I(x + z)", data)
dm.common.as_dataframe().head()
[25]:
Intercept I(x + z)
0 1.0 2.648553
1 1.0 2.568871
2 1.0 1.587262
3 1.0 3.574337
4 1.0 4.561390
[26]:
dm = design_matrices("{x / z}", data)
dm.common.as_dataframe().head()
[26]:
Intercept I(x / z)
0 1.0 -0.117149
1 1.0 -0.143710
2 1.0 -0.206369
3 1.0 0.191446
4 1.0 0.520463

As you may have noticed, the output shows I() instead of {}. That’s because {} is translated into I() internally.

Categorical responses#

formulae can work with categorical responses with two or more levels. When the variable contains only two levels, formulae flags it as binary.

Recall y2 is a categorical variable with levels A, B and C. In this case, formulae sets the binary flag to False since the variable contains more than two categories.

[27]:
dm = design_matrices("y2 ~ x", data)
dm.response
[27]:
ResponseVector
  name: y2
  kind: categoric
  length: 10
  binary: False
  levels: ['A', 'B', 'C']

To access the actual design vector do 'np.array(this_obj)'

The .design_vector array is a dummy-based design matrix where each column represents a level of the response.

[28]:
dm.response.design_vector
[28]:
array([[0, 0, 1],
       [0, 1, 0],
       [0, 1, 0],
       [0, 1, 0],
       [1, 0, 0],
       [1, 0, 0],
       [0, 0, 1],
       [0, 1, 0],
       [1, 0, 0],
       [1, 0, 0]])
[29]:
np.array(dm.response)
[29]:
array([[0, 0, 1],
       [0, 1, 0],
       [0, 1, 0],
       [0, 1, 0],
       [1, 0, 0],
       [1, 0, 0],
       [0, 0, 1],
       [0, 1, 0],
       [1, 0, 0],
       [1, 0, 0]])

It is also possible to override the default behavior and indicate the level of interest with some syntactic sugar. If you write response[level], formulae will interpret response as a binary variable and treat the level category as the success (the event coded as 1). For example:

[30]:
dm = design_matrices("y2[B] ~ x", data)
dm.response
[30]:
ResponseVector
  name: y2
  kind: categoric
  length: 10
  binary: True
  success: B

To access the actual design vector do 'np.array(this_obj)'
[31]:
# 'y2' is equal to B in four cases
np.array(dm.response)
[31]:
array([0, 1, 1, 1, 0, 0, 0, 1, 0, 0])

If the level of interest contains spaces, we can wrap it in quotes:

[32]:
dm = design_matrices("g['Group 3'] ~ x", data)
dm.response
[32]:
ResponseVector
  name: g
  kind: categoric
  length: 10
  binary: True
  success: Group 3

To access the actual design vector do 'np.array(this_obj)'

As with the predictors, the binary() function can be used on the left-hand side as well:

[33]:
dm = design_matrices("binary(g, 'Group 3') ~ x", data)
dm.response
[33]:
ResponseVector
  name: binary(g, Group 3)
  kind: numeric
  length: 10

To access the actual design vector do 'np.array(this_obj)'

Note that binary() outputs a numeric variable, so the response is recognized as numeric instead of categoric.

Group-specific effects#

formulae also handles group-specific effects. They are specified using the pipe operator |, similar to the lme4 R package. These terms are of the form (expr|factor), where expr represents the effect and factor the group. This vignette from lme4 provides much more information on how this notation works. In a nutshell, group-specific intercepts are specified via (1|group) and group-specific slopes via (variable|group). The latter also adds a group-specific intercept by default, which can be overridden with (0 + variable|group).

First, let’s see a formula specifying a common predictor, x, and a group-specific intercept for each level of g.

[34]:
dm = design_matrices("y1 ~ x + (1|g)", data)
dm.group
[34]:
GroupEffectsMatrix with shape (10, 3)
Terms:
  1|g
    kind: intercept
    groups: ['Group 1', 'Group 2', 'Group 3']
    columns: 0:3

To access the actual design matrix do 'np.array(this_obj)'
[35]:
dm.common.as_dataframe()
[35]:
Intercept x
0 1.0 -0.351447
1 1.0 -0.431129
2 1.0 -0.412738
3 1.0 0.574337
4 1.0 1.561390
5 1.0 -0.357137
6 1.0 -0.453278
7 1.0 -0.039111
8 1.0 1.882748
9 1.0 0.549615
[36]:
np.array(dm.group)
[36]:
array([[0, 1, 0],
       [1, 0, 0],
       [0, 1, 0],
       [0, 1, 0],
       [1, 0, 0],
       [0, 1, 0],
       [0, 0, 1],
       [0, 1, 0],
       [0, 1, 0],
       [0, 0, 1]])

If we now use (x|g) we get both the group-specific slope and the group-specific intercept, because the intercept is included by default.

[37]:
dm = design_matrices("y1 ~ x + (x|g)", data)
dm.group
[37]:
GroupEffectsMatrix with shape (10, 6)
Terms:
  1|g
    kind: intercept
    groups: ['Group 1', 'Group 2', 'Group 3']
    columns: 0:3
  x|g
    kind: numeric
    groups: ['Group 1', 'Group 2', 'Group 3']
    columns: 3:6

To access the actual design matrix do 'np.array(this_obj)'

We can access the sub-matrices corresponding to each term as follows:

[38]:
dm.group["1|g"] # same as before
[38]:
array([[0., 1., 0.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])
[39]:
dm.group["x|g"]
[39]:
array([[-0.        , -0.351447  , -0.        ],
       [-0.43112895, -0.        , -0.        ],
       [-0.        , -0.41273793, -0.        ],
       [ 0.        ,  0.57433652,  0.        ],
       [ 1.56139001,  0.        ,  0.        ],
       [-0.        , -0.35713698, -0.        ],
       [-0.        , -0.        , -0.45327849],
       [-0.        , -0.03911089, -0.        ],
       [ 0.        ,  1.88274763,  0.        ],
       [ 0.        ,  0.        ,  0.54961543]])

But as with the common part, this default intercept can be removed:

[40]:
dm = design_matrices("y1 ~ x + (0 + x|g)", data)
dm.group # note there is no 1|g in the terms
[40]:
GroupEffectsMatrix with shape (10, 3)
Terms:
  x|g
    kind: numeric
    groups: ['Group 1', 'Group 2', 'Group 3']
    columns: 0:3

To access the actual design matrix do 'np.array(this_obj)'

As a final remark, we note that the expr part of a group-specific term can also be a categorical variable or an interaction. Similar rules apply to the factor part.

Evaluating new data#

Both CommonEffectsMatrix and GroupEffectsMatrix provide a method that receives a data frame with the same structure as the original one and returns a new design matrix.

Suppose we have measured three variables, y, x and g, on some subjects. y and x are numeric, while g is a categorical variable that represents some grouping. We also have a model with a common slope and both varying intercepts and slopes, where we want to standardize the numeric predictor before entering it into the model. The formula would be y ~ scale(x) + (scale(x)|g).

[41]:
data = pd.DataFrame({
    "y": rng.normal(size=50),
    "x": rng.normal(loc=200, scale=20, size=50),
    "g": rng.choice(["A", "B", "C"], size=50)
})
[42]:
dm = design_matrices("y ~ scale(x) + (scale(x)|g)", data)
print(dm.common)
CommonEffectsMatrix with shape (50, 2)
Terms:
  Intercept
    kind: intercept
    column: 0
  scale(x)
    kind: numeric
    column: 1

To access the actual design matrix do 'np.array(this_obj)'
[43]:
print(dm.group)
GroupEffectsMatrix with shape (50, 6)
Terms:
  1|g
    kind: intercept
    groups: ['A', 'B', 'C']
    columns: 0:3
  scale(x)|g
    kind: numeric
    groups: ['A', 'B', 'C']
    columns: 3:6

To access the actual design matrix do 'np.array(this_obj)'

And now suppose we have new data for a set of new subjects. We can derive both the common and group-specific design matrices for these new individuals without having to specify the model again. Instead, we use the .evaluate_new_data() method. It takes a data frame and returns either a new CommonEffectsMatrix or a new GroupEffectsMatrix, depending on the object it is called on.

[44]:
data2 = pd.DataFrame({
    "x": rng.normal(loc=180, scale=20,size=10),
    "g": rng.choice(["A", "B", "C"], size=10)
})
[45]:
common_new = dm.common.evaluate_new_data(data2)
common_new
[45]:
CommonEffectsMatrix with shape (10, 2)
Terms:
  Intercept
    kind: intercept
    column: 0
  scale(x)
    kind: numeric
    column: 1

To access the actual design matrix do 'np.array(this_obj)'

And if we explore the design matrix for the common terms, we notice the scaling of the new data used the mean and standard deviation of x from the first dataset.

[46]:
common_new.as_dataframe()
[46]:
Intercept scale(x)
0 1.0 -1.111751
1 1.0 -1.113193
2 1.0 -0.820436
3 1.0 -2.196410
4 1.0 -2.131558
5 1.0 -1.244299
6 1.0 -1.076445
7 1.0 -0.859853
8 1.0 -0.012279
9 1.0 -0.351255

Something similar can be done with the group-specific part.

[47]:
group_new = dm.group.evaluate_new_data(data2)
group_new
[47]:
GroupEffectsMatrix with shape (10, 6)
Terms:
  1|g
    kind: intercept
    groups: ['A', 'B', 'C']
    columns: 0:3
  scale(x)|g
    kind: numeric
    groups: ['A', 'B', 'C']
    columns: 3:6

To access the actual design matrix do 'np.array(this_obj)'
[48]:
group_new["scale(x)|g"]
[48]:
array([[-0.        , -1.11175056, -0.        ],
       [-0.        , -1.11319326, -0.        ],
       [-0.        , -0.        , -0.82043629],
       [-2.19640995, -0.        , -0.        ],
       [-0.        , -2.13155813, -0.        ],
       [-1.24429866, -0.        , -0.        ],
       [-1.07644467, -0.        , -0.        ],
       [-0.        , -0.85985319, -0.        ],
       [-0.01227858, -0.        , -0.        ],
       [-0.35125497, -0.        , -0.        ]])