Edit: I now have a candidate solution to my question (see toy example below) -- if you can think of something more robust, please let me know.
I just found out about python's patsy package for creating design matrices from R-style formulas, and it looks great. My question is this: given a patsy formula, e.g. "(x1 + x2 + x3)**2", is there an easy way to create a design matrix containing the derivative with respect to a particular variable, e.g. "x1"?
Here's a toy example:
import numpy as np
import pandas as pd
import patsy
import sympy
import sympy.parsing.sympy_parser as sympy_parser
n_obs = 200
df = pd.DataFrame(np.random.uniform(size=(n_obs, 3)), columns=["x1", "x2", "x3"])
df.describe()
design_matrix = patsy.dmatrix("(I(7*x1) + x2 + x3)**2 + I(x1**2) + I(x1*x2*x3)", df)
design_matrix.design_info.column_names
## ['Intercept', 'I(7 * x1)', 'x2', 'x3', 'I(7 * x1):x2', 'I(7 * x1):x3', 'x2:x3', 'I(x1 ** 2)', 'I(x1 * x2 * x3)']
x1, x2, x3 = sympy.symbols("x1 x2 x3")
def diff_wrt_x1(string):
return str(sympy.diff(sympy_parser.parse_expr(string), x1))
colnames_to_differentiate = [colname.replace(":", "*").replace("Intercept", "1").replace("I", "")
for colname in design_matrix.design_info.column_names]
derivatives_wrt_x1 = [diff_wrt_x1(colname) for colname in colnames_to_differentiate]
def get_column(string):
try:
return float(string) * np.ones((len(df), 1)) # For cases like string == "7"
except ValueError:
return patsy.dmatrix("0 + I(%s)" % string, df) # For cases like string == "x2*x3"
derivative_columns = tuple(get_column(derivative_string) for derivative_string in derivatives_wrt_x1)
design_matrix_derivative = np.hstack(derivative_columns)
design_matrix_derivative[0] # Contains [0, 7, 0, 0, 7*x2, 7*x3, 0, 2*x1, x2*x3]
design_matrix_derivative_manual = np.zeros_like(design_matrix_derivative)
design_matrix_derivative_manual[:, 1] = 7.0
design_matrix_derivative_manual[:, 4] = 7*df["x2"]
design_matrix_derivative_manual[:, 5] = 7*df["x3"]
design_matrix_derivative_manual[:, 7] = 2*df["x1"]
design_matrix_derivative_manual[:, 8] = df["x2"] * df["x3"]
np.all(np.isclose(design_matrix_derivative, design_matrix_derivative_manual)) # True!
The code generates a design matrix with columns [1, 7*x1, x2, x3, 7*x1*x2, 7*x1*x3, x2*x3, x1^2, x1*x2*x3]
.
Suppose I want a new formula which differentiates design_matrix with respect to x1. The desired result is a matrix of the same shape as design_matrix
, but whose columns are [0, 7, 0, 0, 7*x2, 7*x3, 0, 2*x1, x2*x3]
. Is there a programmatic way to do that? I've tried searching the patsy docs as well as stackoverflow and I don't see anything. Of course I can create the derivative matrix manually, but it would be great to have a function that does it (and that doesn't have to be updated when I change the formula to, say, "(x1 + x2 + x3 + x4)**2 + I(x1**3)"
).