I'm using patsy to fit regressions with statsmodels through the formula API.

My problem is that my design matrix is singular because patsy creates (locally?) redundant interactions of categoricals.

import patsy
import pandas as pd
# DataFrame.from_items was removed in pandas 1.0; a dict keeps the same column order.
data = {'y': [2, 5, 6],
        'c1': ['a', 'a', 'b'],
        'c2': ['g', 'f', 'g']}
df = pd.DataFrame(data)
formula = "y ~ C(c1):C(c2) - 1"
y, X = patsy.dmatrices(formula, df, return_type='dataframe')
print(X)

   C(c1)[a]:C(c2)[f]  C(c1)[b]:C(c2)[f]  C(c1)[a]:C(c2)[g]  C(c1)[b]:C(c2)[g]
0                0.0                0.0                1.0                0.0
1                1.0                0.0                0.0                0.0
2                0.0                0.0                0.0                1.0

I would like to exclude the second column, since c1 never takes the value b when c2 has the value f, which leaves that column all zeros.

Artturi Björk

1 Answer

Patsy interprets C(c1):C(c2) as meaning "I want to estimate the effect of every combination of c1 and c2 values". If some of those combinations don't appear in your data, then they can't be estimated, so giving you a singular matrix at least points out the problem...
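To see the singularity concretely, here is a minimal sketch that checks the rank of the design matrix from the question. The array is copied by hand from the printed output rather than rebuilt with patsy, and the variable names (rank, n_cols, empty_cols) are only illustrative:

```python
import numpy as np

# Design matrix copied from the question's output.
# Column order: a:f, b:f, a:g, b:g
X = np.array([[0.0, 0.0, 1.0, 0.0],
              [1.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])

rank = np.linalg.matrix_rank(X)
n_cols = X.shape[1]

# Indices of columns that are zero in every row (combinations never observed).
empty_cols = np.flatnonzero(~X.any(axis=0))

print("rank:", rank, "columns:", n_cols, "all-zero columns:", empty_cols)
```

The rank is lower than the number of columns, and the all-zero column is the b:f one, so there is no information in the data to estimate that coefficient.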

If you want to estimate effects for just the combinations that exist, one easy way is to make a new variable that takes on a different value for each combination of c1 and c2. This works because patsy will then infer that the set of possible values is exactly the set of values that actually appear; it has no way to know that b.f could have happened:

In [1]: df["c1_and_c2"] = df["c1"] + "." + df["c2"]

In [2]: patsy.dmatrix("c1_and_c2 - 1", df)
Out[2]: 
DesignMatrix with shape (3, 3)
  c1_and_c2[a.f]  c1_and_c2[a.g]  c1_and_c2[b.g]
               0               1               0
               1               0               0
               0               0               1
  Terms:
    'c1_and_c2' (columns 0:3)
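
If you would rather keep the original interaction column names, a different approach (not the one shown above, just a sketch) is to build the full interaction matrix and then drop the columns that are zero in every row. The name X_observed is made up for this example, and the sketch assumes a pandas version where the DataFrame is built from a dict:

```python
import pandas as pd
import patsy

df = pd.DataFrame({"y": [2, 5, 6],
                   "c1": ["a", "a", "b"],
                   "c2": ["g", "f", "g"]})

# Full interaction, as in the question: includes the unobserved b:f cell.
y, X = patsy.dmatrices("y ~ C(c1):C(c2) - 1", df, return_type="dataframe")

# Keep only the columns that have at least one nonzero entry.
X_observed = X.loc[:, (X != 0).any(axis=0)]

print(list(X_observed.columns))
```

One caveat: the pruned DataFrame is no longer tied to a patsy formula, so it would have to be passed to an array-style interface such as sm.OLS(y, X_observed) rather than to the formula API.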
Nathaniel J. Smith