
(patsy v0.4.1, python 3.5.0)

I would like to use patsy (ideally through statsmodels) to build a design matrix for regression.

The patsy-style formula that I would like to fit is

response ~ 0 + category

where category is a two-level categorical variable. The 0 + ... is supposed to indicate that I do not want the implicit intercept term.

The design matrix that I expect has a single column of zeros and ones indicating whether category is at the base level (0) or the other level (1).

The following code:

import pandas as pd
import patsy

df = pd.DataFrame({'category': ['A', 'B'] * 3})

patsy.dmatrix('0 + category', data=df)

Outputs:

DesignMatrix with shape (6, 2)
  category[A]  category[B]
            1            0
            0            1
            1            0
            0            1
            1            0
            0            1
  Terms:
    'category' (columns 0:2)

which is singular and not what I want.

When I instead run

import pandas as pd
import patsy

df = pd.DataFrame({'category': ['A', 'B'] * 3})

patsy.dmatrix('category', data=df)

the output is

DesignMatrix with shape (6, 2)
  Intercept  category[T.B]
          1              0
          1              1
          1              0
          1              1
          1              0
          1              1
  Terms:
    'Intercept' (column 0)
    'category' (column 1)

which is correct for a model that includes an intercept, but still not what I want.

Is the output without an intercept the intended behavior? If so, why? Am I just confused about how this design matrix is supposed to work with standard coding?

I know that I can edit the design matrix to make my regression work the way I intend, but if this is a bug I'd like to see it fixed in patsy.
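For reference, the kind of manual edit I have in mind can be sketched as follows. This is only a workaround under my own assumptions, not patsy's intended way of doing it: it builds the intercept-including matrix from above with `return_type='dataframe'` and then drops the `Intercept` column by name, leaving the single 0/1 column.

```python
import pandas as pd
import patsy

df = pd.DataFrame({'category': ['A', 'B'] * 3})

# Build the treatment-coded matrix (with intercept) as a pandas DataFrame,
# so that columns can be selected by name.
X = patsy.dmatrix('category', data=df, return_type='dataframe')

# Workaround (my assumption, not an official patsy feature):
# drop the intercept column, keeping only the indicator for level B.
X = X.drop(columns='Intercept')

print(X)
```

The result is a plain DataFrame with one column, `category[T.B]`, which would then have to be passed to the regression directly rather than through the formula interface.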

bsmith89
  • I also submitted an issue to pydata/patsy. – bsmith89 Mar 11 '16 at 17:00
  • This duplicates the second part of https://github.com/pydata/patsy/issues/60. Currently, patsy always provides a full set of columns, which is non-singular if all level combinations are present in the data. (A user-defined contrast should, or at least might, be able to work around this.) `0 +` or `- 1` means no *explicit* intercept; the intercept is still added implicitly. – Josef Mar 11 '16 at 18:19
  • @user333700, are you saying that formulae in patsy can still have an implicit intercept even if I include `0 +`? I wonder why that would be the case. – bsmith89 Mar 12 '16 at 00:35
  • Official answer: because this is by far the most common use case. (Unofficial answer: because of imitation of R, I guess. "Magic is better than explicit.") – Josef Mar 12 '16 at 00:44
