I'm using the patsy Python package. I have a boolean dependent variable (y) and some number of numerical explanatory variables. I'd like patsy to treat y as a categorical variable and encode the boolean data as a single 0/1 column. However, even with a simple data frame and formula, it always produces two columns, one per level. This causes problems downstream in sklearn, where certain classifiers need a single column. Here is an example:

>>> import pandas as pd
>>> import patsy
>>> df = pd.DataFrame({"y": [True, False, True, True], "x": [1, 1, 3, 4]})
>>> df
       y  x
0   True  1
1  False  1
2   True  3
3   True  4
>>> patsy.dmatrices("y ~ x", df)
(DesignMatrix with shape (4, 2)
y[False]  y[True]
       0        1
       1        0
       0        1
       0        1
Terms:
'y' (columns 0:2), DesignMatrix with shape (4, 2)
Intercept  x
        1  1
        1  1
        1  3
        1  4
Terms:
'Intercept' (column 0), 'x' (column 1))

Note how the y matrix has two columns.

How can I produce the result I want, which is simply 1, 0, 1, 1, using patsy itself rather than converting the series to an integer with numpy or pandas?

Migwell

1 Answer

I'm not sure whether a solution is still needed, and this is a hacky approach, but you can use patsy's categorical_to_int() function. It's a helper in patsy.categorical that patsy's own design-matrix building functions rely on.

You just need to call the function inside the formula, with 3 positional arguments:

  • data (in your case, the y column),
  • a tuple of the unique levels, listed in the order of the integer codes you want (False first so it maps to 0, then True so it maps to 1),
  • and the required instance of the NAAction class.

Note: The function will map any missing values to -1.
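
To make the mapping described above concrete, here is a pure-Python sketch of the idea. It is not patsy's implementation: levels_to_int is a hypothetical name, the missing-value check (None or NaN) is a simplifying assumption, and raising ValueError for an unknown value is a choice made for this sketch.

```python
import math


def levels_to_int(data, levels):
    """Hypothetical stand-in illustrating the level -> integer mapping.

    Each value is replaced by the position of its level in `levels`;
    missing values (assumed here to be None or NaN) become -1.
    """
    codes = {level: i for i, level in enumerate(levels)}
    out = []
    for value in data:
        is_missing = value is None or (
            isinstance(value, float) and math.isnan(value)
        )
        if is_missing:
            out.append(-1)
        elif value in codes:
            out.append(codes[value])
        else:
            # Behaviour chosen for this sketch only.
            raise ValueError(f"{value!r} is not one of the levels {levels!r}")
    return out


result = levels_to_int([True, False, None, True], (False, True))
print(result)  # [1, 0, -1, 1]
```

Swapping the tuple to (True, False) would flip the codes, which is why the order of the levels matters.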

import pandas as pd
import patsy
from patsy.categorical import categorical_to_int
from patsy.missing import NAAction

df = pd.DataFrame({"y": [True, False, True, True], "x": [1, 1, 3, 4]})

patsy.dmatrices("categorical_to_int(y, (False, True), NAAction()) ~ x", df)

Output:

(DesignMatrix with shape (4, 1)
   categorical_to_int(y, (False, True), NAAction())
                                                  1
                                                  0
                                                  1
                                                  1
   Terms:
     'categorical_to_int(y, (False, True), NAAction())' (column 0),
 DesignMatrix with shape (4, 2)
   Intercept  x
           1  1
           1  1
           1  3
           1  4
   Terms:
     'Intercept' (column 0)
     'x' (column 1))
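
Since the question mentions classifiers that need a single column, a possible follow-up step is to flatten the (4, 1) design matrix into a 1-D array. This is only a sketch: the names y_flat, y and X are my own, and it assumes pandas, numpy and patsy are installed.

```python
import numpy as np
import pandas as pd
import patsy

# These two names are looked up by patsy when it evaluates the formula string.
from patsy.categorical import categorical_to_int
from patsy.missing import NAAction

df = pd.DataFrame({"y": [True, False, True, True], "x": [1, 1, 3, 4]})

y, X = patsy.dmatrices(
    "categorical_to_int(y, (False, True), NAAction()) ~ x", df
)

# DesignMatrix is a numpy.ndarray subclass, so the usual array tools apply.
# ravel() turns the (4, 1) column into shape (4,); astype(int) drops the
# float dtype that design matrices use.
y_flat = np.asarray(y).ravel().astype(int)

print(y_flat.shape)
print(y_flat.tolist())
```

The flattened y_flat can then be passed as the target alongside X.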
AlexK