I'm using the patsy Python package. I have a boolean dependent (y) variable and some number of numerical explanatory variables. I'd like patsy to encode y as a single 0/1 column. However, even with a simple data frame and formula, patsy treats the boolean y as categorical and produces a two-column one-hot encoding. This causes problems downstream in sklearn, where certain classifiers need a single target column. Here is an example:
>>> import pandas as pd
>>> import patsy
>>> df = pd.DataFrame({"y": [True, False, True, True], "x": [1, 1, 3, 4]})
>>> df
       y  x
0   True  1
1  False  1
2   True  3
3   True  4
>>> patsy.dmatrices("y ~ x", df)
(DesignMatrix with shape (4, 2)
  y[False]  y[True]
         0        1
         1        0
         0        1
         0        1
  Terms:
    'y' (columns 0:2), DesignMatrix with shape (4, 2)
  Intercept  x
          1  1
          1  1
          1  3
          1  4
  Terms:
    'Intercept' (column 0), 'x' (column 1))
Note that the y matrix has two columns.
How can I get the result I want, a single column containing 1, 0, 1, 1, using patsy itself rather than converting the series to integers with numpy or pandas beforehand?
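For illustration, one possible approach that stays inside the formula is to wrap the response in patsy's `I()` and multiply it by 1, so that the column reaches patsy as a number rather than a boolean. This is only a sketch, and it assumes that patsy codes boolean data as categorical but integer data as numeric. It is not necessarily the idiomatic patsy answer, and the names `y_mat`, `X_mat` and `y_flat` are just placeholders.

```python
import numpy as np
import pandas as pd
import patsy

df = pd.DataFrame({"y": [True, False, True, True], "x": [1, 1, 3, 4]})

# I(...) makes patsy evaluate "y * 1" as ordinary Python arithmetic.
# A boolean Series times 1 is an integer Series, which patsy treats as
# numeric, so the left-hand side becomes a single column.
y_mat, X_mat = patsy.dmatrices("I(y * 1) ~ x", df)

# Flatten to a 1-D array, the shape sklearn classifiers usually expect for y.
y_flat = np.asarray(y_mat).ravel()

print(y_mat.shape)
print(y_flat)
```

The conversion still happens, but it is expressed in the formula rather than applied to the data frame beforehand, so the same formula string can be reused on new data.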