I'm using patsy to create design matrices, but I get strange behavior when None or NaN values are in the dataset. As shown below, instead of just dropping the row that contains None, patsy creates additional columns of 1s and 0s.
import numpy as np
import pandas as pd
import patsy as pt
df = pd.DataFrame(np.array([(1,3),(2,6),(4,2),(6,3)]), columns=['y','X'])
In[60]: df
Out[60]:
   y  X
0  1  3
1  2  6
2  4  2
3  6  3
In[61]: pt.dmatrices('y ~ X', df)
Out[61]:
(DesignMatrix with shape (4, 1)
   y
   1
   2
   4
   6
   Terms:
     'y' (column 0),
 DesignMatrix with shape (4, 2)
   Intercept  X
           1  3
           1  6
           1  2
           1  3
   Terms:
     'Intercept' (column 0)
     'X' (column 1))
In[62]: df = pd.DataFrame(np.array([(1,3),(2,6),(4,2),(6,None)]), columns=['y','X'])
In[63]: pt.dmatrices('y ~ X', df)
Out[63]:
(DesignMatrix with shape (3, 4)
   y[1]  y[2]  y[4]  y[6]
      1     0     0     0
      0     1     0     0
      0     0     1     0
   Terms:
     'y' (columns 0:4),
 DesignMatrix with shape (3, 3)
   Intercept  X[T.3]  X[T.6]
           1       1       0
           1       0       1
           1       0       0
   Terms:
     'Intercept' (column 0)
     'X' (columns 1:3))
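In case it helps with diagnosis, here is a minimal sketch that compares the column dtypes of the two frames. It rests on an assumption that is not confirmed anywhere above: that the dtype pandas assigns to each column influences how patsy codes it. The names `df_ok`, `df_none` and `df_num` are only illustrative, and the `pd.to_numeric` step is a guess at a possible workaround, not a verified fix.

```python
import numpy as np
import pandas as pd

# The same two frames as above, kept under separate (illustrative) names.
df_ok = pd.DataFrame(np.array([(1, 3), (2, 6), (4, 2), (6, 3)]), columns=['y', 'X'])
df_none = pd.DataFrame(np.array([(1, 3), (2, 6), (4, 2), (6, None)]), columns=['y', 'X'])

# Compare the dtypes pandas assigns to each column.
# A single None makes np.array build an object array, so both
# columns of df_none come out as object rather than integer.
print(df_ok.dtypes)
print(df_none.dtypes)

# Assumption: converting the columns to a numeric dtype may be relevant.
# pd.to_numeric turns the None into NaN, so 'X' becomes float64.
df_num = df_none.apply(pd.to_numeric)
print(df_num.dtypes)
```

If the dtype really is the cause, passing `df_num` to `pt.dmatrices('y ~ X', df_num)` would be the natural next thing to try, but I have not confirmed that.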
Why is patsy returning additional columns when I add a None value?