I'm using patsy to create design matrices, but I get strange behavior when None or NaN values are in the dataset. As shown below, instead of just dropping the row that contains None, it creates additional columns of 1s and 0s.

import numpy as np
import pandas as pd
import patsy as pt

df = pd.DataFrame(np.array([(1,3),(2,6),(4,2),(6,3)]), columns=['y','X'])
In[60]: df
Out[60]: 
   y  X
0  1  3
1  2  6
2  4  2
3  6  3
In[61]: pt.dmatrices('y ~ X', df)
Out[61]: 
(DesignMatrix with shape (4, 1)
   y
   1
   2
   4
   6
   Terms:
     'y' (column 0),
 DesignMatrix with shape (4, 2)
   Intercept  X
           1  3
           1  6
           1  2
           1  3
   Terms:
     'Intercept' (column 0)
     'X' (column 1))
In[62]: df = pd.DataFrame(np.array([(1,3),(2,6),(4,2),(6,None)]), columns=['y','X'])
In[63]: pt.dmatrices('y ~ X', df)
Out[63]: 
(DesignMatrix with shape (3, 4)
   y[1]  y[2]  y[4]  y[6]
      1     0     0     0
      0     1     0     0
      0     0     1     0
   Terms:
     'y' (columns 0:4),
 DesignMatrix with shape (3, 3)
   Intercept  X[T.3]  X[T.6]
           1       1       0
           1       0       1
           1       0       0
   Terms:
     'Intercept' (column 0)
     'X' (columns 1:3))

Why is patsy returning additional columns when I add a None value?

rsgmon

1 Answer

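If I understood correctly, NumPy does not treat None as NaN: when the input contains None, np.array falls back to dtype object for the whole array, so pandas stores both columns as object. Since they are no longer numeric columns, patsy treats them as categorical variables and builds indicator columns for each level, which is why both y and X are expanded.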
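A quick way to check this is to look at the dtypes before calling patsy. The snippet below is only a sketch using the data from the question; the variable names `rows`, `arr`, and `df_obj` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Same data as in the question, with None in the last row.
rows = [(1, 3), (2, 6), (4, 2), (6, None)]

# None cannot be stored in an integer or float array,
# so NumPy falls back to dtype=object for the whole array.
arr = np.array(rows)
print(arr.dtype)

# pandas keeps the object dtype for both columns.
df_obj = pd.DataFrame(arr, columns=['y', 'X'])
print(df_obj.dtypes)
```

If both columns are reported as object, patsy will treat them as categorical.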

You can either skip np.array and construct the DataFrame directly from the list:

df = pd.DataFrame([(1,3),(2,6),(4,2),(6,None)], columns=['y','X'])

Or you can just pass np.nan instead of None:

df = pd.DataFrame(np.array([(1,3),(2,6),(4,2),(6,np.nan)]), columns=['y','X'])

Result will be:

(DesignMatrix with shape (3L, 1L)
   y
   1
   2
   4
   Terms:
     'y' (column 0),
 DesignMatrix with shape (3L, 2L)
   Intercept  X
           1  3
           1  6
           1  2
   Terms:
     'Intercept' (column 0)
     'X' (column 1))
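If the DataFrame already exists with object columns (for example, it was built elsewhere), a third option is to convert the columns back to numeric dtypes before handing it to patsy. This is a sketch, not part of the original fix; `df_obj` and `fixed` are illustrative names.

```python
import numpy as np
import pandas as pd

# Reproduce the problematic frame from the question: both columns are object.
df_obj = pd.DataFrame(np.array([(1, 3), (2, 6), (4, 2), (6, None)]),
                      columns=['y', 'X'])

# Convert each column to a numeric dtype; None becomes NaN along the way.
fixed = df_obj.apply(pd.to_numeric)
print(fixed.dtypes)
print(fixed)
```

With numeric columns and a NaN in place of None, pt.dmatrices('y ~ X', fixed) should drop the incomplete row as in the output above, since dropping rows with missing values is patsy's default handling.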
ayhan
  • Thanks. I think the second option should be: df = pd.DataFrame(np.array([(1,3),(2,6),(4,2),(6,np.nan)]), columns=['y','X']). With a bare nan it throws NameError: name 'nan' is not defined – rsgmon Apr 06 '16 at 16:34
  • You are right. Canopy imports them automatically, so I sometimes forget to include np. Editing now. – ayhan Apr 06 '16 at 17:35