I'm using patsy.dmatrices to split my data into y and x, and I am losing observations. Example:

import patsy

formula = 'target ~ v1 + v2 + v3'
y, x = patsy.dmatrices(formula, df, return_type='dataframe')

My df has about 54,000,000 rows, but after the x/y split, y.shape and x.shape both report only about 43,000,000 observations. I've checked df.isna().sum() and it returns 0 for every column. Can someone explain what is going on, or how to fix it? I've performed the same split on the same dataframe with a different variable in place of v1, e.g.

formula = 'target ~ v99 + v2 + v3'
y, x = patsy.dmatrices(formula, df, return_type='dataframe')

and had no issues with the dimensions.

  • Can you pass the `NA_action='raise'` parameter to make sure about the nulls? Also check which rows are missing: is there anything notable about their values? – Chris Dec 30 '20 at 21:10
  • Figured it out. I had a "NaN" string that was not being recognized by my isna() check. Thank you, Chris. – Joshua Paiva Dec 31 '20 at 17:05
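
For anyone hitting the same symptom, here is a minimal sketch of how such entries could be spotted. It assumes the cause described in the comment above: text values such as "NaN" that isna() does not flag because they are ordinary strings. The helper names `is_hidden_nan` and `count_hidden_nan` are hypothetical, not part of pandas or patsy.

```python
import math


def is_hidden_nan(value):
    """Return True for strings that parse to NaN, e.g. 'NaN' or ' nan '.

    Real float NaN and None are deliberately ignored here, because
    isna() already reports those.
    """
    if not isinstance(value, str):
        return False
    try:
        return math.isnan(float(value))
    except ValueError:
        # Ordinary text such as 'a' cannot be parsed as a float.
        return False


def count_hidden_nan(values):
    """Count the string entries in an iterable that parse to NaN."""
    return sum(1 for v in values if is_hidden_nan(v))


# Illustrative data only, not taken from the question's dataframe.
sample = ["a", "NaN", "b", " nan ", 1.5, float("nan")]
print(count_hidden_nan(sample))
```

On a dataframe, something like `df['v1'].map(is_hidden_nan).sum()` should give the count for the column that differs between the two formulas. If the count is non-zero, converting those strings to real missing values (or to a proper category label) before calling dmatrices would make the row loss visible to isna().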
