I am using patsy.dmatrices to split my data into y and x, and I am losing observations. For example:
import patsy

formula = 'target ~ v1 + v2 + v3'
y, x = patsy.dmatrices(formula, df, return_type='dataframe')
My df.shape shows ~54,000,000 rows; however, after the x/y split, y.shape and x.shape both show only around 43,000,000 observations. I've checked df.isna().sum() and it is 0 for every column. Can someone explain what is going on, or how to fix it? I've performed the same split on the same dataframe with a different variable in place of v1, e.g.:
formula = 'target ~ v99 + v2 + v3'
y, x = patsy.dmatrices(formula, df, return_type='dataframe')
and had no issues with the dimensions.
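Since swapping v1 for v99 makes the problem go away, v1 looks like the place to start. Below is a minimal sketch of how the dropped rows could be located. It assumes the dataframe index is unique and relies on patsy keeping the original index labels when return_type='dataframe'. The toy frame is a stand-in for the real data and deliberately contains a NaN so that a row is dropped; it does not claim to reproduce the actual cause here. Passing NA_action='raise' asks patsy to raise an error instead of silently dropping rows it considers missing.

```python
import numpy as np
import pandas as pd
import patsy

# Toy stand-in for the real frame (hypothetical values; column names follow the question).
# The NaN in v1 is there only to trigger patsy's default row dropping.
df = pd.DataFrame({
    "target": [1.0, 0.0, 1.0, 0.0],
    "v1": [0.5, np.nan, 1.5, 2.0],
    "v2": [1.0, 2.0, 3.0, 4.0],
    "v3": [4.0, 3.0, 2.0, 1.0],
})

formula = "target ~ v1 + v2 + v3"
y, x = patsy.dmatrices(formula, df, return_type="dataframe")

# Index labels present in df but absent from the design matrix are the dropped rows.
dropped = df.index.difference(x.index)
print(len(df), len(x), list(dropped))

# Look at the raw values of the formula's variables in those rows.
print(df.loc[dropped, ["v1", "v2", "v3"]])

# With NA_action="raise", patsy raises instead of dropping rows with missing values.
try:
    patsy.dmatrices(formula, df, NA_action="raise", return_type="dataframe")
    raised = False
except patsy.PatsyError as err:
    raised = True
    print("patsy reported missing data:", err)
```

On the real data, inspecting df.loc[dropped, 'v1'] (including its dtype and a few raw values) should show what patsy is treating as missing in that column.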