
So, I know that in R you can provide data for a logistic regression in this form:

model <- glm( cbind(count_1, count_0) ~ [features] ..., family = 'binomial' )

Is there a way to do something like cbind(count_1, count_0) with sklearn.linear_model.LogisticRegression? Or do I actually have to provide all those duplicate rows? (My features are categorical, so there would be a lot of redundancy.)
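For context, one common workaround in sklearn is to keep the aggregated table and pass the counts through `fit`'s `sample_weight` argument instead of expanding to one row per observation. A minimal sketch with hypothetical numbers (the feature values and counts below are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Aggregated data: one row per distinct feature pattern, with
# success/failure counts (hypothetical numbers).
X_agg = np.array([[0.0], [1.0]])   # one already-encoded feature
count_1 = np.array([80, 30])       # number of 1-outcomes per pattern
count_0 = np.array([10, 60])       # number of 0-outcomes per pattern

# Duplicate each pattern twice (once with y=1, once with y=0) and pass
# the counts as sample_weight -- this gives the same likelihood as
# expanding to one row per observation.
X = np.vstack([X_agg, X_agg])
y = np.concatenate([np.ones(len(X_agg)), np.zeros(len(X_agg))])
w = np.concatenate([count_1, count_0]).astype(float)

# Large C ~ almost no regularization, closer to R's unpenalized glm.
clf = LogisticRegression(C=1e9)
clf.fit(X, y, sample_weight=w)
```

The fitted coefficients should match a fit on the fully expanded (row-per-observation) dataset, since the weighted log-likelihood is identical.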

user
1 Answer


If your features are categorical, you should provide a binarized version of them. I don't know exactly how that R code works, but you should always binarize categorical features, because you need to make explicit that each value of a feature is unrelated to the others: for a feature "blood_type" with possible values 1, 2, 3, 4, your classifier must learn that 2 is not related to 3, and 4 is not related to 1, in any sense. This is achieved by binarization.
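A minimal sketch of this binarization with sklearn's `OneHotEncoder`, using the hypothetical blood_type values from above:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical feature with values 1..4, as in the
# blood_type example above.
blood_type = np.array([[1], [2], [3], [4], [2]])

# Each distinct value becomes its own indicator column, so the model
# cannot infer any ordering between, say, 2 and 3.
enc = OneHotEncoder()
X_bin = enc.fit_transform(blood_type).toarray()  # dense for inspection
```

Each row of `X_bin` has exactly one 1, in the column for that row's category.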

If you end up with too many features after binarization, you can reduce the dimensionality of the binarized dataset with FeatureHasher, or with more sophisticated methods such as PCA.
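A short sketch of the FeatureHasher route (feature names and `n_features=8` are arbitrary choices for illustration): it hashes each category name into a fixed number of columns, so the output width stays bounded no matter how many categories appear.

```python
from sklearn.feature_extraction import FeatureHasher

# FeatureHasher maps feature names to a fixed number of columns via
# hashing, capping dimensionality regardless of how many distinct
# categories exist.  n_features=8 is arbitrary here.
hasher = FeatureHasher(n_features=8, input_type="dict")
X_hashed = hasher.transform([
    {"blood_type=2": 1, "city=Paris": 1},
    {"blood_type=4": 1, "city=Oslo": 1},
])
# X_hashed is a sparse matrix with exactly 8 columns.
```

Note that hashing can collide two categories into the same column; that is the price of the fixed width.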

Ibraim Ganiev
  • Perhaps my question was unclear. I know how to make a dummy matrix. I was asking how to, rather than sending in rows with indicator variables 1 and 0, sum over all identical rows and send in (80 1's, 10 0's) rather than 90 rows with all the same features. – user Apr 21 '16 at 17:07
  • @Erin, Hmm, I still don't understand what you mean. Maybe you want to use a sparse matrix? By "row" do you mean a separate sample of your dataset? – Ibraim Ganiev Apr 21 '16 at 17:23
  • I thought sparsity referred to the features, not the outcome. I found a way to do it with statsmodels instead of sklearn [here](http://statsmodels.sourceforge.net/devel/examples/notebooks/generated/glm.html). – user Apr 22 '16 at 17:51