I have the dataset shown below. Any value between 500 and 900 was categorized as A, while values between 900 and roughly 1500 were mixed between A and B. I want to find the probability of getting A, B, or C at any value of x, where x is my independent variable and A, B, and C are the levels of my dependent variable. This seems to be a good fit for multinomial logistic regression, and I believe the number of observations for each class is sufficient. If multinomial logistic regression is appropriate, I wish to use scikit-learn's logistic regression module to obtain the probabilities of A, B, and C at any value of x, but I am not sure how to approach this with that module.
It looks like you have what's called a mixture distribution. A, B, and C each have their own distributions, and what you observe is p(A) p(x | A) + p(B) p(x | B) + p(C) p(x | C). Typically (not necessarily) one applies a so-called expectation-maximization (EM) algorithm to find the mixing weights p(A), p(B), p(C) and parameters for p(x | A), p(x | B), p(x | C). However these are very general comments and what you should do depends strongly on the details of your problem. Probably you should take this to stats.stackexchange.com for discussion. – Robert Dodier Nov 13 '17 at 19:08
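To make the commenter's EM suggestion concrete, here is a hedged sketch using scikit-learn's `GaussianMixture` (one possible tool for fitting a mixture by EM; the comment does not name a specific library, and the data below are fabricated stand-ins for the poster's dataset):

```python
# Sketch of EM-based mixture fitting with scikit-learn's GaussianMixture.
# The data are synthetic: three overlapping 1-D components standing in for
# p(A)p(x|A) + p(B)p(x|B) + p(C)p(x|C), since the real dataset is not shown.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
x = np.concatenate([
    rng.normal(700, 80, 200),    # hypothetical component roughly matching "A"
    rng.normal(1200, 150, 150),  # hypothetical overlapping middle component
    rng.normal(1700, 100, 100),  # hypothetical third component
]).reshape(-1, 1)                # scikit-learn expects a 2-D array

gm = GaussianMixture(n_components=3, random_state=0).fit(x)

print(gm.weights_)        # estimated mixing weights p(A), p(B), p(C)
print(gm.means_.ravel())  # estimated component means
# Posterior responsibilities p(component | x) at a query point:
print(gm.predict_proba([[1000.0]]).round(3))
```

Note that `GaussianMixture` is unsupervised: it recovers components and mixing weights from x alone, which is a different framing from the supervised classification the question asks about.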
1 Answer
Personally, this looks like a reasonable candidate for logistic regression, but the fact that the data are 1-dimensional with overlapping classes may make them hard to separate in those regions. I'm mainly here to answer the second part of your question, which generalizes to pretty much any other classifier within scikit-learn.
I recommend looking at the scikit-learn documentation for SGDClassifier, since it has a simple example right below the attribute list; just replace SGDClassifier with the LogisticRegression class. http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html#sklearn.linear_model.SGDClassifier
Here’s also the documentation for LogisticRegression: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression
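Putting that together, here is a minimal sketch of the fit-then-`predict_proba` workflow with `LogisticRegression`. The data are fabricated to mimic the question's description (500–900 mostly A, 900–1500 mixed A/B, and a hypothetical C region above that), since the real dataset is not shown:

```python
# Sketch: multinomial logistic regression on synthetic 1-D data, then
# class probabilities at arbitrary x via predict_proba.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Fabricated labels mimicking the description: below 900 -> A,
# 900-1500 -> a random mix of A and B, above 1500 -> C (assumed).
x = rng.uniform(500, 2000, size=300)
labels = np.where(x < 900, "A",
         np.where(x < 1500, rng.choice(["A", "B"], size=300), "C"))

X = x.reshape(-1, 1)  # scikit-learn expects a 2-D feature matrix

# Scaling helps the solver converge when x spans hundreds to thousands.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X, labels)

# Probabilities of A, B, and C at any query values of x; each row of
# predict_proba aligns with clf.classes_ and sums to 1.
query = np.array([[700.0], [1200.0], [1800.0]])
print(clf.classes_)
print(clf.predict_proba(query).round(3))
```

The key method for your question is `predict_proba`, which returns one probability per class per query point, in the order given by `classes_`.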
