17

While using statsmodels, I am getting this error: ValueError: endog must be in the unit interval. Can someone explain what this error means? Searching for it has not turned up anything useful.

Code that produced the error:

"""
Multiple regression with dummy variables. 
"""

import pandas as pd
import statsmodels.api as sm
import pylab as pl
import numpy as np

df = pd.read_csv('cost_data.csv')
df.columns = ['Cost', 'R(t)', 'Day of Week']
dummy_ranks = pd.get_dummies(df['Day of Week'], prefix='days')
cols_to_keep = ['Cost', 'R(t)']
data = df[cols_to_keep].join(dummy_ranks.ix[:,'days_2':])
data['intercept'] = 1.0

print(data)

train_cols = data.columns[1:]
logit = sm.Logit(data['Cost'], data[train_cols])

result = logit.fit()

print(result.summary())

And the traceback:

Traceback (most recent call last):
  File "multiple_regression_dummy.py", line 20, in <module>
    logit = sm.Logit(data['Cost'], data[train_cols])
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/statsmodels/discrete/discrete_model.py", line 404, in __init__
    raise ValueError("endog must be in the unit interval.")
ValueError: endog must be in the unit interval.
Edward Yu
  • Perhaps check the condition that generates this error: `if (self.__class__.__name__ != 'MNLogit' and not np.all((self.endog >= 0) & (self.endog <= 1))): raise ValueError("endog must be in the unit interval.")` – DmitryK Jul 09 '15 at 16:29
  • What's your `Cost` data? Logit requires that the dependent variable (endog) is in the unit interval. If you want logistic regression with values in another interval, then you need to transform your values so that they are in the unit interval. However, Logit does not require that the `endog` values be the integers 0 and 1, so it can also be used for proportions. – Josef Jul 09 '15 at 18:06
  • Ah `Cost` is not in the unit interval. Any idea why Logit requires this? – Edward Yu Jul 09 '15 at 19:15
  • The underlying distribution of Logit is a Bernoulli distribution that takes on values 0 and 1. This can be extended to any values between 0 and 1, but the functions are not defined outside of the unit interval. If you have a positive dependent variable and an exponential mean function, then the Poisson distribution can be used, even if the data is continuous. For unbounded continuous data the usual model is OLS. – Josef Jul 09 '15 at 19:22
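
The check quoted in the first comment can be reproduced outside statsmodels to see why a particular column is rejected. The following is a minimal sketch using NumPy only; `unit_interval_report` is a hypothetical helper name (not part of statsmodels) and the sample values are made up.

```python
import numpy as np

def unit_interval_report(values):
    """Summarise why values would fail a 0 <= endog <= 1 check (hypothetical helper)."""
    arr = np.asarray(values, dtype=float)
    # Same comparison as the quoted condition; NaN compares False, so it fails too.
    inside = (arr >= 0) & (arr <= 1)
    missing = np.isnan(arr)
    return {
        "ok": bool(np.all(inside)),
        "n_missing": int(missing.sum()),
        "n_outside": int((~inside & ~missing).sum()),
    }

print(unit_interval_report([0, 1, 0.25]))            # proportions pass the check
print(unit_interval_report([120.5, 80.0, np.nan]))   # costs and NaN do not
```

Running it on the `Cost` column before building the model shows whether the problem is out-of-range values, missing values, or both.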

3 Answers

28

I got this error when my target column had values larger than 1. Make sure every value in your target column is between 0 and 1 (as statsmodels' Logit requires) and try again. For example, if your target column has values 1-5, make 4 and 5 the positive class (1) and 1, 2, 3 the negative class (0). Hope this helps.
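
A minimal sketch of that recoding with pandas, assuming a made-up 1-5 rating column (the `ratings` and `target` names are illustrative, not from the question):

```python
import pandas as pd

# Made-up ratings on a 1-5 scale standing in for a target column.
ratings = pd.Series([1, 2, 3, 4, 5, 4, 2])

# 4 and 5 become the positive class (1); 1, 2 and 3 become the negative class (0).
target = (ratings >= 4).astype(int)

print(target.tolist())
```

Where to put the threshold is a modelling decision; the point is only that the result contains nothing outside 0 and 1.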

user5323012
3

It seems like you followed the same logistic regression tutorial that I did: http://blog.yhat.com/posts/logistic-regression-and-python.html

I ended up getting the same ValueError when I fit my logistic regression, and the trick I needed to get it running was to drop all rows of my data with missing values (N/A or np.nan).

This can be done with the pandas function pandas.notnull() as follows:

data = data[pd.notnull(data['Cost'])]

data = data[pd.notnull(data['R(t)'])]

...

and so on, until all your variables have the same number of values to work with.
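
As an alternative sketch, pandas can do the same filtering in one call with `DataFrame.dropna`. The frame below is made up and only borrows the question's column names:

```python
import numpy as np
import pandas as pd

# Made-up frame using the question's column names, with gaps in two columns.
data = pd.DataFrame({
    "Cost": [0.0, 1.0, np.nan, 1.0],
    "R(t)": [0.5, np.nan, 0.7, 0.9],
})

# Drop every row that has a missing value in any of the listed columns.
clean = data.dropna(subset=["Cost", "R(t)"])

print(len(data), "->", len(clean))
```

Leaving out `subset` drops rows with a missing value in any column, which matches "until all your variables have the same number of values".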

Hope this helps someone else!

rnso
CodingCody
1

I had the same problem. I fixed it by changing the model from a classification model to a regression model (I was using a classification model, Logit, on a regression problem).

You can still use statsmodels, but with OLS, for example, instead of Logit. Logit (logistic regression) is for classification problems, but this looks like a regression problem, so using OLS could solve it.