0

[Preface: I now realize I should've used a classification model (maybe decision tree) instead but I ended up using a linear regression model.]

I had a pandas dataframe as such:

enter image description here

And I want to predict audience score using genre, year, tomato-meter score. But as is constructed, the genres for each movie came in a list, so I felt the need to isolate each genre to pass each genre into my model as separate variables.

After doing such, my modified dataframe looks like this, with duplicate rows for each movie, but each genre element of that movie isolated (just one movie pulled from the dataframe to show):

enter image description here

Now, my question is, can I pass in this second dataframe as is to Patsy and statsmodel linear regression, or will the row duplication introduce bias into my model?

y1, X1 = dmatrices('Q("Audience Score") ~ Year + Q("Tomato-meter") + Genre',
                   data=DF2, return_type='dataframe')

In summary, looking for a way for patsy and my model to recognize treat each genre as separate variables.. but want to make sure I'm not fudging the numbers/model by passing in a dataframe in this format as the data (as not every movie as the same # of genres).

SpicyClubSauce
  • 4,076
  • 13
  • 37
  • 62

2 Answers2

1

I see two problems with the approach:

parameter estimates:

If there are different number of repeated observations, the weight for observations with multiple categories will be larger than observations with only a single category. This could be corrected by using weights in the linear model. Use WLS with weights equal to the inverse of the number of repetitions (or the square root of it ?). Weights are not available for other models like Poisson or Logit or GLM-Binomial. This will not make a larger difference for the parameter estimates, if the "pattern", i.e. the underlying parameters are not systematically different across movies with different number of categories.

Inference, standard error of parameter estimates:

All basic models like OLS, Poisson and so on assume that each row is an independent observation. The total number of rows will be larger than the number of actual observations and the estimated standard errors of the parameters will be underestimated. (We could use cluster robust standard errors, but I never checked how well they work with duplicate observations, i.e. response is identical across several observations.)

Alternative

As an alternative to repeating observations, I would encode the categories into non-exclusive dummy variables. For example, if we have three levels of the categorical variable, movie categories in this case, then we add a 1 in each corresponding column if the observation is "in" that category.

Patsy doesn't have premade support for this, so the design matrix for the movie category would need to be build by hand or as the sum of the individual dummy matrices (without dropping a reference category).

alternative model

This is not directly related to the issue of multiple categories in the explanatory variables.

The response variable movie ratings is bound to be between 0 and 100. A linear model will work well as a local approximation, but will not take into account that observed ratings are in a limited range and will not enforce it for prediction.

Poisson regression could be used to take the non-negativity into account, but wouldn't use the upper bound. Two alternatives that will be more appropriate are GLM with Binomial family and a total count for each observation set to 100 (maximum possible rating), or use a binary model, e.g. Logit or Probit, after rescaling the ratings to be between 0 and 1. The latter corresponds to estimating a model for proportions which can be estimated with the statsmodels binary response models. To have inference that is correct even if the data is not binary, we can use robust standard errors. For example

result = sm.Logity(y_proportion, x).fit(cov_type='HC0')

Josef
  • 21,998
  • 3
  • 54
  • 67
0

Patsy doesn't have any built-in way to separate out a "multi-category" like your Genre variable, and as far as I know there's no direct way to represent it in Pandas either.

I'd break the Genre into a bunch of booleans columns, one per category: Mystery = True/False, Comedy = True/False, etc. That fits better with both pandas and patsy's way of representing things.

Nathaniel J. Smith
  • 11,613
  • 4
  • 41
  • 49