I have a question about crossed effects linear mixed models in statsmodels. Specifically, I can see two ways of representing my data and I'm not sure which is appropriate. Any advice appreciated!
My data are as below. I wish to ascertain whether the objective quality of a book ('good' or 'bad') predicts the score the book is assigned. Titles are nested within the quality variable, but titles and raters are crossed. (This is fake data so I'm not worried about models converging.)
rater title quality score
john book_1 good 0.600833333
frank book_2 bad 0.683020833
emma book_3 good 0.653645833
john book_4 bad 0.6528125
frank book_5 good 0.6040625
emma book_1 good 0.600833333
john book_2 bad 0.522
frank book_3 good 0.600833333
emma book_4 bad 0.619464286
john book_5 good 0.600833333
frank book_1 good 0.57125
emma book_2 bad 0.6296875
john book_3 good 0.607205882
frank book_4 bad 0.61203125
emma book_5 good 0.600833333
One way to analyse this data is to take quality as my independent variable, score as my dependent variable, rater as my grouping variable, and use variance components to capture the crossed effects on title. This gives:
import statsmodels.api as sm
import statsmodels.fomula.api as smf
md = smf.mixedlm('score ~ quality', vc_formula = {"title":"0 + title"}, groups = data['rater'], data = data).fit().summary()
Model summary:
Mixed Linear Model Regression Results
===========================================================
Model: MixedLM Dependent Variable: score
No. Observations: 15 Method: REML
No. Groups: 3 Scale: 0.0007
Min. group size: 5 Log-Likelihood: 22.1997
Max. group size: 5 Converged: Yes
Mean group size: 5.0
-----------------------------------------------------------
Coef. Std.Err. z P>|z| [0.025 0.975]
-----------------------------------------------------------
Intercept 0.620 0.001 841.098 0.000 0.618 0.621
quality[T.good] -0.015 0.013 -1.158 0.247 -0.041 0.011
title Var 0.001
===========================================================
This intuitively seems to me to be the correct approach. It gives me a p-value and coefficient for my IV and accounts for the crossed effects.
However, I've been advised elsewhere that crossed effects like this should be specified by treating the dataset as one group and specifying variation entirely using variance components. Thus:
data['groups'] = 1
md = smf.mixedlm('score ~ 1', vc_formula = {"rater":"0 + rater", "title":"0 + title", "quality":"0 + quality"}, groups = data['groups'], data = data).fit().summary()
Yielding:
Mixed Linear Model Regression Results
=====================================================
Model: MixedLM Dependent Variable: score
No. Observations: 15 Method: REML
No. Groups: 1 Scale: 0.0013
Min. group size: 15 Log-Likelihood: 24.4023
Max. group size: 15 Converged: No
Mean group size: 15.0
-----------------------------------------------------
Coef. Std.Err. z P>|z| [0.025 0.975]
-----------------------------------------------------
Intercept 0.612 0.017 35.011 0.000 0.577 0.646
quality Var 0.000
rater Var 0.000 0.020
title Var 0.000
=====================================================
This model offers me no p-value, different coefficients, and different model test statistics all around. Now, I'm either wrong in my use of both models, or I'm wrong in my use of one of them. Can anyone advise me which is the case? Thanks.