I am trying to duplicate the results from pure R code that uses lme4 glmer with Pandas -> R -> glmer. The original output is
%load_ext rpy2.ipython
%R library(lme4)
%R data("respiratory", package = "HSAUR2")
%R write.csv(respiratory, 'respiratory2.csv')
%R resp <- subset(respiratory, month > "0")
%R resp$baseline <- rep(subset(respiratory, month == "0")$status,rep(4, 111))
%R resp_lmer <- glmer(status ~ baseline + month + treatment + gender + age + centre + (1 | subject),family = binomial(), data = resp)
%R -o resp_lmer_summary resp_lmer_summary = summary(resp_lmer)
%R -o exp_res exp_res = exp(fixef(resp_lmer))
print resp_lmer_summary
print exp_res
The output is
Generalized linear mixed model fit by maximum likelihood (Laplace
Approximation) [glmerMod]
Family: binomial ( logit )
Formula: status ~ baseline + month + treatment + gender + age + centre +
(1 | subject)
Data: resp
AIC BIC logLik deviance df.resid
446.6 487.6 -213.3 426.6 434
Scaled residuals:
Min 1Q Median 3Q Max
-2.5855 -0.3609 0.1430 0.3640 2.2119
Random effects:
Groups Name Variance Std.Dev.
subject (Intercept) 3.779 1.944
Number of obs: 444, groups: subject, 111
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.65460 0.77621 -2.132 0.0330 *
baselinegood 3.08897 0.59859 5.160 2.46e-07 ***
month.L -0.20348 0.27957 -0.728 0.4667
month.Q -0.02821 0.27907 -0.101 0.9195
month.C -0.35571 0.28085 -1.267 0.2053
treatmenttreatment 2.16620 0.55157 3.927 8.59e-05 ***
gendermale 0.23836 0.66606 0.358 0.7204
age -0.02557 0.01994 -1.283 0.1997
centre2 1.03850 0.54182 1.917 0.0553 .
...
On the other hand, when I read the file through Pandas, pass it to glmer to R through rmagic, I get
import pandas as pd
df = pd.read_csv('respiratory2.csv',index_col=0)
baseline = df[df['month'] == 0][['subject','status']].set_index('subject')
df['status'] = (df['status'] == 'good').astype(int)
df['baseline'] = df.apply(lambda x: baseline.ix[x['subject']],axis=1)
df['centre'] = df['centre'].astype(str)
%R -i df
%R resp_lmer <- glmer(status ~ baseline + month + treatment + gender + age + centre + (1 | subject),family = binomial(), data = df)
%R -o res res = summary(resp_lmer)
%R -o exp_res exp_res = exp(fixef(resp_lmer))
print res
Output
Generalized linear mixed model fit by maximum likelihood (Laplace
Approximation) [glmerMod]
Family: binomial ( logit )
Formula: status ~ baseline + month + treatment + gender + age + centre +
(1 | subject)
Data: df
AIC BIC logLik deviance df.resid
539.2 573.7 -261.6 523.2 547
Scaled residuals:
Min 1Q Median 3Q Max
-3.8025 -0.4123 0.1637 0.4295 2.4482
Random effects:
Groups Name Variance Std.Dev.
subject (Intercept) 1.829 1.353
Number of obs: 555, groups: subject, 111
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.39252 0.60229 2.312 0.0208 *
baselinepoor -3.42262 0.46095 -7.425 1.13e-13 ***
month 0.12730 0.08465 1.504 0.1326
treatmenttreatment 1.59332 0.39981 3.985 6.74e-05 ***
gendermale 0.12915 0.49291 0.262 0.7933
age -0.01833 0.01480 -1.238 0.2157
centre2 0.70520 0.39676 1.777 0.0755 .
The results are somewhat different. When R reads the file itself, it turns month into something called "ordinal factor"; whereas from Pandas -> R this type is treated as numeric value, maybe that's the difference? I believe I was able to duplicate the derived column baseline correctly, I did have to turn status into 1/0 numeric value however, whereas pure R can work with this column as string (good/poor).
Note: Correction - I missed the filtering condition in Python part where only month>0 are taken. Once that is done
df = df[df['month'] > 0]
Then treatmenttreatment coefficient is 2.16, close to pure R. R still displays a positive baselinegood whereas Pandas -> R displays baselinepoor with negative coefficient, but I guess this is a minor difference.