3

I am trying to duplicate the results from pure R code that uses lme4 glmer with Pandas -> R -> glmer. The original output is

%load_ext rpy2.ipython
%R library(lme4)
%R data("respiratory", package = "HSAUR2")
%R write.csv(respiratory, 'respiratory2.csv')
%R resp <- subset(respiratory, month > "0")
%R resp$baseline <- rep(subset(respiratory, month == "0")$status,rep(4, 111))
%R resp_lmer <- glmer(status ~ baseline + month + treatment + gender + age + centre + (1 | subject),family = binomial(), data = resp)
%R -o resp_lmer_summary resp_lmer_summary = summary(resp_lmer)
%R -o exp_res exp_res = exp(fixef(resp_lmer))
print resp_lmer_summary
print exp_res

The output is

Generalized linear mixed model fit by maximum likelihood (Laplace
  Approximation) [glmerMod]
 Family: binomial  ( logit )
Formula: status ~ baseline + month + treatment + gender + age + centre +  
    (1 | subject)
   Data: resp

     AIC      BIC   logLik deviance df.resid 
   446.6    487.6   -213.3    426.6      434 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.5855 -0.3609  0.1430  0.3640  2.2119 

Random effects:
 Groups  Name        Variance Std.Dev.
 subject (Intercept) 3.779    1.944   
Number of obs: 444, groups:  subject, 111

Fixed effects:
                   Estimate Std. Error z value Pr(>|z|)    
(Intercept)        -1.65460    0.77621  -2.132   0.0330 *  
baselinegood        3.08897    0.59859   5.160 2.46e-07 ***
month.L            -0.20348    0.27957  -0.728   0.4667    
month.Q            -0.02821    0.27907  -0.101   0.9195    
month.C            -0.35571    0.28085  -1.267   0.2053    
treatmenttreatment  2.16620    0.55157   3.927 8.59e-05 ***
gendermale          0.23836    0.66606   0.358   0.7204    
age                -0.02557    0.01994  -1.283   0.1997    
centre2             1.03850    0.54182   1.917   0.0553 .  
...

On the other hand, when I read the file through Pandas, pass it to glmer to R through rmagic, I get

import pandas as pd
df = pd.read_csv('respiratory2.csv',index_col=0)
baseline = df[df['month'] == 0][['subject','status']].set_index('subject')
df['status'] = (df['status'] == 'good').astype(int)
df['baseline'] = df.apply(lambda x: baseline.ix[x['subject']],axis=1)
df['centre'] = df['centre'].astype(str)
%R -i df 
%R resp_lmer <- glmer(status ~ baseline + month + treatment + gender + age + centre + (1 | subject),family = binomial(), data = df)
%R -o res res = summary(resp_lmer)
%R -o exp_res exp_res = exp(fixef(resp_lmer))
print res

Output

Generalized linear mixed model fit by maximum likelihood (Laplace
  Approximation) [glmerMod]
 Family: binomial  ( logit )
Formula: status ~ baseline + month + treatment + gender + age + centre +  
    (1 | subject)
   Data: df

     AIC      BIC   logLik deviance df.resid 
   539.2    573.7   -261.6    523.2      547 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.8025 -0.4123  0.1637  0.4295  2.4482 

Random effects:
 Groups  Name        Variance Std.Dev.
 subject (Intercept) 1.829    1.353   
Number of obs: 555, groups:  subject, 111

Fixed effects:
                   Estimate Std. Error z value Pr(>|z|)    
(Intercept)         1.39252    0.60229   2.312   0.0208 *  
baselinepoor       -3.42262    0.46095  -7.425 1.13e-13 ***
month               0.12730    0.08465   1.504   0.1326    
treatmenttreatment  1.59332    0.39981   3.985 6.74e-05 ***
gendermale          0.12915    0.49291   0.262   0.7933    
age                -0.01833    0.01480  -1.238   0.2157    
centre2             0.70520    0.39676   1.777   0.0755 .  

The results are somewhat different. When R reads the file itself, it turns month into something called "ordinal factor"; whereas from Pandas -> R this type is treated as numeric value, maybe that's the difference? I believe I was able to duplicate the derived column baseline correctly, I did have to turn status into 1/0 numeric value however, whereas pure R can work with this column as string (good/poor).

Note: Correction - I missed the filtering condition in Python part where only month>0 are taken. Once that is done

df = df[df['month'] > 0]

Then treatmenttreatment coefficient is 2.16, close to pure R. R still displays a positive baselinegood whereas Pandas -> R displays baselinepoor with negative coefficient, but I guess this is a minor difference.

BBSysDyn
  • 4,389
  • 8
  • 48
  • 63
  • 1
    Yes, `month` is an ordered factor in R and consequently polynomial contrasts are used by `glmer`. In contrast `month` is a continuous variable in your rmagic output. – Roland Feb 20 '15 at 12:37
  • What is a polynomial contrast? Is that multiplying month byitself 2, 3, .. times? Could I simulate this by performing the multiplication myself in Pandas before handing data over to R? – BBSysDyn Feb 20 '15 at 13:23
  • take a look at `zapsmall(model.matrix(~f,d=data.frame(f=ordered(1:4))))` if you want to see the dummy variables established by the polynomial contrast. You should be able to substitute three numerical predictors with the corresponding values in the model for your `month` ordered variables. – Ben Bolker Feb 21 '15 at 22:33
  • PS the problem isn't "R vs. rpy2"; it's due to reading the data back in as numeric rather than ordered factor. (If you write and then read the data in R you'll lose information in the same way.) The brief browsing I did suggests that `pandas` doesn't have much infrastructure for dealing with categorical variables (I could be wrong!), so setting up the dummies yourself might be your best bet. – Ben Bolker Feb 21 '15 at 22:37

0 Answers0