I want to estimate logistic regression parameters using R's glm function. I'm working in Python and calling R through rpy2. For some reason, running glm directly in R gives results much faster than running it through rpy2. Do you know why the calculation through rpy2 is so much slower? I'm using R 2.13.1 and rpy2 2.0.8. Here is the code I'm using:

import numpy
from rpy2 import robjects as ro
import rpy2.rlike.container as rlc

def train(self, x_values, y_values, weights):
        # Transpose so that each predictor column becomes one R numeric vector.
        x_float_vector = [ro.FloatVector(x) for x in numpy.array(x_values).transpose()]
        y_float_vector = ro.FloatVector(y_values)
        weights_float_vector = ro.FloatVector(weights)
        names = ['v' + str(i) for i in xrange(len(x_float_vector))]
        # Assemble the named columns (predictors plus response 'y') into an R data frame.
        d = rlc.TaggedList(x_float_vector + [y_float_vector], names + ['y'])
        data = ro.RDataFrame(d)
        # Build the model formula string, e.g. 'y ~ v0+v1+v2'.
        formula = 'y ~ ' + '+'.join(names)
        fit_res = ro.r.glm(formula=ro.r(formula), data=data, weights=weights_float_vector, family=ro.r('binomial(link="logit")'))
user5497

1 Answer

Without the full R code you are benchmarking against, it is difficult to pinpoint where the problem might be.

You might want to run this through a Python profiler to see where the bottlenecks are.
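As a rough illustration of the profiling step, here is a minimal sketch using the standard library's cProfile and pstats modules. It is written for Python 3, and `build_inputs`, `fit`, `train`, and `profile_call` are hypothetical stand-ins rather than the asker's code; in practice the profiled callable would be the real `train` method.

```python
import cProfile
import io
import pstats


def build_inputs(n_rows, n_cols):
    """Hypothetical stand-in for the data-conversion step."""
    return [[float(r * c) for r in range(n_rows)] for c in range(n_cols)]


def fit(columns):
    """Hypothetical stand-in for the model-fitting step."""
    return [sum(col) / len(col) for col in columns]


def train(n_rows=2000, n_cols=20):
    """Hypothetical stand-in for the function being investigated."""
    columns = build_inputs(n_rows, n_cols)
    return fit(columns)


def profile_call(func, *args, **kwargs):
    """Run func under cProfile and return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time so the most expensive call chains come first.
    stats.sort_stats("cumulative").print_stats(10)
    return result, stream.getvalue()


result, report = profile_call(train)
print(report)
```

Sorting by cumulative time makes it easier to see whether the time goes into preparing the inputs or into the fitting call itself.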

Finally, the current release of rpy2 is 2.2.6. Besides API changes, it runs faster and (presumably) has fewer bugs than 2.0.8.

Edit: From your comments, I now suspect that you are calling your function in a loop, and that a large fraction of the time is spent building R vectors (which might only have to be built once).
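If that suspicion is right, the usual remedy is to do the conversion once, outside the loop, and reuse the converted objects for every fit. The sketch below only illustrates the pattern: `Converter` and `fit` are hypothetical stand-ins (plain Python, no rpy2) for the vector-building and model-fitting steps, and the data values are made up for the example.

```python
class Converter:
    """Hypothetical stand-in for building R vectors; counts how often it runs."""

    def __init__(self):
        self.calls = 0

    def convert(self, columns):
        self.calls += 1
        return [tuple(float(v) for v in col) for col in columns]


def fit(converted, weights):
    """Hypothetical stand-in for the fit: a weighted mean per column."""
    total = float(sum(weights))
    return [sum(v * w for v, w in zip(col, weights)) / total for col in converted]


def train_rebuilding(converter, columns, weight_sets):
    # Converts the same data again on every iteration.
    return [fit(converter.convert(columns), w) for w in weight_sets]


def train_reusing(converter, columns, weight_sets):
    # Converts once, then reuses the converted data for every fit.
    converted = converter.convert(columns)
    return [fit(converted, w) for w in weight_sets]


columns = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]
weight_sets = [[1, 1, 1], [1, 2, 1], [3, 1, 1]]

slow_converter = Converter()
fast_converter = Converter()
slow = train_rebuilding(slow_converter, columns, weight_sets)
fast = train_reusing(fast_converter, columns, weight_sets)

print(slow_converter.calls, fast_converter.calls)
```

Both variants produce the same fits; only the number of conversions differs. With rpy2, the analogous change would be to create the vectors and the data frame once and pass them to each glm call.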

lgautier
  • I'm using: glm(y~v1+v2+..., data=data). In this case, data is a data frame that was loaded from a CSV file and contains the same data that was sent to the "train" function in Python. – user5497 Jun 20 '12 at 14:05
  • I will also try checking the new version – user5497 Jun 20 '12 at 14:06
  • We've tried the new version and got the same result (same speed). In addition, the profiler shows that most of the time is spent in the last line (fit_res = ro.r.glm(formula=ro.r(formula), data=data, weights=weights_float_vector, family=ro.r('binomial(link="logit")'))) – user5497 Jul 17 '12 at 12:41
  • Without the R code to compare, it is harder to help identify the cause. – lgautier Jul 17 '12 at 17:22