
I am using the rpy2 package to bring some R functionality to Python. The R functions I'm using need a data.frame object, and by using rlike.TaggedList and then robjects.DataFrame I am able to make this work.

However, I'm having performance issues compared to running the exact same R functions on the exact same data, which led me to try the rpy2 low-level interface, as mentioned here: http://rpy.sourceforge.net/rpy2/doc-2.3/html/performances.html

So far I have tried:

  1. Using TaggedList with FloatSexpVector objects instead of numpy arrays, together with the DataFrame object.
  2. Dropping the TaggedList and DataFrame classes and using a dictionary instead, like this:

    d = dict((var_name, var_sexp_vector) for ...)
    dataframe = robjects.r('data.frame')(**d)
    

Neither gave me any noticeable speedup.

I have noticed that DataFrame objects can take a rinterface.SexpVector in their constructor, so I thought of creating such a named vector, but I have no idea how to set the names (in R I know it's just names(vec) = c('a','b'...)).

How do I do that? Is there another way? And is there an easy way to profile rpy itself, so I could know where the bottleneck is?

EDIT:

The following code seems to work great (4x faster) on a newer rpy2 (2.2.3):

    data = ro.r('list')([ri.FloatSexpVector(x) for x in vectors])[0]
    data.names = ri.StrSexpVector(vector_names)

However, it doesn't work on version 2.0.8 (the last one supported on Windows), since R can't seem to use the names: "Error in eval(expr, envir, enclos) : object 'y' not found"

Ideas?

EDIT #2: Someone did the fine job of building an rpy2.3 binary for Windows (Python 2.7); the code mentioned above works great with it (almost 6x faster for my code).

link: https://bitbucket.org/breisfeld/rpy2_w32_fix/issue/1/binary-installer-for-win32

itai

1 Answer


Python can be several times faster than R (even byte-compiled R), and I have managed to perform operations on R data structures with rpy2 faster than R itself would have. Sharing the relevant R and rpy2 code would help in giving more specific advice (and in improving rpy2 if needed).

In the meantime, SexpVector might not be what you want; it is little more than an abstract class for all R vectors (see class diagram for rpy2.rinterface). ListSexpVector might be more appropriate:

    import rpy2.rinterface as ri
    ri.initr()
    l = ri.ListSexpVector([ri.IntSexpVector((1,2,3)),
                           ri.StrSexpVector(("a","b","c")),])

An important detail is that R lists are recursive data structures, and R avoids a catch-22 type of situation by having the operator "[[" (in addition to "["). Python does not have that, and I have not (yet?) implemented "[[" as a method at the low level.

Profiling in Python can be done with the stdlib module cProfile, for example.
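As a minimal sketch of the cProfile suggestion (not a definitive recipe): the code below profiles a single call and prints the ten most expensive entries by cumulative time. `build_columns` and `profile_call` are hypothetical names made up for illustration; `build_columns` is a pure-Python stand-in for whatever step builds the vectors and the data.frame in the real code.

```python
import cProfile
import io
import pstats


def build_columns(n_cols=50, n_rows=2000):
    # Hypothetical stand-in for the conversion step being investigated;
    # in real code this would be the function that calls into rpy2.
    return {"v%d" % i: [float(j) for j in range(n_rows)] for i in range(n_cols)}


def profile_call(func, *args, **kwargs):
    # Run func under the profiler and return its result plus a text report.
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time and keep only the top 10 entries.
    stats.sort_stats("cumulative").print_stats(10)
    return result, stream.getvalue()


result, report = profile_call(build_columns)
print(report)
```

Note that cProfile only attributes time to calls it can see from the Python side, so time spent inside a C extension shows up as the cost of the call that entered it, without a further breakdown.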

lgautier
  • Thanks a lot for the answer. I know about cProfile of course; I was asking for the easiest way to profile the C-extension part of rpy. Two issues with the solution: first, on my Windows machines I am using an older rpy, which doesn't have ListSexpVector, and second (testing this on my Linux machine), it doesn't seem to work, as my data is a list of FloatSexpVectors and I'm getting the R runtime error "arguments imply differing number of rows" – itai Jul 18 '12 at 10:44
  • Check the rpy2 mailing list today. There is a link to a contributed build of a recent rpy2 for Win7. – lgautier Jul 18 '12 at 18:50
  • That's great news! I was already working on that myself and this saved me quite some work. Thanks for the heads up! – itai Jul 19 '12 at 11:34