9

I have a dataframe in Pandas, and I want to do some statistics on it using R functions. No problem! RPy makes it easy to send a dataframe from Pandas into R:

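import pandas as pd
from rpy2 import robjects as ro
from rpy2.robjects import pandas2ri

pandas2ri.activate()  # register the pandas-to-R converters
# Placeholder frame: 100,000 rows x 100 columns
df = pd.DataFrame(index=range(100000), columns=range(100))
ro.globalenv['df'] = df  # converts the frame and binds it as `df` in R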

And if we're in IPython:

%load_ext rmagic
%R -i df

For some reason the ro.globalenv route is slightly slower than the rmagic route, but no matter. What matters is this: The dataframe I will ultimately be using is ~100GB. This presents a few problems:

  1. Even with just 1GB of data, the transfer is rather slow.
  2. If I understand correctly, this creates two copies of the dataframe in memory: one in Python, and one in R. That means I'll have just doubled my memory requirements, and I haven't even gotten to running statistical tests!

Is there any way to:

  1. Transfer a large dataframe between Python and R more quickly?
  2. Access the same object in memory? I suspect this is asking for the moon.
Roman Luštrik
jeffalstott
  • That's an interesting question. I usually end up writing my data to disk and then reading it back in R. Needless to say, this is far from efficient. However, `python` and `R` are completely different languages. It's already amazing that something like `rpy` is possible in python. I doubt that it's possible to have a data frame structure that works for both python and R without major transformations. Looking forward to answers, though. – cel May 03 '15 at 09:28
  • Can you write to `.RData` file from Pandas? – Roman Luštrik May 03 '15 at 10:00
  • Probably not without converting to a `R` data frame first. – cel May 03 '15 at 10:20

2 Answers

6

rpy2 uses a conversion mechanism that tries to avoid copying objects when moving between Python and R. However, this currently only works in the direction R -> Python.

Python has an interface called the "buffer interface" that rpy2 uses to minimize the number of copies for C-level data structures that are compatible between R and Python (see http://rpy.sourceforge.net/rpy2/doc-2.5/html/numpy.html#from-rpy2-to-numpy; the doc seems outdated, as the __array_struct__ interface is no longer the primary choice).

There is no equivalent to the buffer interface in R, and the current concern holding me back from providing an equivalent functionality in rpy2 is the handling of borrowed references during garbage collection (and the lack of time to think sufficiently carefully about it).

So in summary, there is a way to share data between Python and R without copying, but it requires the data to be created in R.
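As a rough illustration of what the buffer interface provides, here is a minimal sketch using only the standard library. It does not use rpy2: the `array.array` named `r_like_vector` is a hypothetical stand-in for a C-level vector owned by another runtime, and `memoryview` plays the role of the consumer that wraps that memory without copying it.

```python
from array import array

# Stand-in for a C-level numeric vector owned by another runtime
# (hypothetical; with rpy2 this would be an R vector exposing its buffer).
r_like_vector = array("d", [1.0, 2.0, 3.0])

# memoryview consumes the buffer interface: it wraps the existing memory
# instead of copying it.
view = memoryview(r_like_vector)

# A write through the view is visible in the owner, so the memory is shared.
view[0] = 42.0
shared = r_like_vector[0] == 42.0

# By contrast, converting to a list makes an independent copy.
copied = view.tolist()
copied[1] = -1.0
owner_unchanged = r_like_vector[1] == 2.0
```

The same idea is what lets a consumer such as numpy view a compatible vector in place, which is why the R -> Python direction can avoid a copy.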

lgautier
  • 11,363
  • 29
  • 42
  • Thanks! Will this work for a Pandas `DataFrame`? That is, creating a `data.frame` in R and then sending it to Python to use as a `DataFrame`? What would the relevant commands be? – jeffalstott May 04 '15 at 04:08
  • Looking at the code for `pandas2ri.ri2py_dataframe` and `numpy2ri.ri2py_list`, it looks like this does _not_ happen by default for sending a `data.frame` to Python? Is that correct? – jeffalstott May 04 '15 at 16:44
  • `pandas2ri.ri2py_dataframe` is first using the `numpy` converter, and will try to turn the R list (R data frames inherit from lists) into a numpy data structure using `numpy.rec.fromarrays`. An alternative would be to first create a numpy `recarray` and populate it using `numpy.asarray(column_in_R_dataframe)`. – lgautier May 07 '15 at 02:33
  • I don't quite follow. It sounds like you're saying that no, `ri2py_dataframe` doesn't handle the data copying as we would want. But I don't see how the proposed alternative accomplishes the task. Won't creating the `recarray` just also create a copy? – jeffalstott May 07 '15 at 09:26
  • First create an initial minimal `recarray` matching the data frame, then populate each of its cells (columns) with the result of `asarray`. – lgautier May 07 '15 at 12:04
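
The mechanics of that last suggestion might look roughly like the sketch below. It uses numpy only: the `columns` dictionary is hypothetical stand-in data, where with rpy2 each value would instead come from `numpy.asarray(column_in_R_dataframe)`.

```python
import numpy as np

# Stand-ins for the columns of an R data frame (hypothetical data; with rpy2
# each would be numpy.asarray(column_in_R_dataframe)).
columns = {
    "x": np.asarray([1, 2, 3], dtype=np.int32),
    "y": np.asarray([0.5, 1.5, 2.5], dtype=np.float64),
}

# Build an empty record array whose fields match the column names and dtypes.
dtype = np.dtype([(name, col.dtype) for name, col in columns.items()])
n_rows = len(next(iter(columns.values())))
rec = np.recarray((n_rows,), dtype=dtype)

# Populate one field at a time from each column.
for name, col in columns.items():
    rec[name] = col

# Check whether the record array shares memory with the source column.
shares = np.shares_memory(rec["x"], columns["x"])
```

In this sketch `shares` is `False`: assigning into a field copies the values into the record array's own memory, so it shows how the array is built rather than a zero-copy transfer.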
4

Currently, feather seems to be the most efficient option for exchanging data between R data frames and pandas DataFrames.

TurtleIzzy
  • 997
  • 7
  • 14