I have a dataframe in pandas, and I want to run some statistics on it using R functions. No problem! rpy2 makes it easy to send a dataframe from pandas into R:
import pandas as pd
from rpy2 import robjects as ro

# Note: newer rpy2 versions may require activating the pandas converter first:
#   from rpy2.robjects import pandas2ri; pandas2ri.activate()
df = pd.DataFrame(index=range(100000), columns=range(100))
ro.globalenv['df'] = df
And if we're in IPython:
%load_ext rmagic
%R -i df
For some reason the ro.globalenv route is slightly slower than the rmagic route, but no matter. What matters is this: the dataframe I will ultimately be using is ~100GB. This presents a few problems:
- Even with just 1GB of data, the transfer is rather slow (see the rough sketch after this list).
- If I understand correctly, this creates two copies of the dataframe in memory: one in Python, and one in R. That means I'll have just doubled my memory requirements, and I haven't even gotten to running statistical tests!
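For concreteness, here is a rough sketch of how I'm eyeballing both issues. It is not a careful benchmark: it assumes psutil is installed and uses an ~800MB frame of random floats as a stand-in for the real data.
import time
import numpy as np
import pandas as pd
import psutil
from rpy2 import robjects as ro

# Stand-in for the real data: 1,000,000 rows x 100 float64 columns (~800MB).
df = pd.DataFrame(np.random.randn(1000000, 100))

proc = psutil.Process()
rss_before = proc.memory_info().rss

start = time.time()
ro.globalenv['df'] = df   # same assignment as above
elapsed = time.time() - start

# R runs embedded in this Python process, so its copy of the data
# shows up in the same resident set size.
rss_after = proc.memory_info().rss

print('transfer took %.1f s' % elapsed)
print('extra resident memory: %.2f GB' % ((rss_after - rss_before) / 1e9))
The numbers themselves will vary by machine; this is only to show what I mean by "slow" and "two copies".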
Is there any way to:
- transfer a large dataframe between Python and R more quickly?
- access the same object in memory? I suspect this is asking for the moon.