
A large data frame (a couple of million rows, a few thousand columns) is created with Pandas in Python. This data frame is to be passed to R using pyRserve. This has to be quick, a few seconds at most.

There is a to_json function in pandas. Is converting to and from JSON the only way to pass such large objects, and is it OK for objects this large?

I can always write it to disk and read it back in R (fast using fread, and that is what I have done), but what is the best way to do this?
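For concreteness, the disk-based version looks roughly like this (the file path and the Rserve connection details are illustrative):

```python
import pandas as pd
import pyRserve

# Stand-in for the real multi-million-row frame
df = pd.DataFrame({"x": range(5), "y": [0.5 * v for v in range(5)]})

# Python side: dump to a plain CSV
df.to_csv("/tmp/frame.csv", index=False)

# R side, driven through pyRserve (assumes Rserve is running locally)
conn = pyRserve.connect()
conn.eval('library(data.table)')
conn.eval('dt <- fread("/tmp/frame.csv")')
conn.close()
```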

user1971988
    This question appears to be off-topic because it is an enhancement request to pyrserve (better to ask via [googlegroups](https://groups.google.com/forum/#!forum/pyrserve)). – Andy Hayden Aug 26 '13 at 12:34

1 Answer


Without having tried it out, to_json seems like a bad idea, and one that gets worse with larger dataframes: serializing to and parsing JSON carries a lot of overhead, both in writing and in reading the data.
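If you want to gauge that overhead before ruling it out, a quick measurement on a scaled-down stand-in for your frame would look something like this:

```python
import time
from io import StringIO

import numpy as np
import pandas as pd

# Scaled-down stand-in for the real frame
df = pd.DataFrame(np.random.randn(100_000, 100))

t0 = time.time()
payload = df.to_json(orient="split")   # DataFrame -> JSON string
t1 = time.time()
df2 = pd.read_json(StringIO(payload), orient="split")  # and back
t2 = time.time()

print(f"to_json: {t1 - t0:.2f}s, read_json: {t2 - t1:.2f}s, "
      f"payload size: {len(payload) / 1e6:.1f} MB")
```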

I'd recommend using rpy2 (which pandas supports directly) or, if you want to write something to disk (maybe because the dataframe is only generated once), HDF5 (see this thread for more information on interfacing pandas and R using this format).
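A minimal sketch of the in-memory route, assuming a recent rpy2 (the conversion API differs across rpy2 versions; this uses the 3.x interface):

```python
import pandas as pd
import rpy2.robjects as ro
from rpy2.robjects import pandas2ri
from rpy2.robjects.conversion import localconverter

df = pd.DataFrame({"x": range(5), "y": [0.1 * v for v in range(5)]})

# Convert the pandas DataFrame to an R data.frame entirely in memory
with localconverter(ro.default_converter + pandas2ri.converter):
    r_df = ro.conversion.py2rpy(df)

ro.globalenv["df"] = r_df
print(ro.r("summary(df)"))

# Disk alternative: HDF5 (needs PyTables on the Python side);
# read it back in R with e.g. the rhdf5 package.
# df.to_hdf("/tmp/frame.h5", key="df")
```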

filmor
  • Thanks @filmor, but I have to use pyRserve. I am currently writing to disk, but I was hoping there was a way of directly passing a pandas dataframe to R via pyRserve. – user1971988 Aug 26 '13 at 10:45
  • @user1971988 it appears pyRserve is not the right tool for the job (a search of their googlegroup and github repo reports no results for pandas), at least for the moment. – Andy Hayden Aug 26 '13 at 12:35
  • Thanks @AndyHayden. I guess I will have to wait for an in-memory solution. – user1971988 Aug 27 '13 at 06:46