13

I'd like to use pandas (along with numpy) for all my analysis, but use rpy2 for plotting my data. I want to do all analyses on pandas DataFrames and then use the full plotting capabilities of R via rpy2 to plot them. I am using rpy2, and am using ipython to plot. What's the correct way to do this?

Nearly all commands I try fail. For example:

  • I'm trying to plot a scatter between two columns of a pandas DataFrame df. I'd like the labels of df to be used in x/y axis just like would be used if it were an R dataframe. Is there a way to do this? When I try to do it with r.plot, I get this gibberish plot:

In: r.plot(df.a, df.b) # df is pandas DataFrame

yields:

Out: rpy2.rinterface.NULL

resulting in the plot:

[image: resulting scatter plot]

As you can see, the axes labels are messed up and it's not reading the axes labels from the DataFrame like it should (the X axis is column a of df and the Y axis is column b).

  • If I try to make a histogram with r.hist, it doesn't work at all, yielding the error:

    In: r.hist(df.a)
    Out: 
    ...
    vectors.pyc in <genexpr>((x,))
        293         if l < 7:
        294             s = '[' + \
    --> 295                 ', '.join((p_str(x, max_width = math.floor(52 / l)) for x in self[ : 8])) +\
        296                 ']'
        297         else:
    
    vectors.pyc in p_str(x, max_width)
        287                     res = x
        288                 else:
    --> 289                     res = "%s..." % (str(x[ : (max_width - 3)]))
        290             return res
        291 
    
    TypeError: slice indices must be integers or None or have an __index__ method
    

And resulting in this plot:

[image: resulting histogram]

Any idea what the error means? And again here, the axes are all messed up and littered with gibberish data.

EDIT: This error occurs only when using ipython. When I run the command from a script, it still produces the problematic plot, but at least runs with no errors. It must be something wrong with calling these commands from ipython.

  • I also tried to convert the pandas DataFrame df to an R DataFrame as recommended by the poster below, but that fails too with this error:

    com.convert_to_r_dataframe(mydf) # mydf is a pandas DataFrame
    ----> 1 com.convert_to_r_dataframe(mydf)
    in convert_to_r_dataframe(df, strings_as_factors)
        275     # FIXME: This doesn't handle MultiIndex
        276 
    --> 277     for column in df:
        278         value = df[column]
        279         value_type = value.dtype.type
    
    TypeError: iteration over non-sequence
    

How can I get these basic plotting features to work on Pandas DataFrame (with labels of plots read from the labels of the Pandas DataFrame), and also get the conversion between a Pandas DF to an R DF to work?

EDIT2: Here is a complete example of a csv file "test.txt" (http://pastebin.ca/2311928) and my code to answer @dale's comment:

import rpy2
from rpy2.robjects import r
import rpy2.robjects.numpy2ri
import pandas.rpy.common as com
from rpy2.robjects.packages import importr
from rpy2.robjects.lib import grid
from rpy2.robjects.lib import ggplot2
rpy2.robjects.numpy2ri.activate()
from numpy import *
import scipy

# load up pandas df
import pandas
data = pandas.read_table("./test.txt")
# plotting a column fails
print "data.c2: ", data.c2
r.plot(data.c2)
# Conversion and then plotting also fails
r_df = com.convert_to_r_dataframe(data)
r.plot(r_df)

The call to plot the column of "data.c2" fails, even though data.c2 is a column of a pandas df and therefore for all intents and purposes should be a numpy array. I use the activate() call so I thought it would handle this column as a numpy array and plot it.

The second call to plot the dataframe data after conversion to an R dataframe also fails. Why is that? If I load up test.txt from R as a dataframe, I'm able to plot() it and since my dataframe was converted from pandas to R, it seems like it should work here too.

When I do try rmagic in ipython, it does not fire up a plot window for some reason, though it does not error. I.e. if I do:

In [12]: X = np.array([0,1,2,3,4])

In [13]: Y = np.array([3,5,4,6,7])
In [14]: import rpy2

In [15]: from rpy2.robjects import r

In [16]: import rpy2.robjects.numpy2ri

In [17]: import pandas.rpy.common as com

In [18]: from rpy2.robjects.packages import importr

In [19]: from rpy2.robjects.lib import grid

In [20]: from rpy2.robjects.lib import ggplot2


In [21]: rpy2.robjects.numpy2ri.activate()

In [22]: from numpy import *

In [23]: import scipy

In [24]: r.assign("x", X)
Out[24]: 
<Array - Python:0x592ad88 / R:0x6110850>
[       0,        1,        2,        3,        4]

In [25]: r.assign("y", Y)
<Array - Python:0x592f5f0 / R:0x61109b8>
[       3,        5,        4,        6,        7]

In [27]: %R plot(x,y)

There's no error, but no plot window either. In any case, I'd like to stick to rpy2 and not rely on rmagic if possible.

Thanks.

3 Answers

7

[note: Your code in "edit 2" is working here (Python 2.7, rpy2-2.3.2, R-2.15.2).]

As @dale mentions, whenever R objects are anonymous (that is, no R symbol exists for the object), R's deparse(substitute()) ends up returning the structure() of the R object. A possible fix is to specify the "xlab" and "ylab" parameters; for some plots you'll also have to specify "main" (the title).

Another way to work around that is to use R's formulas and feed in the data frame (more below, after we work out the conversion part).

Forget about what is in pandas.rpy. It is both broken and seems to ignore features available in rpy2.

An earlier quick fix to conversion with ipython can be turned into a proper conversion rather easily. I am considering adding one to the rpy2 codebase (with more bells and whistles), but in the meantime just add the following snippet after all your imports in your code examples. It will transparently convert pandas' DataFrame objects into rpy2's DataFrame whenever an R call is made.

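from collections import OrderedDict
import rpy2.robjects as ro
from rpy2.robjects.vectors import ListVector

# keep a reference to the converter currently in place, used as a fallback
# (for Series this relies on numpy2ri.activate() having been called earlier)
py2ri_orig = rpy2.robjects.conversion.py2ri
def conversion_pydataframe(obj):
    if isinstance(obj, pandas.core.frame.DataFrame):
        # convert column by column, preserving the column order
        od = OrderedDict()
        for name, values in obj.iteritems():
            if values.dtype.kind == 'O':
                # columns of dtype "object" are converted as strings
                od[name] = rpy2.robjects.vectors.StrVector(values)
            else:
                od[name] = rpy2.robjects.conversion.py2ri(values)
        return rpy2.robjects.vectors.DataFrame(od)
    elif isinstance(obj, pandas.core.series.Series):
        # converted as a numpy array
        res = py2ri_orig(obj)
        # "index" is equivalent to "names" in R
        if obj.ndim == 1:
            res.names = ListVector({'x': ro.conversion.py2ri(obj.index)})
        else:
            res.dimnames = ListVector(ro.conversion.py2ri(obj.index))
        return res
    else:
        # anything else goes through the original converter
        return py2ri_orig(obj)
# install the new converter; it is used whenever an R call is made
rpy2.robjects.conversion.py2ri = conversion_pydataframe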

Now the following code will "just work":

r.plot(rpy2.robjects.Formula('c3~c2'), data)
# `data` was converted to an rpy2 data.frame on the fly,
# and the result is a scatter plot of c3 vs c2 (with "c2" and "c3" as
# the labels on the "x" axis and "y" axis).
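On more recent rpy2 releases, a similar effect can probably be had without patching py2ri. The sketch below is not part of the original answer. It assumes rpy2 3.x, where rpy2.robjects.pandas2ri.converter and rpy2.robjects.conversion.localconverter are available, plus a working R installation. The helper names formula_string and plot_formula are made up for the illustration.

```python
def formula_string(y, x):
    """Return an R formula such as 'c3 ~ c2' built from two column names."""
    return "{} ~ {}".format(y, x)


def plot_formula(df, y, x):
    """Scatter plot of df[y] against df[x], with the column names as labels.

    Assumption: rpy2 3.x and R are installed. rpy2 is imported lazily so
    that this module can still be loaded on a machine without R.
    """
    import rpy2.robjects as ro
    from rpy2.robjects import pandas2ri
    from rpy2.robjects.conversion import localconverter

    # Inside this context, pandas objects passed to R functions are
    # converted to their R counterparts (here: df becomes a data.frame).
    with localconverter(ro.default_converter + pandas2ri.converter):
        return ro.r["plot"](ro.Formula(formula_string(y, x)), data=df)


print(formula_string("c3", "c2"))

# Example use (requires R, rpy2 and pandas):
#   import pandas
#   data = pandas.read_table("./test.txt")
#   plot_formula(data, "c3", "c2")
```

Because the formula refers to columns by name, R labels the axes "c2" and "c3" without any explicit xlab/ylab arguments.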

I also note that you are importing ggplot2, without using it. Currently the conversion will have to be explicitly requested. For example:

p = ggplot2.ggplot(rpy2.robjects.conversion.py2ri(data)) +\
    ggplot2.geom_histogram(ggplot2.aes_string(x = 'c3'))
p.plot()
lgautier
  • Your code does not work for me - here's my complete example and its output http://pastebin.com/index/tAFG7dUV -- it complains now about datatype 'Series' not being convert-able. Any ideas? –  Feb 14 '13 at 20:55
  • If I add `activate()` it works but when I try it for a long dataframe, the error `rpy2.rinterface.RRuntimeError: Error in plot.window(...) : need finite 'xlim' values` occurs. It never works for any real dataframe. –  Feb 14 '13 at 21:03
  • I must have only looked at the first error-causing column and moved on when I fixed that one. The error message tells that rpy2 does not know how to convert objects of class `pandas.core.series.Series`. An `elif isinstance(obj, pandas.core.series.Series):` before `else:` and conversion code would trivially fix it. Since conversion of pandas data frames is now part of the rpy2 codebase (will be in release 2.3.3), this is now a bug report (https://bitbucket.org/lgautier/rpy2/issue/118/converion-of-pandas-series-missing). – lgautier Feb 14 '13 at 21:08
  • I was able to handle Series like you say but it's still broken downstream of that. It cannot handle nan values, so I can use `dropna()` to get rid of those. But even then, `r.plot(df)` never gives something reasonable on my dfs. It plots crazy things with weird labels, and when I try to get rid of the labels by passing `xlab="", ylab=""` to `r.plot`, it says `rpy2.rinterface.RRuntimeError: Error in plot.default(...) : formal argument "xlab" matched by multiple actual arguments` –  Feb 14 '13 at 21:12
  • Which `activate()` are you referring to? (The snippet of code in the answer does not have any, and there are two in the current code base: one in the numpy converter, one in the pandas converter.) Either it is working (your first sentence in the comment), or it is not (your last sentence in the same comment) ;-) . `print(rpy2.robjects.r.summary(rpy2.robjects.conversion.py2ri(data)))` would tell you a bit about what the conversion is returning. – lgautier Feb 14 '13 at 21:13
  • If you are now handling pandas' Series, then your pastebin code is no longer the version you are using. I suspect that `Series` are time series, and conversion to R might mean picking one of the R ways to represent them. The message `formal argument "xlab" matched by multiple actual arguments` comes from R, and tells you that you did not call the function plot properly. – lgautier Feb 14 '13 at 21:22
  • I just added specific handling of pandas' `Series` objects (in the answer and in rpy2's codebase) – lgautier Feb 16 '13 at 11:50
  • In `conversion_pydataframe`, what is `original_conversion`? – unutbu Jan 04 '15 at 01:51
  • @unutbu : copy/paste accident during a late edit (earlier versions of the answer did not have it). It is `py2ri_orig`. Note that the conversion system has changed a little with the 2.5.x series of rpy2 (now using single dispatch). – lgautier Jan 04 '15 at 05:29
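The NaN problem raised in the comments above can be handled on the pandas side before anything is passed to R. R's "need finite 'xlim' values" error typically shows up when a plotted column contains no finite values at all, so it can help to check what survives the cleanup. This is only a sketch under assumptions: the helper name finite_numeric and the toy data are invented for the example, and dropping rows may not be the right policy for every dataset.

```python
import numpy as np
import pandas as pd


def finite_numeric(df):
    """Keep the numeric columns and drop rows containing NaN or +/-inf."""
    numeric = df.select_dtypes(include=[np.number])
    # Treat infinities like missing values so that dropna() removes them too.
    numeric = numeric.replace([np.inf, -np.inf], np.nan)
    return numeric.dropna()


# Toy data (hypothetical): one NaN, one infinity, one non-numeric column.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [2.0, 5.0, np.inf, 8.0],
    "label": ["w", "x", "y", "z"],
})

clean = finite_numeric(df)
print(clean)
```

If `clean` comes back empty, R has nothing finite to compute axis limits from, which would explain the error.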
6

You need to pass in the labels explicitly when calling the r.plot function.

r.plot([1,2,3],[1,2,3], xlab="X", ylab="Y")

When you plot in R, it grabs the labels via deparse(substitute(x)), which essentially picks up the variable names from a call such as plot(testX, testY). When you pass in Python objects via rpy2, each one is an anonymous R object, akin to the following in R:

> deparse(substitute(c(1,2,3)))
[1] "c(1, 2, 3)"

which is why you're getting the crazy labels.
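One way to get the DataFrame's column names onto the axes is to derive xlab and ylab from the Series names and pass them explicitly. The following is a minimal sketch, not a definitive recipe: axis_labels and labelled_plot are hypothetical helper names, and the plotting function assumes that rpy2 and R are installed.

```python
import pandas as pd


def axis_labels(x, y):
    """Build xlab/ylab keyword arguments from the names of two Series."""
    return {"xlab": str(x.name), "ylab": str(y.name)}


def labelled_plot(x, y):
    """Scatter plot of two pandas Series using their names as axis labels.

    Assumption: rpy2 and R are available; rpy2 is imported lazily.
    """
    import rpy2.robjects as ro

    # Convert explicitly so the call does not depend on any active converter.
    rx = ro.FloatVector([float(v) for v in x])
    ry = ro.FloatVector([float(v) for v in y])
    return ro.r["plot"](rx, ry, **axis_labels(x, y))


df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
labels = axis_labels(df["a"], df["b"])
print(labels)

# labelled_plot(df["a"], df["b"])  # uncomment when R and rpy2 are installed
```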

A lot of times it's saner to use rpy2 to only push data back and forth.

r.assign('testX', df.A)
r.assign('testY', df.B)
%R plot(testX, testY)

rdf = com.convert_to_r_dataframe(df)
r.assign('bob', rdf)
%R plot(bob$$A, bob$$B)

http://nbviewer.ipython.org/4734581/
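The same "push the data, then plot by name" idea can be sketched without the %R magic, so that it runs from a plain script. Because the values are bound to R symbols first, deparse(substitute()) sees the names rather than an anonymous vector. The helper names r_plot_command and plot_by_name are assumptions made for this example, as is the presence of rpy2 and R.

```python
def r_plot_command(x_name, y_name):
    """Build the R source for a scatter plot of two named R variables."""
    return "plot({x}, {y})".format(x=x_name, y=y_name)


def plot_by_name(x_values, y_values, x_name="testX", y_name="testY"):
    """Bind the values to R symbols, then plot them by name.

    Assumption: rpy2 and R are available; rpy2 is imported lazily.
    """
    import rpy2.robjects as ro

    # Assign the data to symbols in R's global environment.
    ro.globalenv[x_name] = ro.FloatVector([float(v) for v in x_values])
    ro.globalenv[y_name] = ro.FloatVector([float(v) for v in y_values])
    # Evaluate the plot call as R source, so R sees named variables.
    return ro.r(r_plot_command(x_name, y_name))


print(r_plot_command("testX", "testY"))

# plot_by_name(df.A, df.B)  # uncomment when R and rpy2 are installed
```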

Dale
  • Thank you for your answer, but how can I get around the error I get when trying to call ``com.convert_to_r_dataframe(mydf)``? That seems to be independent of the plot labeling issue –  Feb 08 '13 at 14:55
  • also, how do you define ``%R`` in ipython? –  Feb 08 '13 at 14:56
  • Post an example dataframe or notebook. – Dale Feb 08 '13 at 17:24
  • %R is using the RMagic ipython extension. http://ipython.org/ipython-doc/dev/config/extensions/rmagic.html – Dale Feb 08 '13 at 17:24
  • thank you. I edited my post with complete code and an example csv file that hopefully should clarify the problem. –  Feb 08 '13 at 19:58
  • also, I don't want to rely on ipython etc. since I want my scripts to be executables as just "python script.py" - so if preferable I don't want to have to count on ``%R`` from ipython as useful as it is. Since I am plotting a numeric column of a pandas dataframe and dataframes of pandas are supposedly interchangeable with R, it seems like it should be able to handle ``r.plot(my_pandas_df.numeric_col)`` calls –  Feb 08 '13 at 20:07
5

Use rpy. The conversion is part of pandas (pandas.rpy.common, imported as com in the example below), so you don't need to do it yourself: http://pandas.pydata.org/pandas-docs/dev/r_interface.html

In [1217]: from pandas import DataFrame

In [1218]: df = DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C':[7,8,9]},
   ......:                index=["one", "two", "three"])
   ......:

In [1219]: r_dataframe = com.convert_to_r_dataframe(df)

In [1220]: print type(r_dataframe)
<class 'rpy2.robjects.vectors.DataFrame'>
jassinm
  • We added in 0.10.1 the ability to export in HDFStore, so that rhdf5 can read it - see http://pandas.pydata.org/pandas-docs/stable/io.html#external-compatibility – Jeff Feb 02 '13 at 03:29
  • This actually doesn't work... I get: ``275 # FIXME: This doesn't handle MultiIndex 276 --> 277 for column in df: 278 value = df[column] 279 value_type = value.dtype.type`` –  Feb 05 '13 at 23:07
  • @Jeff: The conversion doesn't work and it turns out even the most basic rpy2 calls to R do not work, see above edits –  Feb 06 '13 at 04:02