Is there a way to access R data frame column names in python/rpy2?

Question

I have an R data frame, saved in Database02.Rda. Loading it

import rpy2.robjects as robjects
robjects.r.load("Database02.Rda")

works fine. However:

print(robjects.r.names("df"))

yields

NULL

Also, as an example, column 214 (213 if we count starting with 0) is named REGION.

print(robjects.r.table(robjects.r["df"][213]))

works fine:

Region 1   Region 2   ...
    9811       3451   ...

but we should also be able to do

print(robjects.r.table("df$REGION"))

This, however, results in

df$REGION 
        1

(which it does also for column names that do not exist at all); also:

print(robjects.r.table(robjects.r["df"]["REGION"]))

gives an error:

TypeError: SexpVector indices must be integers, not str

Now, the docs say that names cannot be used for subsetting in Python. Am I correct to assume that the column names are not imported with the rest of the data when loading the data frame with python/rpy2? Am I thus correct that the easiest way to access them is to save and load them as a separate list and construct a dict or similar in Python that maps the names to the column index numbers? This does not seem very generic, however. Is there a way to extract the column names directly?

The versions of R, python, rpy2 I use are: R: 3.2.2 python: 3.5.0 rpy2: 2.7.8

score 5 · Accepted Answer · edited Jul 29 '17 at 17:24

When doing the following, you are loading whatever objects are in Database02.Rda into R's "global environment".

import rpy2.robjects as robjects
robjects.r.load("Database02.Rda")

robjects.globalenv is an Environment. You can list its contents with:

tuple(robjects.globalenv.keys())

I understand that one of your objects is called df. You can access it with:

df = robjects.globalenv['df']

If df is a list or a data frame, you can access its named elements with rx2 (the doc is your friend here again). To get the one called REGION, do:

df.rx2("REGION")

Listing all named elements in a list or data frame is easy:

tuple(df.names)
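
Once the names are available as a tuple, the name-to-index mapping mentioned in the question can be built in one line. The following is a minimal sketch in plain Python; the helper `column_index` and the sample names are hypothetical and stand in for the result of `tuple(df.names)`.

```python
def column_index(names):
    """Map each column name to its 0-based position."""
    return {name: position for position, name in enumerate(names)}


# Hypothetical column names standing in for tuple(df.names).
names = ("ID", "YEAR", "REGION")
index = column_index(names)

# With a real data frame, df[index["REGION"]] would use the same
# 0-based integer indexing as robjects.r["df"][213] in the question.
print(index["REGION"])
```

In most cases `df.rx2("REGION")` makes this mapping unnecessary; it is mainly useful when integer positions are needed anyway.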

edited Jul 29 '17 at 17:24

Franck Dernoncourt

answered Mar 03 '16 at 23:51

lgautier

Thanks for the answer. I would like to add that `df.rx2()` does not return the contents of the column themselves but rather replaces them by some kind of unique identifiers (integers between 1 and the number of unique elements in the column). For instance, `robjects.r.head(df.rx("REGION"))` gives `[ 15, 18, 9, 15, 15, 15]` instead of ` Region 1, Region 4, Region 9, Region 1, ...`. It does this even for columns with int values which renders any data analysis useless. In fact, I have no idea what the rx2() method does; I could not find a help page for it. – 0range Mar 04 '16 at 02:30
This is a problem of factors and levels, not necessarily present in any such dataset but, as it happened, in mine. Sorry for the confusion in my earlier comment above. To resolve, do `as_character = robjects.r['as.character']` and then `as_character(df.rx2("REGION"))` or `as_character(robjects.r["df"][colnames.index("REGION")])`. For variables that should be numeric (in my case this error, numeric data as characters, "10" instead of 10, was already present in the Rda file as I found out in the meantime), do `as_numeric(as_character(df.rx2("NUMERIC_VAR")))` – 0range Mar 04 '16 at 05:09
The mapping of R factors to Python is quite close to the R implementation, and this is not without its quirks. A lot of them are also present when working only in R, I think, but there might be a better way to do it than what rpy2 currently does... – lgautier Mar 04 '16 at 23:49

@lgautier `dfr = dfr.rx2(co) # where co = ["a", "b"]` does not work for me. Should it? If not, how do I select multiple columns by name? – The Unfun Cat Jun 30 '16 at 13:38
With the conversion rules shipping with rpy2, `co` will have to be either an R vector or a Python scalar of a type that can be turned into an R vector. Use `rpy2.robjects.vectors.StrVector` to build `co`. – lgautier Jul 02 '16 at 20:47
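
The factor issue described in the comments above comes down to how R stores a factor: as 1-based integer codes plus a vector of level labels. The sketch below illustrates that relationship in plain Python. The helper `decode_factor` and the sample data are hypothetical; with a real column, the codes and labels would have to come from the R object (for example through R's `levels` function).

```python
def decode_factor(codes, levels):
    """Return the level label for each 1-based R factor code.

    Missing values (NA) are not handled in this sketch.
    """
    return [levels[code - 1] for code in codes]


# Hypothetical data: what the integer view of a factor column looks like.
region_codes = [2, 1, 2]
region_levels = ["Region 1", "Region 2"]
print(decode_factor(region_codes, region_levels))

# Numbers stored as a factor need the same two steps as
# as.numeric(as.character(x)) in R: labels first, then conversion.
numeric_levels = ["10", "25"]
values = [float(label) for label in decode_factor([1, 2, 1], numeric_levels)]
print(values)
```

Converting the codes directly to numbers would return the codes themselves (1, 2, 1), not the stored values, which is why the detour through the labels is needed.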

score 3 · Answer 2 · answered Oct 22 '17 at 15:33

If you run R code in python, the global environment answer will not work. But kudos to @lgautier, the creator/maintainer of this package. In R, the dollar sign $ is used frequently. This is what I learned:

print(pamk_clusters$pamobject$clusinfo)

will not work, and its equivalent

print(pamk_clusters[["pamobject"]][["clusinfo"]])

also will not work ... however, after some digging in the "man"

http://rpy2.readthedocs.io/en/version_2.7.x/vector.html#extracting-r-style

Access to R-style extracting/subsetting is granted through the two delegators rx and rx2, representing the R functions [ and [[ respectively.

This works as expected

print(pamk_clusters.rx2("pamobject").rx2("clusinfo"))
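
For longer chains, the repeated `.rx2(...)` calls can be wrapped in a small helper. This is only a sketch: `rx2_path` is a hypothetical convenience function, not part of rpy2, and `FakeRList` is a stand-in class used here so that the example runs without R installed. The helper relies only on the object having an `rx2` method, as the rpy2 objects above do.

```python
from functools import reduce


def rx2_path(obj, *names):
    """Call obj.rx2(name) for each name in turn, roughly like obj$a$b in R."""
    return reduce(lambda current, name: current.rx2(name), names, obj)


class FakeRList:
    """Stand-in for an rpy2 list-like object; NOT part of rpy2."""

    def __init__(self, items):
        self._items = items

    def rx2(self, name):
        return self._items[name]


# Hypothetical nested structure mimicking pamk_clusters$pamobject$clusinfo.
fake = FakeRList({"pamobject": FakeRList({"clusinfo": "cluster info table"})})
print(rx2_path(fake, "pamobject", "clusinfo"))
```

With a real result, the call would read `rx2_path(pamk_clusters, "pamobject", "clusinfo")`.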

I commented in the forums about "man" clarity:

https://bitbucket.org/rpy2/rpy2/issues/436/acessing-dataframe-elements-using-rpy2

I am using rpy2 on Win7 with ipython. To help others dig through the formatting, here is a setup that seems to work:

import rpy2
import rpy2.robjects as robjects
import rpy2.robjects.packages as rpackages
from rpy2.robjects.packages import importr

base = importr('base')
utils = importr('utils')
utils.chooseCRANmirror(ind=1)

cluster = importr('cluster')
stats = importr('stats')
#utils.install_packages("fpc")
fpc = importr('fpc')

import pickle
with open ('points', 'rb') as fp:
    points = pickle.load(fp) 
# data above is stored as binary object
# online:  http://www.mshaffer.com/arizona/dissertation/points

import rpy2.robjects.numpy2ri as npr   
npr.activate()

k = robjects.IntVector(range(3, 8))   # r-syntax  3:7   # I expect 5
pamk_clusters = fpc.pamk(points,k)

print( base.summary(pamk_clusters) )
base.print( base.summary(pamk_clusters) )

utils.str(pamk_clusters)

# The next three lines use R syntax and are not valid Python; kept for reference only:
# print(pamk_clusters$pamobject$clusinfo)
# base.print(pamk_clusters$pamobject$clusinfo)
# print(pamk_clusters[["pamobject"]][["clusinfo"]])
print(pamk_clusters.rx2("pamobject").rx2("clusinfo"))

pam_clusters = cluster.pam(points,5)        # much slower
kmeans_clusters = stats.kmeans(points,5)    # much faster

utils.str(kmeans_clusters)

print(kmeans_clusters.rx2("cluster"))

R has been a standard for statistical computing for nearly 25 years, based on the forty-year-old S, from back when computing efficiency mattered a lot. See https://en.wikipedia.org/wiki/R_(programming_language)

Again @lgautier, thank you for making R more readily accessible within Python
