
Does a pandas DataFrame have an equivalent of `by` in R's data.table?

For example, in R I can do:

DT = data.table(x = c('a', 'a', 'a', 'b', 'b', 'b'), y = rnorm(6))
DT[, z := mean(y[1:2]), by = x]

Is there something similar in pandas?

rnorthcott
  • 1
    You can check [here](http://pandas.pydata.org/pandas-docs/stable/groupby.html) or [here](http://stackoverflow.com/questions/30244952/python-pandas-create-new-column-with-groupby-sum) – akrun Jan 08 '17 at 07:56
  • 1
    what is output? – jezrael Jan 08 '17 at 07:56
  • 2
    IIUC you need `df.groupby('x')['y'].mean()`. – jezrael Jan 08 '17 at 07:58
  • That'll do it, thanks – rnorthcott Jan 08 '17 at 08:05
  • On a tangent... A different way of doing this is simply to do the work in R, then push the result into the Python environment. Check out what is possible using the [rpy2](https://rpy2.readthedocs.io/en/version_2.8.x/introduction.html#examples) package. If you are using Jupyter notebooks, you can also use the [`rmagic`](https://ipython.org/ipython-doc/2/config/extensions/rmagic.html) cells to run R and Python next to each other, _pushing/pulling_ variables between the two environments within the same notebook - very cool! – n1k31t4 Jun 18 '17 at 14:11

1 Answer


To get the same output as in data.table, where we take the mean of the first two elements of 'y' within each group of 'x' and store it in a new column 'z', we can use `groupby` with `transform`:

mean1 = lambda x: x.head(2).mean()
df['z'] = df['y'].groupby(df['x']).transform(mean1)
print(df)
#   x         y         z
#0  a  1.329212  0.279589
#1  a -0.770033  0.279589
#2  a -0.316280  0.279589
#3  b -0.990810 -1.030813
#4  b -1.070816 -1.030813
#5  b -1.438713 -1.030813
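
As a self-contained check of the approach above, here is a minimal sketch that hard-codes the 'y' values from the R `DT` structure in the data section, so it does not depend on a random seed. The helper name `mean_first_two` is only illustrative and is not part of the original answer.

```python
import pandas as pd

# 'y' values copied from the R `DT` structure in the data section,
# so the result does not depend on a random seed.
df = pd.DataFrame({
    'x': ['a', 'a', 'a', 'b', 'b', 'b'],
    'y': [1.329212, -0.770033, -0.31628, -0.99081, -1.070816, -1.438713],
})


def mean_first_two(s):
    """Mean of the first two values of a group (illustrative helper name)."""
    return s.head(2).mean()


# transform() broadcasts each group's scalar result back to every row
# of that group, which is what `:=` with `by` does in data.table.
df['z'] = df.groupby('x')['y'].transform(mean_first_two)
print(df)
```

One difference worth keeping in mind: for a group with a single row, `mean(y[1:2])` in R returns `NA` because `y[2]` is missing, whereas `head(2).mean()` returns the mean of the one available value.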

For comparison, the OP's data.table code in R gives the same values:

library(data.table)
DT[, z := mean(y[1:2]), by = x]
DT
#   x         y          z
#1: a  1.329212  0.2795895
#2: a -0.770033  0.2795895
#3: a -0.316280  0.2795895
#4: b -0.990810 -1.0308130
#5: b -1.070816 -1.0308130
#6: b -1.438713 -1.0308130
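
The `:=` form above keeps all six rows. If one row per group is wanted instead (roughly what `DT[, .(z = mean(y[1:2])), by = x]` would return in data.table), a hedged sketch is to aggregate rather than transform, and then map the result back onto the rows only if needed. The variable name `per_group` is an assumption for illustration.

```python
import pandas as pd

# Same fixed 'y' values as in the R `DT` structure from the data section.
df = pd.DataFrame({
    'x': ['a', 'a', 'a', 'b', 'b', 'b'],
    'y': [1.329212, -0.770033, -0.31628, -0.99081, -1.070816, -1.438713],
})

# agg() returns one value per group, indexed by the group key 'x'.
per_group = df.groupby('x')['y'].agg(lambda s: s.head(2).mean())
print(per_group)

# Optional: broadcast the per-group values back to the rows by looking up
# each row's key in the aggregated Series.
df['z'] = df['x'].map(per_group)
print(df)
```

The mapped column should match what `transform` produces; the aggregated Series is simply handy when the per-group summary is needed on its own.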

Data

import pandas as pd
import numpy as np

# Fix the seed so the random 'y' values are reproducible
np.random.seed(24)
df = pd.DataFrame({'x': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'y': np.random.randn(6)})


DT <- structure(list(x = c("a", "a", "a", "b", "b", "b"),
y = c(1.329212, 
-0.770033, -0.31628, -0.99081, -1.070816, -1.438713)), .Names = c("x", 
"y"), class = c("data.table", "data.frame"), 
  row.names = c(NA, -6L))
akrun