0

In R, I can use a group_by and later create a column based on the result of that group_by without needing to join with another dataframe because the "group_by" function can work with not only summarise, but also mutate. Example in creating the max_date by each name:

library(tidyverse)
library(lubridate)

df = tibble(
  date = seq(ymd('2020-01-01'), ymd('2020-03-01'), '1 month') %>% rep(3),
  name = c(rep('Romulo', 3), rep('Daniel', 3), rep('Fernando', 3))
)

df %>% 
  group_by(name) %>% 
  mutate(max_date = max(date))

enter image description here

In pandas, as I know, we would have to use "agg" for date, create a new dataframe and than join with the "df" referred before:

import pandas as pd

df = pd.DataFrame({
   'date': ['2020-01-01', '2020-02-01', '2020-03-01'] * 3,
   'name': [
       'Romulo', 'Romulo', 'Romulo',
       'Daniel', 'Daniel', 'Daniel',
       'Fernando', 'Fernando', 'Fernando'
       ]
})

max_date_df = df.groupby(
    'name', as_index=False
).agg({'date': 'max'}).rename({'date': 'max_date'}, axis=1)

df.merge(max_date_df, on='name', how='left')

enter image description here

Is there an easier way to do that in pandas (without the need to join dataframes)?

1 Answers1

0

You can use transform():

df["max_date"] = df.groupby("name")['date'].transform('max')

Output:

         date      name    max_date
0  2020-01-01    Romulo  2020-03-01
1  2020-02-01    Romulo  2020-03-01
2  2020-03-01    Romulo  2020-03-01
3  2020-01-01    Daniel  2020-03-01
4  2020-02-01    Daniel  2020-03-01
5  2020-03-01    Daniel  2020-03-01
6  2020-01-01  Fernando  2020-03-01
7  2020-02-01  Fernando  2020-03-01
8  2020-03-01  Fernando  2020-03-01
langtang
  • 22,248
  • 1
  • 12
  • 27