2

Having a DataFrame with a timestamp column, thanks to groupby, pd.Grouper and a for loop, I am able to group rows by periods and keep track of the group label in the original DataFrame.

For instance, considering following DataFrame, and periods of 2 hours:

import pandas as pd
df1 = pd.DataFrame({'humidity': [0.3, 0.8, 0.9],
                    'pressure': [1e5, 1.1e5, 0.95e5],
                    'location': ['Paris', 'Paris', 'Milan']},
                    index = [pd.Timestamp('2020/01/02 01:59:00'),
                             pd.Timestamp('2020/01/02 03:59:00'),
                             pd.Timestamp('2020/01/02 02:59:00')])
grps = df1.groupby(pd.Grouper(freq='2H', origin='start_day'))
for gr in grps:
    df1.loc[gr[1].index,'grp'] = gr[0]

Result is then:

df1
Out[23]: 
                     humidity  pressure location                 grp
2020-01-02 01:59:00       0.3  100000.0    Paris 2020-01-02 00:00:00
2020-01-02 03:59:00       0.8  110000.0    Paris 2020-01-02 02:00:00
2020-01-02 02:59:00       0.9   95000.0    Milan 2020-01-02 02:00:00

Intending to manage large Datasets, I wonder if there is not a way to get rid of this for loop? Is there a function or a parameter in groupby to retrieve the original DataFrame, only with a new column embedding the name of the label?

Thanks for your help. Bests,

pierre_j
  • 895
  • 2
  • 11
  • 26

1 Answers1

1

Use GroupBy.transform with any column name:

grps = df1.groupby(pd.Grouper(freq='2H', origin='start_day'))
for gr in grps:
    print (gr)
    df1.loc[gr[1].index,'grp'] = gr[0]

df1['new'] = grps['humidity'].transform(lambda x: x.name)
print (df1)
                     humidity  pressure location                 grp  \
2020-01-02 01:59:00       0.3  100000.0    Paris 2020-01-02 00:00:00   
2020-01-02 03:59:00       0.8  110000.0    Paris 2020-01-02 02:00:00   
2020-01-02 02:59:00       0.9   95000.0    Milan 2020-01-02 02:00:00   

                                    new  
2020-01-02 01:59:00 2020-01-02 00:00:00  
2020-01-02 03:59:00 2020-01-02 02:00:00  
2020-01-02 02:59:00 2020-01-02 02:00:00  
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Hi, thanks for the tip. I get however an error message: 'TypeError: cannot astype a datetimelike from [datetime64[ns]] to [float64]' My pandas version is 1.1.3. Is it the same as yours? – pierre_j Jan 29 '21 at 09:47
  • 1
    @pierre_j - My version is `pandas 1.2.0` – jezrael Jan 29 '21 at 09:51
  • 1
    thanks @jezrael. I confirm I upgraded to pandas 1.2.1, and it works without the error message! Thanks again! – pierre_j Jan 29 '21 at 11:23