I have a DataFrame like this:

gpi_data[['sig','hourtime']]
Out[28]: 
                   sig                   hourtime
datetime_doy                                     
2007-01-02   -8.963545 2007-01-02 09:20:11.249998
2007-01-03   -8.671357 2007-01-03 10:39:31.874991
2007-01-03   -8.996480 2007-01-03 20:22:59.999006
2007-01-04   -8.835958 2007-01-04 10:18:56.249024
2007-01-05   -8.785034 2007-01-05 21:21:39.374002
               ...                        ...
2019-12-30   -8.529724 2019-12-30 20:23:01.874996
2019-12-30   -8.563781 2019-12-30 20:48:28.125016
2019-12-30   -8.504211 2019-12-30 21:23:44.999996
2019-12-31   -8.460620 2019-12-31 09:39:31.873999
2019-12-31   -8.230092 2019-12-31 10:18:58.125014

[7983 rows x 2 columns]

I want to calculate the average of sig for each morning and each evening, based on hourtime. By morning I mean data observed around 10:00:00, and by evening I mean data observed around 22:00:00. If a day has no values for the morning or the evening, that slot should be filled with np.nan. For example, 2007-01-01 has no morning or evening values of sig, so both slots are filled with np.nan. 2007-01-02 has only a morning value, so its evening value is filled with np.nan.

Specifically, 2019-12-30 has three evening values, at 2019-12-30 20:23:01.874996, 2019-12-30 20:48:28.125016 and 2019-12-30 21:23:44.999996, so we need the average of -8.529724, -8.563781 and -8.504211. The same applies to the last two data points on the morning of 2019-12-31: they need to be averaged, and the evening of 2019-12-31 is filled with np.nan.

So ideally the final result would be:

gpi_data[['sig','hourtime']]
Out[28]: 
                   sig                   hourtime
datetime_doy                                     
2007-01-01   nan       2007-01-01 10:00:00
2007-01-01   nan       2007-01-01 22:00:00
2007-01-02   -8.963545 2007-01-02 09:20:11.249998
2007-01-02   nan       2007-01-02 22:00:00
2007-01-03   -8.671357 2007-01-03 10:39:31.874991
2007-01-03   -8.996480 2007-01-03 20:22:59.999006
2007-01-04   -8.835958 2007-01-04 10:18:56.249024
2007-01-04   nan       2007-01-04 22:00:00
2007-01-05   nan       2007-01-05 10:00:00
2007-01-05   -8.785034 2007-01-05 21:21:39.374002
               ...                        ...
2019-12-30   -8.532572 2019-12-30 22:00:00
2019-12-31   -8.345356 2019-12-31 09:39:31.873999
2019-12-31   nan       2019-12-31 22:00:00

It is also fine to round every hourtime to 10:00:00 or 22:00:00, like below:

gpi_data[['sig','hourtime']]
Out[28]: 
                   sig                   hourtime
datetime_doy                                     
2007-01-01   nan       2007-01-01 10:00:00
2007-01-01   nan       2007-01-01 22:00:00
2007-01-02   -8.963545 2007-01-02 10:00:00
2007-01-02   nan       2007-01-02 22:00:00
2007-01-03   -8.671357 2007-01-03 10:00:00
2007-01-03   -8.996480 2007-01-03 22:00:00
2007-01-04   -8.835958 2007-01-04 10:00:00
2007-01-04   nan       2007-01-04 22:00:00
2007-01-05   nan       2007-01-05 10:00:00
2007-01-05   -8.785034 2007-01-05 22:00:00
               ...                        ...
2019-12-30   -8.532572 2019-12-30 22:00:00
2019-12-31   -8.345356 2019-12-31 10:00:00
2019-12-31   nan       2019-12-31 22:00:00

How can I do this? Can anybody help? Thanks!

rpanai
Xu Shan

1 Answer

Use cut to bin the hour of hourtime into two labels, 10 for morning and 22 for evening. The bin thresholds used here are 12 and 23 hours.
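To make the bin edges concrete, here is a small sketch of how pd.cut treats hours at the boundaries. The sample hours are hypothetical and chosen only to hit the edges. By default the intervals are closed on the right, so with bins=[0,12,23] hour 0 falls outside the first bin unless include_lowest=True is passed.

```python
import pandas as pd

# Hypothetical sample hours, chosen to hit the bin edges.
hours = pd.Series([0, 9, 12, 13, 23])

# Default: intervals are (0, 12] and (12, 23], so hour 0 gets no label (NaN).
default_labels = pd.cut(hours, bins=[0, 12, 23], labels=[10, 22])

# include_lowest=True closes the first interval on the left: [0, 12].
closed_labels = pd.cut(hours, bins=[0, 12, 23], labels=[10, 22],
                       include_lowest=True)

print(pd.DataFrame({"hour": hours,
                    "default": default_labels,
                    "include_lowest": closed_labels}))
```

Note that hour 12 is counted as morning under these bins. If observations between 12:00 and 12:59 should count as evening, the thresholds would need adjusting.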

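Then build a MultiIndex with MultiIndex.from_product that pairs every day, from January 1 of the earliest year to December 31 of the latest year, with both labels. Aggregate with mean, add the missing combinations with Series.reindex, and finally rebuild the hourtime column: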

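import pandas as pd

# Label hours 0-12 as 10 (morning) and 13-23 as 22 (evening);
# include_lowest=True keeps hour 0 inside the first bin.
df['hourtime'] = pd.cut(df['hourtime'].dt.hour, bins=[0,12,23], labels=[10,22],
                        include_lowest=True)

# Every calendar day from the first to the last year, crossed with both labels.
start = pd.Timestamp(year=df.index.year.min(), month=1, day=1)
end = pd.Timestamp(year=df.index.year.max(), month=12, day=31)
mux = pd.MultiIndex.from_product([pd.date_range(start, end), [10,22]],
                                  names=['datetime_doy','h'])

# Average per day and label, add the missing combinations as NaN,
# then rebuild hourtime as day + 10 or 22 hours.
df = df.groupby([df.index, 'hourtime'])['sig'].mean().reindex(mux).reset_index(level=1)
df['hourtime'] = df.index + pd.to_timedelta(df.pop('h'), unit='h')
print(df)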
                   sig            hourtime
datetime_doy                              
2007-01-01         NaN 2007-01-01 10:00:00
2007-01-01         NaN 2007-01-01 22:00:00
2007-01-02   -8.963545 2007-01-02 10:00:00
2007-01-02         NaN 2007-01-02 22:00:00
2007-01-03   -8.671357 2007-01-03 10:00:00
               ...                 ...
2019-12-29         NaN 2019-12-29 22:00:00
2019-12-30         NaN 2019-12-30 10:00:00
2019-12-30   -8.532572 2019-12-30 22:00:00
2019-12-31   -8.345356 2019-12-31 10:00:00
2019-12-31         NaN 2019-12-31 22:00:00

[9496 rows x 2 columns]
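For anyone who wants to try this without the full dataset, here is a minimal self-contained sketch that applies the same steps to the last five rows shown in the question. The names sample, slot, means and out are only illustrative. It assumes include_lowest=True in the cut, and it builds the day range from the first to the last day present in the sample rather than from whole calendar years.

```python
import pandas as pd

# The last five rows from the question; the index holds the calendar day.
hourtime = pd.to_datetime([
    "2019-12-30 20:23:01.874996",
    "2019-12-30 20:48:28.125016",
    "2019-12-30 21:23:44.999996",
    "2019-12-31 09:39:31.873999",
    "2019-12-31 10:18:58.125014",
])
sample = pd.DataFrame(
    {"sig": [-8.529724, -8.563781, -8.504211, -8.460620, -8.230092],
     "hourtime": hourtime},
    index=pd.DatetimeIndex(hourtime.normalize(), name="datetime_doy"),
)

# Label each observation 10 (hours 0-12) or 22 (hours 13-23).
slot = pd.cut(sample["hourtime"].dt.hour, bins=[0, 12, 23], labels=[10, 22],
              include_lowest=True).astype(int).rename("h")

# Mean of sig per day and slot.
means = sample.groupby([sample.index, slot])["sig"].mean()

# Assumption: only span the days present in the sample, not full years.
days = pd.date_range(sample.index.min(), sample.index.max(), freq="D")
mux = pd.MultiIndex.from_product([days, [10, 22]], names=["datetime_doy", "h"])

# Missing day/slot combinations become NaN; then rebuild hourtime.
out = means.reindex(mux).reset_index(level="h")
out["hourtime"] = out.index + pd.to_timedelta(out.pop("h"), unit="h")
print(out)
```

The three evening values on 2019-12-30 collapse into one averaged row, the two morning values on 2019-12-31 into another, and the two empty slots come back as NaN.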
jezrael