I'd like to analyze some daily data by hydrologic year: From 1 September to 31 August. I've created a synthetic data set with:
import pandas as pd
t = pd.date_range(start='2015-01-01', freq='D', end='2021-09-03')
df = pd.DataFrame(index = t)
df['hydro_year'] = df.index.year
df['hydro_year'].loc[df.index.month >= 9] += 1
df['id'] = df['hydro_year'] - df.index.year[0]
df['count'] = 1
Note that in reality I do not have a hydro_year
column so I do not use groupby
. I would expect the following to resample by hydrologic year:
print(df['2015-09-01':].resample('12M').agg({'hydro_year':'mean','id':'mean','count':'sum'}))
But the output does not align:
| | hydro_year | id | count |
|---------------------+------------+---------+-------|
| 2015-09-30 00:00:00 | 2016 | 1 | 30 |
| 2016-09-30 00:00:00 | 2016.08 | 1.08197 | 366 |
| 2017-09-30 00:00:00 | 2017.08 | 2.08219 | 365 |
| 2018-09-30 00:00:00 | 2018.08 | 3.08219 | 365 |
| 2019-09-30 00:00:00 | 2019.08 | 4.08219 | 365 |
| 2020-09-30 00:00:00 | 2020.08 | 5.08197 | 366 |
| 2021-09-30 00:00:00 | 2021.01 | 6.00888 | 338 |
However, if I start a day earlier, then things do align, except the first day is 'early' and dangling alone...
| | hydro_year | id | count |
|---------------------+------------+----+-------|
| 2015-08-31 00:00:00 | 2015 | 0 | 1 |
| 2016-08-31 00:00:00 | 2016 | 1 | 366 |
| 2017-08-31 00:00:00 | 2017 | 2 | 365 |
| 2018-08-31 00:00:00 | 2018 | 3 | 365 |
| 2019-08-31 00:00:00 | 2019 | 4 | 365 |
| 2020-08-31 00:00:00 | 2020 | 5 | 366 |
| 2021-08-31 00:00:00 | 2021 | 6 | 365 |
| 2022-08-31 00:00:00 | 2022 | 7 | 3 |