How to find how the values have changed from the begining of their period?

Question

I have a dataframe with timeindex. There is a period (cycle) that starts every august. I want to calculate the difference between the value of each month and the value that was on the previous august (the beginning of its period). The aim is to know how much the values have changed during each cycle.

This is an small example of the data:

import pandas as pd
import numpy as np

df = pd.DataFrame({'date': ['2022-01-01 00:00:00', '2022-02-01 00:00:00', 
                          '2021-03-01 00:00:00', '2021-04-01 00:00:00',
                          '2021-05-01 00:00:00', '2021-06-01 00:00:00',
                          '2021-07-01 00:00:00', '2021-08-01 00:00:00',
                          '2021-09-01 00:00:00', '2021-10-01 00:00:00',
                          '2021-11-01 00:00:00', '2021-12-01 00:00:00',
                          '2022-01-01 00:00:00', '2022-02-01 00:00:00',
                          '2022-03-01 00:00:00', '2022-04-01 00:00:00',
                          '2022-05-01 00:00:00', '2022-06-01 00:00:00',
                          '2022-07-01 00:00:00', '2022-08-01 00:00:00',
                          '2022-09-01 00:00:00', '2022-10-01 00:00:00',
                          '2022-11-01 00:00:00', '2023-12-01 00:00:00',
                          '2023-01-01 00:00:00', '2023-02-01 00:00:00',
                          '2023-03-01 00:00:00', '2023-04-01 00:00:00',
                          '2023-05-01 00:00:00', '2023-06-01 00:00:00',
                          ], 
               'value1': [0.452762281,0.372262281,0.513928948,0.447762281,
                            0.377095615,0.355095615,0.271428948,0.291762281,
                            0.476762281,0.335928948,0.280428948,0.283762281,
                            0.322928948,0.287262281,0.316928948,0.209262281,
                            0.407928948,0.254262281,0.232095615,0.264262281,
                            0.076095615,-0.025237719,-0.042237719,-0.094904385,
                            0.017428948,-0.036071052,-0.094071052,-0.071404385,
                            0.008095615,-0.141571052],
               'value2': [9.6,8,8.4,6.2,6.2,6,3.9,8.5,8.3,5.3,5.6,5.3,
                               6.2,6.3,6.9,4.8,6.7,3.6,3,4.6,2.3,1.3,1,0.3,
                               1.6,0.4,1.5,1.4,2.2,1.2]})
df = df.set_index('date')

I have tried to get a dataframe with the same structure but all the values have the value that was on its corresponding august, and then I just need to subtract one dataframe from the other. But I couldn't find a way to do it neither.

Thank you vary much in advance.

score 0 · Accepted Answer · answered Jun 12 '23 at 09:48

If I understand you correctly, you can use .groupby() and then calculate the difference in each group:

df.index = pd.to_datetime(df.index)

tmp = (df.index.month == 8).cumsum()
out = (
    df[tmp != 0]
    .groupby(tmp[tmp > 0], group_keys=False)
    .apply(lambda x: x[["value1", "value2"]] - x[["value1", "value2"]].iloc[0])
).add_prefix('calculated_')

x = df.join(out)
print(x)

Prints:

              value1  value2  calculated_value1  calculated_value2
date                                                              
2021-03-01  0.513929     8.4                NaN                NaN
2021-04-01  0.447762     6.2                NaN                NaN
2021-05-01  0.377096     6.2                NaN                NaN
2021-06-01  0.355096     6.0                NaN                NaN
2021-07-01  0.271429     3.9                NaN                NaN
2021-08-01  0.291762     8.5           0.000000                0.0
2021-09-01  0.476762     8.3           0.185000               -0.2
2021-10-01  0.335929     5.3           0.044167               -3.2
2021-11-01  0.280429     5.6          -0.011333               -2.9
2021-12-01  0.283762     5.3          -0.008000               -3.2
2022-01-01  0.452762     9.6           0.031167               -2.3
2022-01-01  0.322929     6.2           0.031167               -2.3
2022-02-01  0.372262     8.0          -0.004500               -2.2
2022-02-01  0.287262     6.3          -0.004500               -2.2
2022-03-01  0.316929     6.9           0.025167               -1.6
2022-04-01  0.209262     4.8          -0.082500               -3.7
2022-05-01  0.407929     6.7           0.116167               -1.8
2022-06-01  0.254262     3.6          -0.037500               -4.9
2022-07-01  0.232096     3.0          -0.059667               -5.5
2022-08-01  0.264262     4.6           0.000000                0.0
2022-09-01  0.076096     2.3          -0.188167               -2.3
2022-10-01 -0.025238     1.3          -0.289500               -3.3
2022-11-01 -0.042238     1.0          -0.306500               -3.6
2023-01-01  0.017429     1.6          -0.246833               -3.0
2023-02-01 -0.036071     0.4          -0.300333               -4.2
2023-03-01 -0.094071     1.5          -0.358333               -3.1
2023-04-01 -0.071404     1.4          -0.335667               -3.2
2023-05-01  0.008096     2.2          -0.256167               -2.4
2023-06-01 -0.141571     1.2          -0.405833               -3.4
2023-12-01 -0.094904     0.3          -0.359167               -4.3

That works perfectly, thank you! – Ferran Jun 12 '23 at 14:55 — Ferran, Jun 12 '23 at 14:55

score 0 · Answer 2 · answered Jun 12 '23 at 09:56

There are several ways to solve your issues. The solution depends on how your data might looks like

There are some flaws in your code too. Here is my suggestion

You don't need to set_index('date') here.

Here is:

df['date'] = pd.to_datetime(df['date'])
df.sort_values(by=['date'], ignore_index=True, inplace=True)

The first line is to convert your date column to datetime format, and the second is to sort your data in ascending order.

The most common way is to extract subsample with august only then merge those value back to your main dataframe and perform the calculation. This might be long, but it will shows any potential error if there is in your data

The complete solution:

import pandas as pd
import numpy as np
from dateutil.relativedelta import *
from pandas.tseries.offsets import *

df = pd.DataFrame({'date': ['2022-01-01 00:00:00', '2022-02-01 00:00:00', 
                          '2021-03-01 00:00:00', '2021-04-01 00:00:00',
                          '2021-05-01 00:00:00', '2021-06-01 00:00:00',
                          '2021-07-01 00:00:00', '2021-08-01 00:00:00',
                          '2021-09-01 00:00:00', '2021-10-01 00:00:00',
                          '2021-11-01 00:00:00', '2021-12-01 00:00:00',
                          '2022-01-01 00:00:00', '2022-02-01 00:00:00',
                          '2022-03-01 00:00:00', '2022-04-01 00:00:00',
                          '2022-05-01 00:00:00', '2022-06-01 00:00:00',
                          '2022-07-01 00:00:00', '2022-08-01 00:00:00',
                          '2022-09-01 00:00:00', '2022-10-01 00:00:00',
                          '2022-11-01 00:00:00', '2023-12-01 00:00:00',
                          '2023-01-01 00:00:00', '2023-02-01 00:00:00',
                          '2023-03-01 00:00:00', '2023-04-01 00:00:00',
                          '2023-05-01 00:00:00', '2023-06-01 00:00:00',
                          ], 
               'value1': [0.452762281,0.372262281,0.513928948,0.447762281,
                            0.377095615,0.355095615,0.271428948,0.291762281,
                            0.476762281,0.335928948,0.280428948,0.283762281,
                            0.322928948,0.287262281,0.316928948,0.209262281,
                            0.407928948,0.254262281,0.232095615,0.264262281,
                            0.076095615,-0.025237719,-0.042237719,-0.094904385,
                            0.017428948,-0.036071052,-0.094071052,-0.071404385,
                            0.008095615,-0.141571052],
               'value2': [9.6,8,8.4,6.2,6.2,6,3.9,8.5,8.3,5.3,5.6,5.3,
                               6.2,6.3,6.9,4.8,6.7,3.6,3,4.6,2.3,1.3,1,0.3,
                               1.6,0.4,1.5,1.4,2.2,1.2]})

# cleaning data
df['date'] = pd.to_datetime(df['date'])
df.sort_values(by=['date'], ignore_index=True, inplace=True)

# creat sub dataframe with august value only
aug = df.loc[df['date'].dt.month==8].copy()
aug.rename(columns={'value1': 'augval1', 'value2': 'augval2', 'date':'date_aug'}, inplace=True)

# create equivalent august date for merge
df['date_aug'] = np.where(df['date'].dt.month < 8, df['date'] + YearBegin(-1) + MonthBegin(7), df['date'] + YearBegin(1) + MonthBegin(7))

# merge aug dataframe to df dataframe
new = pd.merge(df, aug, on='date_aug', how='left')

# perform calculation
new['chg_val1'] = new['value1'].diff()
new['chg_aug'] = new['value1'] - new['augval1']

How to find how the values have changed from the begining of their period?

2 Answers2