0

I have the following dataframe:

Id, country, date
1, ar, 2019-01-01
1, ar, , 2019-02-01
1, ar, 2019-03-01
1, it, , 2019-01-01
1, it, , 2019-02-01
1, it, 2019-03-01
1, it, , 2019-04-01
1, it, 2019-03-01
2, ar, 2019-01-01
2, ar, , 2019-02-01
2, ar, 2019-03-01
2, it, , 2019-01-01
2, it, , 2019-02-01
3, it, 2019-03-01
3, it, , 2019-04-01
4, it, 2019-05-01

I need to group by Id, country and calculate the difference beetwen the dates (in months) for each group.

I tried:

df['daysdiff'] = df.sort_values('date').groupby(['id','country'])['date'].diff()

But it get the diff in days. I need the diff in months. I think dividing 'daysdiff' by 30 it's not accurate because of months have different numbers of days...and leap-years...

Any help is welcome!

Mato
  • 45
  • 4

1 Answers1

0

I adapted this approach to your case.

Basically you have to deal with NaT values. I've chosen to treat them as 0.

And round the month to a integer, if you desire so.

As in your sample, there is a duplicated row: "1", "it", "2019-03-01"

This row results in 7, 1, it, 2019-03-01, 0 days, 0 (because it was considered sorted unique rows as input)

For this case, it seems to work, though I 've not tested in another ones.

import pandas as pd

df = pd.DataFrame(columns=["id", "country", "date"]
    , data=[
    ["1", "ar", "2019-01-01"],
    ["1", "ar", "2019-02-01"],
    ["1", "ar", "2019-03-01"],
    ["1", "it", "2019-01-01"],
    ["1", "it", "2019-02-01"],
    ["1", "it", "2019-03-01"],
    ["1", "it", "2019-04-01"],
    ["1", "it", "2019-03-01"],
    ["2", "ar", "2019-01-01"],
    ["2", "ar", "2019-02-01"],
    ["2", "ar", "2019-03-01"],
    ["2", "it", "2019-01-01"],
    ["2", "it", "2019-02-01"],
    ["3", "it", "2019-03-01"],
    ["3", "it", "2019-04-01"],
    ["4", "it", "2019-05-01"]
])
df["date"] = pd.to_datetime(df["date"])

df['daysdiff'] = df.sort_values('date').groupby(['id','country'])['date'].diff()
df['monthsdiff'] = (
    df
    .sort_values('date')
    .groupby(['id','country'])['date']
    .diff()
    # 365.25 [days/year] / (12 [months/year]) = 30.4375 [days/month]
    .div(pd.Timedelta(days=365.25/12), fill_value="0")
    .round()
    .astype(int)
    )
print(df)
#    id country       date daysdiff  monthsdiff
# 0   1      ar 2019-01-01      NaT           0
# 1   1      ar 2019-02-01  31 days           1
# 2   1      ar 2019-03-01  28 days           1
# 3   1      it 2019-01-01      NaT           0
# 4   1      it 2019-02-01  31 days           1
# 5   1      it 2019-03-01  28 days           1
# 6   1      it 2019-04-01  31 days           1
# 7   1      it 2019-03-01   0 days           0
# 8   2      ar 2019-01-01      NaT           0
# 9   2      ar 2019-02-01  31 days           1
# 10  2      ar 2019-03-01  28 days           1
# 11  2      it 2019-01-01      NaT           0
# 12  2      it 2019-02-01  31 days           1
# 13  3      it 2019-03-01      NaT           0
# 14  3      it 2019-04-01  31 days           1
# 15  4      it 2019-05-01      NaT           0
viniciusrf1992
  • 313
  • 1
  • 7
  • Thanks for the help! Running your code, I get the following error: `UFuncTypeError: ufunc 'true_divide' cannot use operands with types dtype('O') and dtype(' – Mato Feb 17 '21 at 18:02