2

I would like to subtract two date columns and get their difference in months unit instead of days but haven't been able to.

Data:

import polars as pl
from datetime import datetime

test_df = pl.DataFrame(
    {
        "dt_end": [datetime(2022, 1, 1).date(), datetime(2022, 1, 2).date()],
        "dt_start": [datetime(2021, 5, 7).date(), datetime(2020, 7, 8).date()],
    }
)

This gives difference in Days

test_df.with_columns(
    diff_months = pl.col('dt_end') - pl.col('dt_start')
)

output:

dt_end  dt_start    diff_months
date    date    duration[ms]
2022-01-01  2021-05-07  239d
2022-01-02  2020-07-08  543d

I have tried below code for months but that doesn't work

test_df.with_columns(
    diff_months = pl.duration(months = pl.col('dt_end') - pl.col('dt_start'))
)

UPDATE

Have also tried below code but that also gives me a different value from the excel.

test_df.with_columns(
    diff_months = (pl.col('dt_end').dt.year() - pl.col('dt_start').dt.year()) * 12 +
                    (pl.col('dt_end').dt.month().cast(pl.Int32) - pl.col('dt_start').dt.month().cast(pl.Int32)) 
)

output:

dt_end  dt_start    diff_months
date    date    i32
2022-01-01  2021-05-07  8
2022-01-02  2020-07-08  18

UPDATE 2:

I think this is the manual way of doing it by using 30.42 as the avg number of days in month

test_df.with_columns(
    diff_months = pl.col('dt_end') - pl.col('dt_start')
).with_columns(
    diff_months = (pl.col('diff_months').dt.days()/30.42).floor().cast(pl.Int8)
)

output

dt_end  dt_start    diff_months
date    date    i8
2022-01-01  2021-05-07  7
2022-01-02  2020-07-08  17

UPDATE 3: Snapshot of actual data with desired Reference excel value

# Data
snapshot_df = pl.DataFrame(
    {
        "Reporting Month": [datetime(2000, 7, 20).date(), datetime(2000, 8, 20).date(), datetime(2000, 9, 20).date()],
        "Origination date": [datetime(1999, 12, 19).date(), datetime(1999, 12, 19).date(), datetime(1999, 12, 19).date()],
        "Excel_Reference_Age": [7,8,9]
    }
)

comparison

snapshot_df.with_columns(
    diff_months = pl.col('Reporting Month') - pl.col('Origination date')
).with_columns(
    diff_months = (pl.col('diff_months').dt.days()/30.42).floor().cast(pl.Int16),

    diff_months_1 = (pl.col('Reporting Month').dt.year() - pl.col('Origination date').dt.year()) * 12 +
                    (pl.col('Reporting Month').dt.month().cast(pl.Int32) - pl.col('Origination date').dt.month().cast(pl.Int32)) ,

    diff_months_2 = pl.date_range(pl.col("Origination date"), pl.col("Reporting Month"), "1mo").list.lengths()
).select('Excel_Reference_Age','diff_months','diff_months_1','diff_months_2')

output

Excel_Reference_Age diff_months diff_months_1   diff_months_2
i64 i16 i32 u32
7   7   7   8
8   8   8   9
9   9   9   10

ViSa
  • 1,563
  • 8
  • 30
  • 1
    Can you try this `test_df.with_columns(diff_months=pl.date_range(pl.col("dt_start"), pl.col("dt_end"), "1mo").list.lengths())` – Abdul Niyas P M Jul 13 '23 at 10:11
  • @AbdulNiyasPM this seems to work but getting a different value from what I get in `excel`. In excel `difference` I get is `7,17` but in `polars` using ur code I am getting `8,18` – ViSa Jul 13 '23 at 10:21
  • @AbdulNiyasPM I also tried your approach on dataset of `Row count: 761403` and it came out to be 10-15 times slower so imho following would be a better option: `test_df.with_columns( diff_months = pl.col('dt_end') - pl.col('dt_start') ).with_columns( diff_months = (pl.col('diff_months').dt.days()/30.42).floor().cast(pl.Int16) )` – ViSa Jul 13 '23 at 11:56
  • Here's a series of semi rhetorical question, how many months are between 2/1/2023 and 3/1/2023? what about 2/1/2024 and 3/1/2024? and 7/1/2023 and 8/1/2023? Are they all one? By your update 2, they're 0.92, 0.95, and 1.02 but you're also taking the floor so then it's 0,0,1. Are you trying to get the number of times the month changed so for instance 12/31/2022 to 1/2/2023 is 1 as would be 12/15/2022 to 1/20/2023? – Dean MacGregor Jul 13 '23 at 14:31
  • @DeanMacGregor I have fanniemae Monthly **Loan Data** which has `Loan Origination` & `Reporting Month` date as - (19-12-1999, 20-07-2000) respectively and I need to use `Loan Age`. **Excel** `DATEDIF()` Formula gives me `7` as the difference which is the `Loan Age` and I am trying to stay in **sync** with excel as I don't want to cause two different calculations when using two different tools. Another example is - (19-12-1999, 20-08-2000) and so on and I am getting `8` for this. I will update the code & snapshot in `UPDATE 3` for reference – ViSa Jul 13 '23 at 16:22

1 Answers1

1

Your update #1 is just fine, you just also need to adjust by 1 if dt.days for end has not reached start:

test_df.with_columns(
    diff_months=12 * (pl.col('dt_end').dt.year() - pl.col('dt_start').dt.year())
    + (
        pl.col('dt_end').dt.month().cast(pl.Int32)
        - pl.col('dt_start').dt.month().cast(pl.Int32)
    )
    - (pl.col('dt_end').dt.day() < pl.col('dt_start').dt.day()).cast(pl.Int32)
)
Wayoshi
  • 1,688
  • 1
  • 7