3

How would I add 1 year to a column?

I've tried using map and apply but I failed miserably.

I also wonder why pl.date() accepts integers while it advertises that it only accepts str or pli.Expr.

A small hack workaround is:

col = pl.col('date').dt
df = df.with_column(pl.when(pl.col(column).is_not_null())
                    .then(pl.date(col.year() + 1, col.month(), col.day()))
                    .otherwise(pl.date(col.year() + 1,col.month(), col.day()))
                    .alias("date"))

but this won't work for months or days. I can't just add a number or I'll get a:

> thread 'thread '<unnamed>' panicked at 'invalid or out-of-range date<unnamed>',
         ' panicked at '/github/home/.cargo/registry/src/github.com-1ecc6299db9ec823/chrono-0.4.19/src/naive/date.rsinvalid or out-of-range date:', 173:/github/home/.cargo/registry/src/github.com-1ecc6299db9ec823/chrono-0.4.19/src/naive/date.rs51
:note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Most likely because day and month cycle while year goes to infinity.

I could also do this:

df = df.with_column(
        pl.when(col.month() == 1)
        .then(pl.date(col.year(), 2, col.day()))
        .when(col.month() == 2)
        .then(pl.date(col.year(), 3, col.day()))
        .when(col.month() == 3)
        .then(pl.date(col.year(), 4, col.day()))
        .when(col.month() == 4)
        .then(pl.date(col.year(), 5, col.day()))
        .when(col.month() == 5)
        .then(pl.date(col.year(), 6, col.day()))
        .when(col.month() == 6)
        .then(pl.date(col.year(), 7, col.day()))
        .when(col.month() == 7)
        .then(pl.date(col.year(), 8, col.day()))
        .when(col.month() == 8)
        .then(pl.date(col.year(), 9, col.day()))
        .when(col.month() == 9)
        .then(pl.date(col.year(), 10, col.day()))
        .when(col.month() == 10)
        .then(pl.date(col.year(), 11, col.day()))
        .when(col.month() == 11)
        .then(pl.date(col.year(), 12, col.day()))
        .otherwise(pl.date(col.year() + 1, 1, 1))
        .alias("valid_from")
    )
BeRT2me
  • 12,699
  • 2
  • 13
  • 31
supersick
  • 261
  • 2
  • 14
  • Can you try with UDFs? https://sparkbyexamples.com/pyspark/pyspark-udf-user-defined-function/amp/ – Nandha Jun 25 '22 at 01:02

2 Answers2

3

Polars allows to do addition and subtraction with python's timedelta objects. However above week units things get a bit more complicated as we have to take different days of the month and leap years into account.

For this polars has offset_by under the dt namespace.

(pl.DataFrame({
    "dates": pl.date_range(datetime(2000, 1, 1), datetime(2026, 1, 1), "1y")
}).with_columns([
    pl.col("dates").dt.offset_by("1y").alias("dates_and_1_yr")
]))
shape: (27, 2)
┌─────────────────────┬─────────────────────┐
│ dates               ┆ dates_and_1_yr      │
│ ---                 ┆ ---                 │
│ datetime[ns]        ┆ datetime[ns]        │
╞═════════════════════╪═════════════════════╡
│ 2000-01-01 00:00:00 ┆ 2001-01-01 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2001-01-01 00:00:00 ┆ 2002-01-01 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2002-01-01 00:00:00 ┆ 2003-01-01 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2003-01-01 00:00:00 ┆ 2004-01-01 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ...                 ┆ ...                 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2023-01-01 00:00:00 ┆ 2024-01-01 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2024-01-01 00:00:00 ┆ 2025-01-01 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2025-01-01 00:00:00 ┆ 2026-01-01 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2026-01-01 00:00:00 ┆ 2027-01-01 00:00:00 │
└─────────────────────┴─────────────────────┘

ritchie46
  • 10,405
  • 1
  • 24
  • 43
  • Which solution would be faster/better? With the apply method mentioned by @datenzauber.ai i don't have to calculate the leap year since i am adding years and not days. Is there any noticeable difference? If i wanted to do the same with 1 month instead of 1 year, datenzaubers solution will be easy because i will always add 1 month with 1 line of code, but with your solution i would have to calculate how many days the given month has first. Is this correct? – supersick Jun 25 '22 at 11:21
  • 2
    Using polars expressions is always tons faster than custom python. I benchmarked it locally to be at least 300x faster. – ritchie46 Jun 25 '22 at 11:41
1

You can use polars.apply and dateutil.relativedelta which works for years, months, days and much more, but can be slow for lots of data.

from datetime import date
from dateutil.relativedelta import relativedelta

df = pl.DataFrame(pl.date_range(date(2019, 1, 1), date(2020, 10, 1), '3mo', name='date'))
df.with_column(pl.col('date').apply(lambda x: x + relativedelta(years=1)))

Update: Since the offset_by method is now also available for months, it should be used whenever possible (see accepted answer). I leave this answer here because the approach can be used for more complicated cases that are not supported by offset_by.

datenzauber.ai
  • 379
  • 2
  • 11