Repeating a date in polars and exploding it

Question

I have a polars dataframe with two date columns that represent a start and end date and then a value that I want to repeat for all dates in between those two dates so that I can join those on other tables.

Example input is

id	start	end	value
123	2022-01-01	2022-01-04	10
abc	2022-03-04	2022-03-04	3
456	2022-05-11	2022-05-16	4

and expected output is

id	date	value
123	2022-01-01	10
123	2022-01-02	10
123	2022-01-03	10
123	2022-01-04	10
abc	2022-03-04	3
456	2022-05-11	4
456	2022-05-12	4
456	2022-05-13	4
456	2022-05-14	4
456	2022-05-15	4
456	2022-05-16	4

What exactly is "a polars dataframe"? – DYZ Jul 29 '22 at 03:00 — DYZ, Jul 29 '22 at 03:00
What have you tried? This is not hard. – Tim Roberts Jul 29 '22 at 03:01 — Tim Roberts, Jul 29 '22 at 03:01

score 1 · Accepted Answer · answered Jul 29 '22 at 13:44

I struggled today with the same problem and I thought I share my solution.

As cbilot already mentions pl.dat_range doesn't take expressions as low and high value. So I worked around by using apply.

Data:

import polars as pl
from datetime import date

df = pl.DataFrame(
    {
        "id": ["123", "abc", "456"],
        "start": [date(2022, 1, 1), date(2022, 3, 4), date(2022, 5, 11)],
        "end": [date(2022, 1, 4), date(2022, 3, 4), date(2022, 5, 16)],
        "value": [10, 3, 4],
    }
)

Solution:

(
    df.with_columns(
        [(pl.struct(["start", "end"])
            .apply(lambda x: pl.date_range(x["start"], x["end"], "1d"))
            .alias("date"))])
    .explode(pl.col("date"))
    .select(["id", "date", "value"])
)


shape: (11, 3)
┌─────┬────────────┬───────┐
│ id  ┆ date       ┆ value │
│ --- ┆ ---        ┆ ---   │
│ str ┆ date       ┆ i64   │
╞═════╪════════════╪═══════╡
│ 123 ┆ 2022-01-01 ┆ 10    │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 123 ┆ 2022-01-02 ┆ 10    │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 123 ┆ 2022-01-03 ┆ 10    │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 123 ┆ 2022-01-04 ┆ 10    │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ ... ┆ ...        ┆ ...   │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 456 ┆ 2022-05-13 ┆ 4     │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 456 ┆ 2022-05-14 ┆ 4     │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 456 ┆ 2022-05-15 ┆ 4     │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 456 ┆ 2022-05-16 ┆ 4     │
└─────┴────────────┴───────┘

What I really like about your solution is that it will work for durations other than 1 day. — , Jul 29 '22 at 15:55
Thanks! This works perfectly, is super elegant and easily extensible. Was trying to do the `date_range` and `explode` but didn't figure out the struct. — slugmagug, Jul 30 '22 at 03:31

score 1 · Answer 2 · answered Mar 06 '23 at 12:58

1

I managed to implement it without using apply:

df = pl.DataFrame(
{
    "id": ["123", "abc", "456"],
    "start": [date(2022, 1, 1), date(2022, 3, 4), date(2022, 5, 11)],
    "end": [date(2022, 1, 4), date(2022, 3, 4), date(2022, 5, 16)],
    "value": [10, 3, 4],
})

df = df.with_columns(
        [pl.date_range(low=pl.col('start'), high=pl.col('end'), interval='1d')
        .alias('date')])\
    .explode('date')\
    .select(["id", "date", "value"])

answered Mar 06 '23 at 12:58

aSel80

11
2

Nice answer! To avoid a deprecation warning this should use [pl.date_ranges](https://pola-rs.github.io/polars/py-polars/html/reference/expressions/api/polars.date_ranges.html) these days. – Yourstruly Aug 29 '23 at 18:06

score 0 · Answer 3 · answered Jul 29 '22 at 04:26

Starting with this data:

import polars as pl
from datetime import date

df = pl.DataFrame(
    {
        "id": ["123", "abc", "456"],
        "start": [date(2022, 1, 1), date(2022, 3, 4), date(2022, 5, 11)],
        "end": [date(2022, 1, 4), date(2022, 3, 4), date(2022, 5, 16)],
        "value": [10, 3, 4],
    }
)
df

shape: (3, 4)
┌─────┬────────────┬────────────┬───────┐
│ id  ┆ start      ┆ end        ┆ value │
│ --- ┆ ---        ┆ ---        ┆ ---   │
│ str ┆ date       ┆ date       ┆ i64   │
╞═════╪════════════╪════════════╪═══════╡
│ 123 ┆ 2022-01-01 ┆ 2022-01-04 ┆ 10    │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ abc ┆ 2022-03-04 ┆ 2022-03-04 ┆ 3     │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 456 ┆ 2022-05-11 ┆ 2022-05-16 ┆ 4     │
└─────┴────────────┴────────────┴───────┘

The Algorithm

(
    df.with_columns(
        [pl.arange(pl.col("start"), pl.col("end") + 1).alias("date")])
    .explode("date")
    .with_column(pl.col("date").cast(pl.Date))
    .select(["id", "date", "value"])
)

shape: (11, 3)
┌─────┬────────────┬───────┐
│ id  ┆ date       ┆ value │
│ --- ┆ ---        ┆ ---   │
│ str ┆ date       ┆ i64   │
╞═════╪════════════╪═══════╡
│ 123 ┆ 2022-01-01 ┆ 10    │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 123 ┆ 2022-01-02 ┆ 10    │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 123 ┆ 2022-01-03 ┆ 10    │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 123 ┆ 2022-01-04 ┆ 10    │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ abc ┆ 2022-03-04 ┆ 3     │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 456 ┆ 2022-05-11 ┆ 4     │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 456 ┆ 2022-05-12 ┆ 4     │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 456 ┆ 2022-05-13 ┆ 4     │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 456 ┆ 2022-05-14 ┆ 4     │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 456 ┆ 2022-05-15 ┆ 4     │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 456 ┆ 2022-05-16 ┆ 4     │
└─────┴────────────┴───────┘

In steps

Normally, we create a range of dates using the date_range expression. However, date_range does not take an Expression as its low and high parameters.

However, arange does allow Expressions as its low and high parameters. We can (implicitly) cast the start and end dates to integers, which represent the number of days since the UNIX epoch.

The result is a list of integers which represents the days between (and including) the start and end dates (expressed as days since the UNIX epoch)..

Notice that we have to add 1 to the high parameter to make sure we capture the end date.

(
    df.with_columns(
        [pl.arange(pl.col("start"), pl.col("end") + 1).alias("date")])
)

shape: (3, 5)
┌─────┬────────────┬────────────┬───────┬───────────────────────────┐
│ id  ┆ start      ┆ end        ┆ value ┆ date                      │
│ --- ┆ ---        ┆ ---        ┆ ---   ┆ ---                       │
│ str ┆ date       ┆ date       ┆ i64   ┆ list[i64]                 │
╞═════╪════════════╪════════════╪═══════╪═══════════════════════════╡
│ 123 ┆ 2022-01-01 ┆ 2022-01-04 ┆ 10    ┆ [18993, 18994, ... 18996] │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ abc ┆ 2022-03-04 ┆ 2022-03-04 ┆ 3     ┆ [19055]                   │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 456 ┆ 2022-05-11 ┆ 2022-05-16 ┆ 4     ┆ [19123, 19124, ... 19128] │
└─────┴────────────┴────────────┴───────┴───────────────────────────┘

Next we can use explode to place each of the integers on a separate row.

(
    df.with_columns(
        [pl.arange(pl.col("start"), pl.col("end") + 1).alias("date")])
    .explode("date")
)

shape: (11, 5)
┌─────┬────────────┬────────────┬───────┬───────┐
│ id  ┆ start      ┆ end        ┆ value ┆ date  │
│ --- ┆ ---        ┆ ---        ┆ ---   ┆ ---   │
│ str ┆ date       ┆ date       ┆ i64   ┆ i64   │
╞═════╪════════════╪════════════╪═══════╪═══════╡
│ 123 ┆ 2022-01-01 ┆ 2022-01-04 ┆ 10    ┆ 18993 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 123 ┆ 2022-01-01 ┆ 2022-01-04 ┆ 10    ┆ 18994 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 123 ┆ 2022-01-01 ┆ 2022-01-04 ┆ 10    ┆ 18995 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 123 ┆ 2022-01-01 ┆ 2022-01-04 ┆ 10    ┆ 18996 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ abc ┆ 2022-03-04 ┆ 2022-03-04 ┆ 3     ┆ 19055 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 456 ┆ 2022-05-11 ┆ 2022-05-16 ┆ 4     ┆ 19123 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 456 ┆ 2022-05-11 ┆ 2022-05-16 ┆ 4     ┆ 19124 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 456 ┆ 2022-05-11 ┆ 2022-05-16 ┆ 4     ┆ 19125 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 456 ┆ 2022-05-11 ┆ 2022-05-16 ┆ 4     ┆ 19126 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 456 ┆ 2022-05-11 ┆ 2022-05-16 ┆ 4     ┆ 19127 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 456 ┆ 2022-05-11 ┆ 2022-05-16 ┆ 4     ┆ 19128 │
└─────┴────────────┴────────────┴───────┴───────┘

The final step is to cast the date column back to a pl.Date, and then select only the columns that we want.

(
    df.with_columns(
        [pl.arange(pl.col("start"), pl.col("end") + 1).alias("date")])
    .explode("date")
    .with_column(pl.col("date").cast(pl.Date))
    .select(["id", "date", "value"])
)

shape: (11, 3)
┌─────┬────────────┬───────┐
│ id  ┆ date       ┆ value │
│ --- ┆ ---        ┆ ---   │
│ str ┆ date       ┆ i64   │
╞═════╪════════════╪═══════╡
│ 123 ┆ 2022-01-01 ┆ 10    │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 123 ┆ 2022-01-02 ┆ 10    │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 123 ┆ 2022-01-03 ┆ 10    │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 123 ┆ 2022-01-04 ┆ 10    │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ abc ┆ 2022-03-04 ┆ 3     │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 456 ┆ 2022-05-11 ┆ 4     │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 456 ┆ 2022-05-12 ┆ 4     │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 456 ┆ 2022-05-13 ┆ 4     │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 456 ┆ 2022-05-14 ┆ 4     │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 456 ┆ 2022-05-15 ┆ 4     │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 456 ┆ 2022-05-16 ┆ 4     │
└─────┴────────────┴───────┘

Repeating a date in polars and exploding it

3 Answers3

The Algorithm

In steps