
I have the following Python code using pandas:

df['EVENT_DATE'] = df.apply(
        lambda row: datetime.date(year=row.iyear, month=row.imonth, day=row.iday).strftime("%Y-%m-%d"), axis=1)

and want to transform it into valid Polars code. Does anyone have an idea how to solve this?

seb2704

2 Answers


I will also answer your generic question and not only your specific use case.

For your specific case, as of Polars version >= 0.10.18, the recommended way to create what you want is the pl.date or pl.datetime expression.

Given this dataframe, pl.date is used to format the date as requested.

import polars as pl

df = pl.DataFrame({
    "iyear": [2001, 2001],
    "imonth": [1, 2],
    "iday": [1, 1]
})


df.with_columns([
    pl.date("iyear", "imonth", "iday").dt.strftime("%Y-%m-%d").alias("fmt")
])

This outputs:

shape: (2, 4)
┌───────┬────────┬──────┬────────────┐
│ iyear ┆ imonth ┆ iday ┆ fmt        │
│ ---   ┆ ---    ┆ ---  ┆ ---        │
│ i64   ┆ i64    ┆ i64  ┆ str        │
╞═══════╪════════╪══════╪════════════╡
│ 2001  ┆ 1      ┆ 1    ┆ 2001-01-01 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2001  ┆ 2      ┆ 1    ┆ 2001-02-01 │
└───────┴────────┴──────┴────────────┘
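
If you would rather keep the result as a proper Date dtype instead of a formatted string, a minimal sketch (same example frame; the column name EVENT_DATE simply mirrors the question) is to drop the strftime call:

df.with_columns([
    # pl.date builds a Date column directly from the year/month/day columns
    pl.date("iyear", "imonth", "iday").alias("EVENT_DATE")
])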

Other ways to combine multiple columns in a single expression

Below is a more generic answer to the main question. We can use pl.map to get multiple columns as Series, or, if we know we want to format a string column, we can use pl.format. The map offers the most utility.

df.with_columns([
    # string fmt over multiple expressions
    pl.format("{}-{}-{}", "iyear", "imonth", "iday").alias("date"),
    # columnar lambda over multiple expressions
    pl.map(["iyear", "imonth", "iday"], lambda s: s[0] + "-" + s[1] + "-" + s[2]).alias("date2"),
])

This outputs:

shape: (2, 5)
┌───────┬────────┬──────┬──────────┬──────────┐
│ iyear ┆ imonth ┆ iday ┆ date     ┆ date2    │
│ ---   ┆ ---    ┆ ---  ┆ ---      ┆ ---      │
│ i64   ┆ i64    ┆ i64  ┆ str      ┆ str      │
╞═══════╪════════╪══════╪══════════╪══════════╡
│ 2001  ┆ 1      ┆ 1    ┆ 2001-1-1 ┆ 2001-1-1 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2001  ┆ 2      ┆ 1    ┆ 2001-2-1 ┆ 2001-2-1 │
└───────┴────────┴──────┴──────────┴──────────┘
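
Note that these string results are not zero-padded ("2001-1-1" rather than "2001-01-01"), unlike the strftime output above. If that matters, one sketch is to pad the parts before formatting (this assumes the str.zfill expression is available in your Polars version):

df.with_columns([
    # cast month/day to strings and left-pad with zeros to match "%Y-%m-%d"
    pl.format(
        "{}-{}-{}",
        pl.col("iyear"),
        pl.col("imonth").cast(pl.Utf8).str.zfill(2),
        pl.col("iday").cast(pl.Utf8).str.zfill(2),
    ).alias("date_padded")
])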

Avoid row-wise operations

Though the accepted answer gives a correct result, it is not the recommended way to apply operations over multiple columns in Polars. Accessing rows is tremendously slow: it incurs a lot of cache misses, requires running slow Python bytecode, and kills all parallelization and query optimization.

Note

In this specific case, the map creating string data is not recommended:

pl.map(["iyear", "imonth", "iday"], lambda s: s[0] + "-" + s[1] + "-" + s[2]).alias("date2")

Because of the way memory is laid out, and because we create a new column per string operation, this is actually quite expensive (only with string data). That is why pl.format and pl.concat_str exist.

ritchie46

Polars apply will return the row data as a tuple, so you would need to use numerical indices instead. Example:

import datetime
import polars as pl

df = pl.DataFrame({"iyear": [2020, 2021],
                   "imonth": [1, 2],
                   "iday": [3, 4]})

df['EVENT_DATE'] = df.apply(
        lambda row: datetime.date(year=row[0], month=row[1], day=row[2]).strftime("%Y-%m-%d"))

In case df contains more columns, or the columns are in a different order, you could use apply on df[["iyear", "imonth", "iday"]] rather than df to ensure the indices refer to the right elements.
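
A minimal sketch of that suggestion, following the same eager apply pattern as the snippet above (whether assignment like this works depends on your Polars version, just as for the snippet above):

# restrict apply to the three relevant columns so that row[0], row[1], row[2]
# always refer to iyear, imonth and iday
df['EVENT_DATE'] = df[["iyear", "imonth", "iday"]].apply(
        lambda row: datetime.date(year=row[0], month=row[1], day=row[2]).strftime("%Y-%m-%d"))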

There may be better ways to achieve this, but this comes closest to the Pandas code.

On a separate note, my guess is that you don't want to store the dates as strings, but rather as a proper pl.Date. You could modify the code in this way:

def days_since_epoch(dt):
    return (dt - datetime.date(1970, 1, 1)).days


df['EVENT_DATE_dt'] = df.apply(
        lambda row: days_since_epoch(datetime.date(year=row[0], month=row[1], day=row[2])), return_dtype=pl.Date)

where we first convert the Python date to days since Jan 1, 1970, and then convert to a pl.Date using apply's return_dtype argument. The cast to pl.Date needs an int rather than a Python datetime, as it stores the data as an int ultimately. This is most easily seen by simply accessing the dates:

print(type(df["EVENT_DATE_dt"][0]))  # >>> <class 'int'>
print(type(df["EVENT_DATE_dt"].dt[0]))  # >>> <class 'datetime.date'>

It would be great if the cast could operate on the Python datetime directly.
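
As a quick standalone sanity check of the days_since_epoch helper (not part of the original snippet):

# 2020-01-03 is 18264 days after the Unix epoch (1970-01-01)
print(days_since_epoch(datetime.date(2020, 1, 3)))  # 18264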

Edit: on the discussion about performance vs. Pandas: for both Pandas and Polars, you could speed this up further if you have many duplicate rows (for year/month/day) by using a cache to speed up the apply. I.e.:

from functools import lru_cache

@lru_cache
def row_to_date(row):
    return days_since_epoch(datetime.date(year=row[0], month=row[1], day=row[2]))

df['EVENT_DATE_dt'] = df.apply(row_to_date, return_dtype=pl.Date)

This will improve runtime when there are many duplicate entries, at the expense of some memory. If there are no duplicates, it will probably slow you down.
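
If you want to verify whether the cache actually helps for your data, functools.lru_cache exposes hit/miss statistics:

# inspect cache effectiveness after the apply has run
print(row_to_date.cache_info())  # CacheInfo(hits=..., misses=..., maxsize=128, currsize=...)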

jvz
  • Thanks for the answer, this works! I have two questions. First, do you know a better way to achieve this in Polars? The Polars API reference says this will be slow. The second question is about the second part you posted: why do you use the days since Jan 1, 1970 to convert it to a pl.Date? – seb2704 Nov 09 '21 at 07:21
  • On the first: no, I tried to use the lazy API initially, but I could not find the right bits to do it. Have you tried to time it; is the runtime prohibitive? On the second part, I have edited my answer. – jvz Nov 09 '21 at 09:27
  • Thank you for the explanation :). I compared the execution time of the pandas code with your polars code snippet; polars is still 8 times faster even though it uses the eager API, which is really amazing. Maybe I made a mistake in the measurement; honestly, I can't explain to myself why the difference in execution time is that huge with the eager API. – seb2704 Nov 09 '21 at 10:02
  • In my experience, Pandas `apply` is really slow, so that may be correct. Depending on what your data looks like, you could also consider caching the calls to the lambda. See my edit at the bottom of my answer. – jvz Nov 09 '21 at 10:51
  • Just for completion, regarding the matter of the Polars execution being "8 times faster" than Pandas': I'd guess that your CPU has 8 logical processors available. Polars is able to parallelize some executions even when not using the lazy API, so that could explain the 8x speedup. However, as pointed out in @ritchie46's answer, the true power of Polars (and Pandas, by the way) appears when we work primarily with vectorized (column-wise) operations instead of row-wise operations (i.e., use Polars' Expressions API, and not .apply()). – Vinícius Queiroz Mar 15 '23 at 23:39