0

I have a python polars dataframe that is quite large where Pandas runs into memory errors. I want to use python polars but am running into an issue of taking a integer representation of date to make two new columns: PeriodDate, and LagDate. I can do this on a sample in Pandas using the following:

df['PeriodDate'] = pd.to_datetime(df['IntegerDate'],format='%Y%m')
df['LaggedDate'] = df['PeriodDate'] - pd.DateOffset(months=1)

I have tried to do the following:

df.with_columns(
  pl.col('IntegerDate').str.strptime(pl.Datetime,"%Y%m")
)

SchemaError: Series of dtype: Int64 != Utf8.

For reference the 'IntegerDate' column is of the format: 202005, 202006, ...etc

I haven't been able to find good examples of how to do this in polars so any help would be greatly appreciated.

Thanks!

Drthm1456
  • 409
  • 9
  • 17

1 Answers1

1

Here is Polars implementation

import polars as pl
from datetime import datetime as dt

df = pl.DataFrame({'IntegerDate': [202005, 202006, 202207, 202303, 202109]})

df = df.with_columns(pl.col('IntegerDate').cast(pl.Utf8).str.strptime(pl.Date, "%Y%m").alias('PeriodDate'))

df

Output

┌─────────────┬────────────┐
│ IntegerDate ┆ PeriodDate │
│ ---         ┆ ---        │
│ i64         ┆ date       │
╞═════════════╪════════════╡
│ 202005      ┆ 2020-05-01 │
│ 202006      ┆ 2020-06-01 │
│ 202207      ┆ 2022-07-01 │
│ 202303      ┆ 2023-03-01 │
│ 202109      ┆ 2021-09-01 │
└─────────────┴────────────┘
# add `LaggedDate`
df = df.with_columns(df['PeriodDate'].dt.offset_by("-1mo").alias('LaggedDate'))
df

Final Output

┌─────────────┬────────────┬────────────┐
│ IntegerDate ┆ PeriodDate ┆ LaggedDate │
│ ---         ┆ ---        ┆ ---        │
│ i64         ┆ date       ┆ date       │
╞═════════════╪════════════╪════════════╡
│ 202005      ┆ 2020-05-01 ┆ 2020-04-01 │
│ 202006      ┆ 2020-06-01 ┆ 2020-05-01 │
│ 202207      ┆ 2022-07-01 ┆ 2022-06-01 │
│ 202303      ┆ 2023-03-01 ┆ 2023-02-01 │
│ 202109      ┆ 2021-09-01 ┆ 2021-08-01 │
└─────────────┴────────────┴────────────┘

more on date offsets - https://pola-rs.github.io/polars/py-polars/html/reference/series/api/polars.Series.dt.offset_by.html

ignoring_gravity
  • 6,677
  • 4
  • 32
  • 65
Scarface
  • 359
  • 2
  • 13
  • 1
    You can use `pl.col("IntegerDate").cast(pl.Utf8).str.strptime(pl.Date, fmt="%Y%m")` for the first part instead of [`.apply`](https://stackoverflow.com/a/75351869/19355181) – jqurious Feb 07 '23 at 05:58