0

How does one read a csv into a polar DataFrame and parse one of the columns as a datetime?

Alternatively, how does one convert a column to a pl.datetime?

FObersteiner
  • 22,500
  • 8
  • 42
  • 72
Test
  • 962
  • 9
  • 26

2 Answers2

3

I would first try parse_dates=True in the read_csv call.

For example, let's say we have the following data:

import polars as pl
from io import StringIO
my_csv = StringIO(
"""
ID,start,last_updt,end
1,2008-10-31, 2020-11-28 12:48:53,12/31/2008
2,2007-10-31, 2021-11-29 01:37:20,12/31/2007
3,2006-10-31, 2021-11-30 23:22:05,12/31/2006
"""
)

pl.read_csv(my_csv, parse_dates=True)
shape: (3, 4)
┌─────┬────────────┬─────────────────────┬────────────┐
│ ID  ┆ start      ┆ last_updt           ┆ end        │
│ --- ┆ ---        ┆ ---                 ┆ ---        │
│ i64 ┆ date       ┆ datetime[μs]        ┆ str        │
╞═════╪════════════╪═════════════════════╪════════════╡
│ 1   ┆ 2008-10-31 ┆ 2020-11-28 12:48:53 ┆ 12/31/2008 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2   ┆ 2007-10-31 ┆ 2021-11-29 01:37:20 ┆ 12/31/2007 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3   ┆ 2006-10-31 ┆ 2021-11-30 23:22:05 ┆ 12/31/2006 │
└─────┴────────────┴─────────────────────┴────────────┘

The start column parsed as a Date, and the last_updt column parsed as a Datetime. But notice that the end column did not parse as a date because it is not in ISO 8601 format. (I've come across plenty of csv files where Date/Datetime fields were non-standard.)

To parse this column, we can use the strptime function and supply the appropriate format.

pl.read_csv(my_csv, parse_dates=True).with_column(pl.col('end').str.strptime(pl.Date, '%m/%d/%Y'))
shape: (3, 4)
┌─────┬────────────┬─────────────────────┬────────────┐
│ ID  ┆ start      ┆ last_updt           ┆ end        │
│ --- ┆ ---        ┆ ---                 ┆ ---        │
│ i64 ┆ date       ┆ datetime[μs]        ┆ date       │
╞═════╪════════════╪═════════════════════╪════════════╡
│ 1   ┆ 2008-10-31 ┆ 2020-11-28 12:48:53 ┆ 2008-12-31 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2   ┆ 2007-10-31 ┆ 2021-11-29 01:37:20 ┆ 2007-12-31 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3   ┆ 2006-10-31 ┆ 2021-11-30 23:22:05 ┆ 2006-12-31 │
└─────┴────────────┴─────────────────────┴────────────┘

strptime can also be used with Datetime columns.

1

Polars supports two csv readers, one built-in and one based on pyarrow. The pyarrow reader supports parsing dates directly; see also https://github.com/pola-rs/polars/issues/1330. You can set use_pyarrow=True in read_csv, but as per the documentation, it will only be used given also the other parameter inputs into read_csv.

Alternatively, read as Utf8 (string), and parse to string with strptime: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.internals.series.StringNameSpace.strptime.html?highlight=strptime#polars.internals.series.StringNameSpace.strptime. This is the method I find easier typically, but may, depending on the size of your data, be relatively expensive as you first need to store as Utf8 and then parse.

jvz
  • 1,183
  • 6
  • 13