
I've recently been using Polars for a project I'm starting to develop. I've come across several problems, but with the info here and in the docs I have solved those issues.

My issue: when I save the dataframe, it stores the datetime data like this:

1900-01-01T18:00:00.000000000

(screenshot: dataframe loaded from the saved file)

Before saving, it shows like this in my console (I have checked; the type is object): 1900-01-01 18:00:00

(screenshot: dataframe before saving)

Code:

# My column is a string like "1234", which means 12:34, so I apply the
# following transformation:
df = df2.with_columns([
    pl.col('initial_band_time').apply(lambda x: datetime.datetime.strptime(x, '%H%M')),
    pl.col('final_band_time').apply(lambda x: datetime.datetime.strptime(x, '%H%M')),
])

df = df.drop('version').rename({'day_type': 'day'})
print(df)
print(df.dtypes)

#output:  <class 'polars.datatypes.Datetime'>, <class 'polars.datatypes.Datetime'>
# I save it with write_csv:
df.write_csv('data/trp_occupation_level_emt_cleaned.csv', sep=",")


dfnew = pl.read_csv('data/trp_occupation_level_emt_cleaned.csv')

# print new df
print(dfnew.head())
print(dfnew.dtypes)
# output: <class 'polars.datatypes.Utf8'>, <class 'polars.datatypes.Utf8'>

I know I can read the CSV with parse_dates=True, but this dataframe is consumed by a database, so I need to export it with the dates already parsed.

2 Answers


Polars does not parse string data as dates automatically.

But you can easily turn that on by setting the parse_dates keyword argument:

pl.read_csv("myfile.csv", parse_dates=True)
ritchie46
  • As I commented, I already know this trick; I need to save it as datetime because I'm passing that csv to a different service. Still, thanks for the answer, ritchie. – luis.martinez.pro Apr 08 '22 at 12:21
  • Then set that keyword argument at that service? In any case, if you want your type data to be conserved you should not use csv as a format, but use one of the binary formats like IPC/feather or parquet. – ritchie46 Apr 08 '22 at 14:52

It sounds like you want to specify the formatting of Date and Datetime fields in an output csv file - to conform with the formatting requirements of an external application (e.g., database loader).

We can do that easily using the strftime format function. Basically, we will convert the Date/Datetime fields to strings, formatted as we need them, just before we write the csv file. This way, the csv output writer will not alter them.

For example, let's start with this data:

from io import StringIO
import polars as pl

my_csv = """sample_id,initial_band_time,final_band_time
1,2022-01-01T18:00:00,2022-01-01T18:35:00
2,2022-01-02T19:35:00,2022-01-02T20:05:00
"""

df = pl.read_csv(StringIO(my_csv), parse_dates=True)
print(df)
shape: (2, 3)
┌───────────┬─────────────────────┬─────────────────────┐
│ sample_id ┆ initial_band_time   ┆ final_band_time     │
│ ---       ┆ ---                 ┆ ---                 │
│ i64       ┆ datetime[μs]        ┆ datetime[μs]        │
╞═══════════╪═════════════════════╪═════════════════════╡
│ 1         ┆ 2022-01-01 18:00:00 ┆ 2022-01-01 18:35:00 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2         ┆ 2022-01-02 19:35:00 ┆ 2022-01-02 20:05:00 │
└───────────┴─────────────────────┴─────────────────────┘

Now, we'll apply the strftime function with the format specifier %F %T.

df = df.with_column(pl.col(pl.Datetime).dt.strftime(fmt="%F %T"))
print(df)
shape: (2, 3)
┌───────────┬─────────────────────┬─────────────────────┐
│ sample_id ┆ initial_band_time   ┆ final_band_time     │
│ ---       ┆ ---                 ┆ ---                 │
│ i64       ┆ str                 ┆ str                 │
╞═══════════╪═════════════════════╪═════════════════════╡
│ 1         ┆ 2022-01-01 18:00:00 ┆ 2022-01-01 18:35:00 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2         ┆ 2022-01-02 19:35:00 ┆ 2022-01-02 20:05:00 │
└───────────┴─────────────────────┴─────────────────────┘

Notice that our Datetime fields have been converted to strings (the 'str' in the column header).

And here's a pro tip: notice that I'm using a datatype wildcard expression in the col expression: pl.col(pl.Datetime). This way, you don't need to specify each Datetime field; Polars will automatically convert them all.

Now, when we write the csv file, we get the following output.

df.write_csv('/tmp/tmp.csv')

Output csv:

sample_id,initial_band_time,final_band_time
1,2022-01-01 18:00:00,2022-01-01 18:35:00
2,2022-01-02 19:35:00,2022-01-02 20:05:00

You may need to play around with the format specifier until you find one that your external application will accept. Here's a handy reference for format specifiers.

Here's another trick: you can do this step just before writing the csv file:

df.with_column(pl.col(pl.Datetime).dt.strftime(fmt="%F %T")).write_csv('/tmp/tmp.csv')

This way, your original dataset is not changed ... only the copy that you intend to write to a csv file.

BTW, I use this trick all the time when writing csv files that I intend to use in spreadsheets. I often just want the "%F" (date) part of the datetime, not the "%T" part (time). It just makes parsing easier in the spreadsheet.