How can I process a column containing datetimes in two formats, both "%Y-%m-%dT%H:%M"
, and "%Y-%m-%dT%H:%M:%S"
?
MWE showing what I'm trying to do:
from pandera.engines import pandas_engine
from pathlib import Path
import io
import pandas as pd
import pandera as pa
# this doesn't work
data = 'date_column\n2020-11-26T02:06:30\n2020-11-22T01:49\n'
df = pd.read_csv(io.StringIO(data))
schema = pa.DataFrameSchema(
{
"date_column": pa.Column(
pandas_engine.DateTime(
to_datetime_kwargs = {
"format":"%Y-%m-%dT%H:%M:%S"},
tz = "Europe/London")
),
},
coerce=True
)
new_df = schema.validate(df)
Which gives the following errors for different format strings:
# using format: "%Y-%m-%dT%H:%M"
# pandera.errors.SchemaError:
# Error while coercing 'date_column' to type datetime64[ns, Europe/London]:
# Could not coerce <class 'pandas.core.series.Series'> data_container into type datetime64[ns, Europe/London]:
# index failure_case
# 0 0 2020-11-26T02:06:30
And:
# using format: "%Y-%m-%dT%H:%M:%S"
# pandera.errors.SchemaError:
# Error while coercing 'date_column' to type datetime64[ns, Europe/London]:
# Could not coerce <class 'pandas.core.series.Series'> data_container into type datetime64[ns, Europe/London]:
# index failure_case
# 0 1 2020-11-22T01:49