0

How can I process a column containing datetimes in two formats, both "%Y-%m-%dT%H:%M", and "%Y-%m-%dT%H:%M:%S" ?

MWE showing what I'm trying to do:

from pandera.engines import pandas_engine
from pathlib import Path
import io
import pandas as pd
import pandera as pa

# this doesn't work
data = 'date_column\n2020-11-26T02:06:30\n2020-11-22T01:49\n'
df = pd.read_csv(io.StringIO(data))

schema = pa.DataFrameSchema(
        {
            "date_column": pa.Column(
                        pandas_engine.DateTime(
                            to_datetime_kwargs = {
                                "format":"%Y-%m-%dT%H:%M:%S"},
                            tz = "Europe/London")
                        ),
        }, 
        coerce=True
)
new_df = schema.validate(df)

Which gives the following errors for different format strings:

# using format: "%Y-%m-%dT%H:%M"

# pandera.errors.SchemaError: 
# Error while coercing 'date_column' to type datetime64[ns, Europe/London]: 
# Could not coerce <class 'pandas.core.series.Series'> data_container into type datetime64[ns, Europe/London]:
#    index         failure_case
# 0      0  2020-11-26T02:06:30

And:

# using format: "%Y-%m-%dT%H:%M:%S"

# pandera.errors.SchemaError:
# Error while coercing 'date_column' to type datetime64[ns, Europe/London]: 
# Could not coerce <class 'pandas.core.series.Series'> data_container into type datetime64[ns, Europe/London]:
#    index      failure_case
# 0      1  2020-11-22T01:49
baxx
  • 3,956
  • 6
  • 37
  • 75

1 Answers1

1

pandas does not support multiple datetime formats in the same column. A workaround can be the following:

schema = pa.DataFrameSchema(
    {
        "date_column": pa.Column(
            pandas_engine.DateTime(
                to_datetime_kwargs = {
                    "infer_datetime_format": True},
                tz = "Europe/London"
            )
        ),
    }, 
    coerce=True
)
Norhther
  • 545
  • 3
  • 15
  • 35