I'm looking for a way to detect the strftime-style format of a datetime string in Python. All datetime libraries I've found offer functionality for parsing a string into a datetime object, but I would like to detect the format or pattern that can be passed to the datetime.strptime format parameter.

Why? I'm dealing with long lists (or series) of datetime strings and using dateutil.parser to parse them is too inaccurate and slow.

  • Slow: It will check for all potential formats every single time, although all entries per list are of the same format (in my case).
  • Inaccurate: Ambiguous entries will be parsed in one out of multiple ways without drawing knowledge from other entries that are not ambiguous.
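To make the ambiguity concrete (using the day-first/month-first example from the comments): with the standard library alone, two different formats accept the same string without error, so one value in isolation cannot settle which format the column uses:

```python
from datetime import datetime

# Both a day-first and a month-first format happily parse the same string.
s = "10-11-2012"
dmy = datetime.strptime(s, "%d-%m-%Y")  # Nov 10, 2012
mdy = datetime.strptime(s, "%m-%d-%Y")  # Oct 11, 2012
print(dmy.date(), mdy.date())  # 2012-11-10 2012-10-11
```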

So instead I would like to detect the format. Once I have that, I can use the to_datetime function in polars to create a datetime series in a faster manner.
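To illustrate the intended pipeline: once the format has been detected, every value in the column is parsed with that one fixed format, with no per-row guessing. A minimal stdlib sketch of that second step (the format and sample values here are assumed; the same detected format string is what would be handed to polars):

```python
from datetime import datetime

fmt = "%Y-%m-%d %H:%M"  # assume this format was detected from the data
col = ["2023-01-01 22:23", "2023-05-29 08:55"]

# One fixed format for every row: fast and unambiguous.
parsed = [datetime.strptime(v, fmt) for v in col]
print(parsed[0])  # 2023-01-01 22:23:00
```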

I couldn't find such functionality in more modern datetime libs like pendulum. I also implemented my own version that iterates through a fixed list of formats and checks whether the string can be parsed with datetime.strptime, like so:

import datetime

patterns = [
    "%Y.%m.%d %H:%M:%S",
    "%Y-%m-%d %H:%M",
    "%Y-%m-%d",
    ...
]

def detect_format(col):
    for pattern in patterns:
        try:
            for val in col:
                datetime.datetime.strptime(val, pattern)
            return pattern
        except ValueError:
            continue

This doesn't strike me as an elegant solution and I was wondering if there's a better way to do it or even a library available that does this sort of thing.

pietz
  • concerning "inaccuracy": I think there are cases to keep in mind that will be ambiguous no matter what library / algorithm you use. For instance, is 10-11-2012 Oct 11 or Nov 10? It will be up to you to make an unambiguous specification. – FObersteiner May 29 '23 at 15:15
  • That's exactly my point. Those examples alone are ambiguous, but since I'm using the function to analyze a long list of values (all with the same format), the probability of all of them being ambiguous is basically zero. At some point, a value will give away whether it's DMY or MDY. That's what I want to take advantage of, which dateutil cannot. – pietz May 29 '23 at 19:33

4 Answers


What about outsourcing your task to pandas to make use of its new mixed date-parsing (`format="mixed"`, available since pandas 2.0)?

import pandas as pd
import polars as pl

dt_strs = [
    "Mon, 29 May 2023 13:15:09 +0000",  # Day, DD Month YYYY HH:MM:SS +0000
    "10/01/2020 12:15:33",              # MM/DD/YYYY HH:MM:SS
    "2020-08-01",                       # YYYY-MM-DD
    "08:55",                            # HH:MM
    "2019.09.18T18:51:57",              # YYYY.MM.DDTHH:MM:SS
    "11:29:10",                         # HH:MM:SS
    "23/05/2022 03:30:00 +0500",        # DD/MM/YYYY HH:MM:SS +0000
    "02.28.19",                         # MM.DD.YY
    "2023-01-01 22:23",                 # YYYY-MM-DD HH:MM
    "31 jul, 2022",                     # DD Month, YYYY
    "2021/12/18 06:13:08",              # YYYY/MM/DD HH:MM:SS
    "2023",                             # YYYY
]

pl_ser = pl.from_pandas(pd.to_datetime(dt_strs, format="mixed").to_series(name="dts"))

Output:

print(pl_ser)

shape: (12,)
Series: 'dts' [datetime[μs, UTC]]
[
    2023-05-29 13:15:09 UTC
    2020-10-01 12:15:33 UTC
    2020-08-01 00:00:00 UTC
    2023-05-29 08:55:00 UTC
    2019-09-18 18:51:57 UTC
    2023-05-29 11:29:10 UTC
    2022-05-22 22:30:00 UTC
    2019-02-28 00:00:00 UTC
    2023-01-01 22:23:00 UTC
    2022-07-31 00:00:00 UTC
    2021-12-18 06:13:08 UTC
    2023-01-01 00:00:00 UTC
]
Timeless
  • **NB**: Each element of the index returned by [`pd.to_datetime`](https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html) is a `datetime.datetime` *object*. – Timeless May 29 '23 at 09:27
  • I removed the part of my answer with Pandas. Yours is more complete with `format='mixed'` – Corralien May 29 '23 at 09:37
  • Thanks Corralien, the mixed parsing seems to do the job ;) – Timeless May 29 '23 at 12:51
  • I prefer my answer in pure python ;) lol. – Corralien May 29 '23 at 12:54
  • also important to note: "*“mixed”, to infer the format for each element individually. This is risky, and you should probably use it along with dayfirst.*" (docs) – FObersteiner May 29 '23 at 15:14
  • Not part of the question but I really don't want the pandas dependency. I'm running serverless code where size matters. Adding pandas and numpy makes it too heavy. Good approach though. – pietz May 31 '23 at 18:52

I don't see another way to convert your values to datetime using only Python. However, I think you should use another variable to remember the last successful pattern to increase the speed. Something like:

import datetime

patterns = [
        "%Y-%m-%d %H:%M:%S",
        "%Y.%m.%d %H:%M:%S",
        "%Y-%m-%d",
]


def to_datetime(values, errors='raise'):

    last_success_pattern = patterns[0] 
    dti = []

    for value in values:
        try:
            dt = datetime.datetime.strptime(value, last_success_pattern)
        except ValueError:
            for pattern in patterns:
                try:
                    dt = datetime.datetime.strptime(value, pattern)
                except ValueError:
                    continue
                else:
                    last_success_pattern = pattern
                    break
            else:  # no pattern matched this value
                if errors == 'raise':
                    raise ValueError(f'Unknown format for "{value}"')
                elif errors == 'coerce':
                    dt = None  # invalid value is set to None
                elif errors == 'ignore':
                    dt = value  # invalid value returns the input

        dti.append(dt)
    return dti

if __name__ == '__main__':
    print(to_datetime(['2023.01.01 11:22:33', '2023.01.02 14:15:16', '2023-01-03', '2023-01-01T21:12:33.078196+00:00'], errors='coerce'))

Output:

>>> to_datetime(['2023.01.01 11:22:33', '2023.01.02 14:15:16', 
                 '2023-01-03', '2023-01-01T21:12:33.078196+00:00'],
                errors='coerce')

[datetime.datetime(2023, 1, 1, 11, 22, 33),
 datetime.datetime(2023, 1, 2, 14, 15, 16),
 datetime.datetime(2023, 1, 3, 0, 0),
 None]
Corralien
  • I'd prefer this approach as it is deterministic in a sense; it leaves less room for ambiguity / gives more control over what is happening. However, there might still be cases where parsing is successful but the result is wrong, e.g. the day-first vs. month-first cases. – FObersteiner May 29 '23 at 15:11
  • I found a way around strptime, which in itself is surprisingly slow, and posted the regex-based answer here as well. Thanks though. – pietz May 31 '23 at 18:53

In summary, you are presented with various dfs, each with an internally consistent datetime format, but no two dfs necessarily sharing the same format. How to deal with that?

You have:

patterns = [
    "%Y.%m.%d %H:%M:%S",
    "%Y-%m-%d %H:%M",
    "%Y-%m-%d",
]

From the polars strptime docs, you can use coalesce if you have internally inconsistent formats:

One idea (probably not for you)

(
    df
    .with_columns(Date=pl.coalesce(
        pl.col('Date').str.strptime(pl.Datetime(), x, strict=False) for x in patterns
        ))
)

This can be problematic if you have a mix of formats where many will work but only one is right.

Second alternative (might be good for you)

for pattern in patterns:
    try:
        df = df.with_columns(Date=pl.col('Date').str.strptime(pl.Datetime(), pattern))
        break
    except Exception:  # a failed parse raises a ComputeError
        pass

Third alternative (might be good for you)

You can also use the pandas to_datetime without converting your whole df to pandas.

(
    df
        .with_columns(
            Date=pl.col('Date')
              .map(lambda x: pl.Series(pd.to_datetime(x.to_numpy(), format='mixed'))))
)

You can make that look a bit neater if you define a helper function instead of using a lambda, like this:

def mixed_strptime(x):
    return pl.Series(pd.to_datetime(x.to_numpy(), format='mixed'))

(
    df
        .with_columns(
            Date=pl.col('Date').map(mixed_strptime))
)
Dean MacGregor
  • Thanks for this. Options 2 and 3 look good, but they're just too slow for the number of patterns I want to support. I went with a regex approach instead; also posted here. – pietz May 31 '23 at 18:54

Thanks to everyone who contributed. I went down the rabbit hole myself, only to notice that it's indeed a deep rabbit hole.

The first approach was based on my initial idea extended by a create_dt_patterns() function, that generates a wide combination of patterns. After this list exceeded 500 patterns, I tweaked the code a bit to produce only patterns that fit the separators in the string. I combined this with a counter to only return the most frequently used pattern. This helps with ambiguous results. If you're interested in my potentially ugly and incomplete code, here you go:

from datetime import datetime

def get_dt_format(col: list, patterns: dict, n: int = 100):
    for val in col[:n]:
        for pattern in patterns.keys():
            try:
                datetime.strptime(val, pattern)
                patterns[pattern] += 1
            except ValueError:
                pass

    if sum(patterns.values()) == 0:
        return False

    return max(patterns, key=patterns.get)

def create_dt_patterns(dt_str: str):
    dates = create_date_patterns(dt_str)
    times = create_time_patterns(dt_str)
    seps = create_dt_separators(dt_str)
    dts = [d + s + t for d in dates for s in seps for t in times]
    dts += dates + times
    return {x: 0 for x in dts}


def create_dt_separators(dt_str: str):
    seps = [x for x in ["T", " "] if x in dt_str]
    return [""] if len(seps) == 0 else seps


def create_time_patterns(dt_str: str):
    time_opts = ["%H{0}%M{0}%S.%f", "%H{0}%M{0}%S", "%H{0}%M"]
    return [t.format(":" if ":" in dt_str else "") for t in time_opts]


def create_date_patterns(dt_str: str):
    date_opts = [
        "%Y{0}%m{0}%d",
        "%m{0}%d{0}%Y",
        "%d{0}%m{0}%Y",
        "%y{0}%m{0}%d",
        "%m{0}%d{0}%y",
        "%d{0}%m{0}%y",
    ]
    dates = [d.format(s) for d in date_opts for s in ["-", "/", "."] if s in dt_str]
    if len(dates) == 0:
        dates = [d.format("") for d in date_opts]
    return dates

Then I took a shower because I felt dirty writing this code and also noticed that strptime is not a very fast function. Doing this for n entries using m different patterns over k columns takes some time.

The next stop was a RegEx approach, which is much quicker and made me feel a little better about myself. It's roughly 30x faster than trying different patterns.
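To sketch why the regex route wins (the pattern names and setup below are mine, not the repo's benchmark): a precompiled regex scans the string once, while format detection via strptime pays for a raised and caught exception on every pattern that doesn't fit:

```python
import re
import timeit
from datetime import datetime

val = "2021-12-18 06:13:08"
patterns = ["%d.%m.%Y", "%m/%d/%Y %H:%M", "%Y.%m.%d", "%Y-%m-%d %H:%M:%S"]
ISO_RE = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}$")

def by_strptime():
    # Pays one raised/caught ValueError per non-matching pattern.
    for p in patterns:
        try:
            datetime.strptime(val, p)
            return p
        except ValueError:
            continue

def by_regex():
    # Single scan of the string, no exceptions involved.
    return "%Y-%m-%d %H:%M:%S" if ISO_RE.match(val) else None

assert by_strptime() == by_regex()
print(timeit.timeit(by_strptime, number=10_000))
print(timeit.timeit(by_regex, number=10_000))
```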

Since I'm planning to test and develop this logic over time, I created a GitHub repo for it. I'll also post it here, though I probably won't update this copy:

import re

DATE_RE = r"(?P<date>\d{2,4}[-/.]\d{2}[-/.]\d{2,4})?"
SEP_RE = r"(?P<sep>\s|T)?"
TIME_RE = r"(?P<time>\d{2}:\d{2}(:\d{2})?\s*([AP]M)?)?"
FULL_RE = DATE_RE + SEP_RE + TIME_RE
YMD_RE = r"^(?P<ay>(?:[12][0-9])?[0-9]{2})(?P<bs>[-/.])(?P<cm>0[1-9]|1[0-2])(?P<ds>[-/.])(?P<ed>0[1-9]|[12][0-9]|3[01])$"
DMY_RE = r"^(?P<ad>0[1-9]|[12][0-9]|3[01])(?P<bs>[-/.])(?P<cm>0[1-9]|1[0-2])(?P<ds>[-/.])(?P<ey>(?:[12][0-9])?[0-9]{2})$"
MDY_RE = r"^(?P<am>0[1-9]|1[0-2])(?P<bs>[-/.])(?P<cd>0[1-9]|[12][0-9]|3[01])(?P<ds>[-/.])(?P<ey>(?:[12][0-9])?[0-9]{2})$"
HMS_RE = r"^(?P<aH>\d{1,2})(?P<bs>:?)(?P<cM>\d{2})(?:(?P<ds>:?)(?P<eS>\d{2}))?(?:(?P<fs>\s)?(?P<gp>[AP]M))?$"


def guess_datetime_format(values: list[str], n=100, return_dict=False):
    di = {}
    for val in values[:n]:
        if val is None:
            continue
        fmts = datetime_formats(val)
        for fmt in fmts:
            di[fmt] = di.get(fmt, 0) + 1

    if len(di) == 0:
        return None

    if return_dict:
        return di

    return max(di, key=di.get)


def datetime_formats(value: str) -> list:
    assert "," not in value  # TODO: handle these cases
    m = re.match(FULL_RE, value)
    dates = "" if m["date"] is None else date_formats(m["date"])
    sep = "" if m["sep"] is None else m["sep"]
    time = "" if m["time"] is None else time_format(m["time"])
    return [date + sep + time for date in dates]


def date_formats(date_value: str) -> list:
    matches = []
    for p in [YMD_RE, MDY_RE, DMY_RE]:
        m = re.match(p, date_value)
        if m is None:
            continue
        fmt = ""
        for c in sorted(m.groupdict().keys()):
            if c[1] == "s":  # separator character
                fmt += "" if m[c] is None else m[c]
            else:  # year, month, day
                fmt += "%" + c[1] if len(m[c]) == 2 else "%Y"
        matches.append(fmt)
    return matches


def time_format(time_value: str) -> str:
    m = re.match(HMS_RE, time_value)
    fmt = ""
    for c in sorted(m.groupdict().keys()):
        if c[1] == "s":  # separator character
            fmt += "" if m[c] is None else m[c]
        else:
            fmt += "" if m[c] is None else "%" + c[1]
    if "M" in time_value:  # AM or PM
        fmt = fmt.replace("%H", "%I")
    return fmt
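The core trick, reduced to a self-contained illustration (the pattern and helper name here are mine, not from the repo): name each regex group after the strptime directive it stands for, then assemble the format string from the groups that matched:

```python
import re
from datetime import datetime

# Group names carry the strptime directive letters; the separators are captured too.
ISO_DATE = re.compile(
    r"^(?P<Y>\d{4})(?P<s1>[-/.])(?P<m>0[1-9]|1[0-2])(?P<s2>[-/.])(?P<d>0[1-9]|[12]\d|3[01])$"
)

def iso_date_format(value: str):
    m = ISO_DATE.match(value)
    if m is None:
        return None
    g = m.groupdict()
    # Rebuild the format from directives plus the separators actually used.
    return f"%Y{g['s1']}%m{g['s2']}%d"

fmt = iso_date_format("2021.12.18")
print(fmt)                                   # %Y.%m.%d
print(datetime.strptime("2021.12.18", fmt))  # 2021-12-18 00:00:00
```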
pietz
  • Did you compare performance between your regex solution and my solution with the cache of the last successful pattern? – Corralien May 31 '23 at 19:05
  • Can you post your benchmark code where you're getting a 30x improvement using a Python regex instead of using native polars? – Dean MacGregor May 31 '23 at 19:25
  • Will post tomorrow. To be clear, the 30x refers to the comparison between both of my solutions. And yes, a combination of regexes is faster than trying a large number of patterns and attempting to parse a column with each. – pietz May 31 '23 at 20:39
  • @DeanMacGregor I added a quick benchmark to the repo. On my MacBook my approach is 2800x faster than your second alternative using 348 patterns. Corralien's answer gave me errors, so I removed it. The regex approach was also 300x faster than my follow-up idea. I guess I missed a magnitude when I mentioned 30x. – pietz Jun 01 '23 at 08:10