3

I have some duration type data (lap times) as pl.Utf8 that fails to convert using strptime, whereas regular datetimes work as expected.

Minutes (before :) and Seconds (before .) are always padded to two digits, Milliseconds are always 3 digits.

Lap times are always < 2 min.

df = pl.DataFrame({
    "lap_time": ["01:14.007", "00:53.040", "01:00.123"]
})

df = df.with_columns(
    [
        # pl.col('release_date').str.strptime(pl.Date, fmt="%B %d, %Y"), # works
        pl.col('lap_time').str.strptime(pl.Time, fmt="%M:%S.%3f").cast(pl.Duration), # fails
    ]
)

So I used the chrono format specifier definitions from https://docs.rs/chrono/latest/chrono/format/strftime/index.html which are used as per the polars docs of strptime

the second conversion (for lap_time) always fails, no matter whether I use .%f, .%3f, %.3f. Apparently, strptime doesn't allow creating a pl.Duration directly, so I tried with pl.Time but it fails with error:

ComputeError: strict conversion to dates failed, maybe set strict=False

but setting strict=False yields all null values for the whole Series.

Am I missing something or this some weird behavior on chrono's or python-polars part?

vvvvv
  • 25,404
  • 19
  • 49
  • 81
Dorian
  • 33
  • 8
  • 2
    I don't think you actually want to parse the duration to a *time* here - what if the duration exceeds 24 hours ?! Instead, you're looking for something like [pandas.to_timedelta](https://pandas.pydata.org/docs/reference/api/pandas.to_timedelta.html). – FObersteiner Mar 06 '23 at 18:28
  • you're right i don't want that, it was just a crutch because `strptime` doesn't take a `pl.Duration` as input. accepted answer figured out the culprit – Dorian Mar 06 '23 at 21:59

3 Answers3

3

General case

In case you have duration that may exceed 24 hours, you can extract data (minutes, seconds and so on) from string using regex pattern. For example:

df = pl.DataFrame({
    "time": ["+01:14.007", "100:20.000", "-05:00.000"]
})

df.with_columns(
    pl.col("time").str.extract_all(r"([+-]?\d+)")
    #                                /
    #                 you will get array of length 3
    #                 ["min", "sec", "ms"]
).with_columns(
    pl.duration(
        minutes=pl.col("time").arr.get(0),
        seconds=pl.col("time").arr.get(1),
        milliseconds=pl.col("time").arr.get(2)
    ).alias("time")
)
┌──────────────┐
│ time         │
│ ---          │
│ duration[ns] │
╞══════════════╡
│ 1m 14s 7ms   │
│ 1h 40m 20s   │
│ -5m          │
└──────────────┘

About pl.Time

To convert data to pl.Time, you need to specify hours as well. When you add 00 hours to your time, code will work:

df = pl.DataFrame({"str_time": ["01:14.007", "01:18.880"]})

df.with_columns(
    duration = (pl.lit("00:") + pl.col("str_time"))\
        .str.strptime(pl.Time, fmt="%T%.3f")\
        .cast(pl.Duration)
)
┌───────────┬──────────────┐
│ str_time  ┆ duration     │
│ ---       ┆ ---          │
│ str       ┆ duration[μs] │
╞═══════════╪══════════════╡
│ 01:14.007 ┆ 1m 14s 7ms   │
│ 01:18.880 ┆ 1m 18s 880ms │
└───────────┴──────────────┘
glebcom
  • 1,131
  • 5
  • 14
  • 1
    that's it, must've missed it in the docs. thank you, will update post with adapted solution – Dorian Mar 06 '23 at 22:01
  • 1
    I should undo my upvote... This is all a hack. Does this solution work for `Duration`'s bigger than 24 hours? I think `pl.Time` cannot be bigger than 24 as it represents time since midnight. Also, the second method assumes that none of the times will be >=100 hours, hence not robust code. Better use `str.extract` with regex: https://pola-rs.github.io/polars/py-polars/html/reference/series/api/polars.Series.str.extract.html – Cornelius Roemer Mar 07 '23 at 14:59
  • As expected both proposed solutions fail for the reasonable generalization: `df = pl.DataFrame({"str_time": ["100:00.000", "-05:00.000"]})` I hope you adapt your answer as this is exactly the type of StackOverflow answer that causes harm because it isn't robust. Also fails if you try to pass a duration of 30 hours like `30:00:00` or negative `-05:00:00` – Cornelius Roemer Mar 07 '23 at 15:28
  • Nice @glebcom - now it should work for this format - can't think of reasonable edge cases. A proper parser would still be better but that's beyond what one can expect from a SO answer :) – Cornelius Roemer Mar 07 '23 at 18:20
1

Create your own parser - strptime works for DateTime stamps only, not for time deltas. The accepted answer is bad practice as it fails for reasonable inputs like durations of 80 minutes, or negative durations.

You can use pl.Series.str.extract() to make your own regex parser and extract the values you want before passing them into the Duration constructor.

As far as I'm aware there is no "duration stamp" parser in Rust. Maybe good idea for a crate if anyone is reading this. Syntax could be similar to strptime but handle cases like: negative duration, non-wrapping for the most significant "digit"/subunit, in this case where it's a "minute duration stamp" you would wrap seconds at 60 but not minutes. Especially making sure that 61 remains 61.

Cornelius Roemer
  • 3,772
  • 1
  • 24
  • 55
0

Code adapted from glebcom's answer:

df = df.with_columns(
    [
        # pl.col('release_date').str.strptime(pl.Date, fmt="%B %d, %Y"), # works
        pl.duration(
            minutes=pl.col("lap_time").str.slice(0,2),
            seconds=pl.col("lap_time").str.slice(3,2),
            milliseconds=pl.col("lap_time").str.slice(6,3)
        ).alias('lap_time'),
    ]
)

This answer was posted as an edit to the question Trouble with strptime() conversion of duration time string by the OP Dorian under CC BY-SA 4.0.

vvvvv
  • 25,404
  • 19
  • 49
  • 81