
I have a DataFrame "timeseries" with datetimes as its index, and a PeriodIndex "on":

import numpy as np
import pandas as pd


timeseries = pd.DataFrame(
        index=pd.DatetimeIndex(
            [
                "2000-01-01 12:00:00Z",
                "2000-01-01 13:00:00Z",
                "2000-01-01 14:00:00Z",
                "2000-01-02 13:00:00Z",
                "2000-01-02 18:00:00Z",
                "2000-01-03 14:00:00Z",
                "2000-01-03 20:00:00Z",
                "2000-01-04 13:00:00Z",
            ]
        ),
        data={
            "value1": [6.0, 5.0, 3.0, 7.0, 4.0, 4.0, 5.0, 3.0],
        },
    )
on = pd.PeriodIndex(
    ["2000-01-01", "2000-01-02", "2000-01-04", "2000-01-05"], freq="D"
    )

I would like to add a column to "timeseries" that contains the period in "on" that each respective datetime is in:

                           value1      period
2000-01-01 12:00:00+00:00     6.0  2000-01-01
2000-01-01 13:00:00+00:00     5.0  2000-01-01
2000-01-01 14:00:00+00:00     3.0  2000-01-01
2000-01-02 13:00:00+00:00     7.0  2000-01-02
2000-01-02 18:00:00+00:00     4.0  2000-01-02
2000-01-03 14:00:00+00:00     4.0         NaN
2000-01-03 20:00:00+00:00     5.0         NaN
2000-01-04 13:00:00+00:00     3.0  2000-01-04

So far I have achieved this with a for-loop:

    timeseries["period"] = np.nan
    for period in on:
        datetimes_in_period = timeseries.index[
            (timeseries.index >= period.start_time.tz_localize("UTC"))
            & (timeseries.index <= period.end_time.tz_localize("UTC"))
        ]
        timeseries.loc[datetimes_in_period, "period"] = period

For efficiency's sake I want to avoid loops in Python. How can I vectorize this code?

James King
  • It looks like you may be trying to merge on the day? e.g. `on.to_frame().merge(timeseries.reset_index(), left_on=on.day, right_on=timeseries.index.day, how='left')` - from there you could groupby. – jqurious Apr 17 '23 at 09:51
  • No, the periods are arbitrary – James King Apr 17 '23 at 11:31
  • It is unclear what you mean by "period". Why exactly are the rows timestamped `2000-01-03 14:00:00+00:00` and `2000-01-03 20:00:00+00:00` not part of any period if a period is not just a day? If I go by that logic I'd say the two rows are part of the period `["2000-01-02", "2000-01-04"]` – rorshan Apr 17 '23 at 13:02
  • "period" refers to [pandas.Period](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Period.html). The period `pd.Period('2000-1-1', freq='D')` is the day "first of January 2000". It has a start time of `Timestamp('2000-01-01 00:00:00')` and an end time of `Timestamp('2000-01-01 23:59:59.999999999')`. – James King Apr 17 '23 at 13:12
  • The two timestamps you mentioned are not part of a period as March isn't in `on` in the example data – James King Apr 17 '23 at 13:15

2 Answers


Note: This answer originally referred to a question which has since been edited to change the intent

Here's my solution:

import pandas as pd


timeseries = pd.DataFrame(
        index=pd.DatetimeIndex(
            [
                "2000-01-01 12:00:00Z",
                "2000-01-01 13:00:00Z",
                "2000-01-01 14:00:00Z",
                "2000-01-02 13:00:00Z",
                "2000-01-02 18:00:00Z",
                "2000-01-03 14:00:00Z",
                "2000-01-03 20:00:00Z",
                "2000-01-04 13:00:00Z",
            ]
        ),
        data={
            "value1": [6.0, 5.0, 3.0, 7.0, 4.0, 4.0, 5.0, 3.0],
        },
    )
on = pd.PeriodIndex(
    ["2000-01-01", "2000-01-02", "2000-01-04", "2000-01-05"], freq="D"
    )


merge = (pd.merge_asof(timeseries.index.to_frame(),
                    on.to_timestamp().to_frame(),
                    right_index=True, left_index=True)
                    .drop('0_x', axis=1)
                    .reset_index()
                    .rename({'0_y':'date', 'index':'period'}, axis=1)
        )

# extract from `on` any date that has no matching date in `timeseries`
unmatched_periods = on.to_timestamp().difference(merge.date).to_frame()
unmatched_periods[0] = pd.NaT

merge = merge.groupby('date').agg(func=lambda x: list(x))
unmatched_periods.columns = merge.columns
merge = pd.concat((merge, unmatched_periods))
merge

I had never used PeriodIndex before, and was forced to convert it to a DatetimeIndex with to_timestamp. From the documentation, PeriodIndex seems intended for creating dates/periods programmatically (e.g. every X days between two dates), which doesn't seem to be quite what it's being used for here.

Anyway, the core of the solution is merge_asof, which is like merge but, instead of requiring equal keys, looks for the closest key. By default it looks backwards, which is what we want here (the closest date in `on` that is on or before the date in `timeseries`).
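The backward matching can be seen in a tiny standalone example (made-up data, not the question's):

```python
import pandas as pd

# Both key columns must be sorted for merge_asof.
left = pd.DataFrame({"t": pd.to_datetime(["2000-01-01 12:00", "2000-01-03 14:00"])})
right = pd.DataFrame(
    {"t": pd.to_datetime(["2000-01-01", "2000-01-02"]), "label": ["a", "b"]}
)

# Each left row is matched to the closest right key at or before it
# (direction="backward" is the default).
out = pd.merge_asof(left, right, on="t")
print(out)
# 2000-01-01 12:00 -> "a", 2000-01-03 14:00 -> "b"
```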

Then we use groupby and agg to get the groups.

We also need to get the dates in `on` that did not have any match in `timeseries` (in this case 2000-01-05).

Note: you say you want to "avoid" loops for efficiency. In theory that's a good idea, but be careful: the result you're trying to achieve (lists as values in a column) is itself quite inefficient, on top of pandas groupby also being fairly demanding.

rorshan
  • Thank you for your answer! This works. I've been thinking about what you wrote as a note and an alternative would be to add a column in timeseries that contains the period that each respective datetime is in. – James King Apr 17 '23 at 12:05
  • I have updated the question, but will try to find an approach similar to yours for the updated question. – James King Apr 17 '23 at 12:19

You can still use .merge_asof as @rorshan suggested.

If you create a dataframe of the start/end intervals:

df_on = pd.DataFrame({
   "period":     on,
   "start_time": on.start_time.tz_localize("UTC"), 
   "end_time":   on.end_time.tz_localize("UTC"),
})

df = pd.merge_asof(timeseries, df_on, left_index=True, right_on="start_time")

# blank out period when not inside
df["period"] = df["period"].where((df.index >= df["start_time"]) & (df.index <= df["end_time"]))
>>> df[["value1", "period"]]
                           value1      period
2000-01-01 12:00:00+00:00     6.0  2000-01-01
2000-01-01 13:00:00+00:00     5.0  2000-01-01
2000-01-01 14:00:00+00:00     3.0  2000-01-01
2000-01-02 13:00:00+00:00     7.0  2000-01-02
2000-01-02 18:00:00+00:00     4.0  2000-01-02
2000-01-03 14:00:00+00:00     4.0         NaT
2000-01-03 20:00:00+00:00     5.0         NaT
2000-01-04 13:00:00+00:00     3.0  2000-01-04
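For reference, the same "which period contains this timestamp" lookup can also be sketched without a merge, using searchsorted on the period start times (a rough sketch with made-up sample data, not code from either answer):

```python
import numpy as np
import pandas as pd

idx = pd.DatetimeIndex(
    ["2000-01-01 12:00Z", "2000-01-03 14:00Z", "2000-01-04 13:00Z"]
)
on = pd.PeriodIndex(
    ["2000-01-01", "2000-01-02", "2000-01-04", "2000-01-05"], freq="D"
)

starts = on.start_time.tz_localize("UTC")
ends = on.end_time.tz_localize("UTC")

# Index of the last period whose start is <= each timestamp.
pos = starts.searchsorted(idx, side="right") - 1
# A match is only valid if the timestamp also falls before that period's end.
inside = (pos >= 0) & (idx <= ends[pos.clip(0)])

period = pd.Series(on[pos.clip(0)], index=idx)
period[~inside] = pd.NaT
```

This assumes the periods in `on` are sorted and non-overlapping, which holds for the example data.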
jqurious
  • Thank you, this is what I was looking for. I had tried around with merge_asof but couldn't get the blanking out to work. – James King Apr 18 '23 at 06:38