Say I have a dataframe that looks like the following:

                     usage_price
2017-04-01 00:00:00            1
2017-04-01 00:30:00            1
2017-04-01 01:00:00            1
2017-04-01 01:30:00            1
2017-04-01 02:00:00            1
...                          ...
2018-12-31 22:00:00            1
2018-12-31 22:30:00            1
2018-12-31 23:00:00            1
2018-12-31 23:30:00            1

What I want to do is update usage_price for particular rows. In my case, I want to update based on this object:

{'day': '1', 'timerange': ['01 01 00:00', '31 12 08:00']}

Which is:

  • Update all Mondays ('day': '1')
  • Between the times 00:00 and 08:00
  • For any Monday between 01-01 (Jan 1st), and 31-12 (Dec 31st) (ignoring year)

I know how to do some of these things separately:

  • df_timeseries.loc[df_timeseries.index.weekday == 0, 'usage_price']
  • df_timeseries.loc[df_timeseries.between_time('00:00', '08:00', include_end=False).index,'usage_price']

But I'm a little stuck on how to get rows between dates (ignoring years), and how to combine everything together - so help would be greatly appreciated!

Edit: This is how far I've managed to get, but I can't get it to work (I'm getting syntax errors). I don't think I'm building the mask additively in the correct way:

def _create_mask_from_tpr(self, df: pd.DataFrame, tpr: Dict[str, Union[str, List[str]]]) -> Tuple:
    print(tpr)
    weekday = int(tpr['day']) - 1 # Offset.
    start_day, start_month, start_time = tpr['timerange'][0].split(" ")
    end_day, end_month, end_time = tpr['timerange'][1].split(" ")
    start_year, end_year = df.index.min().year, df.index.max().year
    selection_weekday = (df.index.weekday == weekday)
    selection_time = (df.between_time(start_time, end_time))

    selection_date = None
    for year in range(start_year, end_year + 1):
        start_date = pd.to_datetime("{}-{}-{}".format(year, start_month, start_day))
        end_date = pd.to_datetime("{}-{}-{}".format(year, end_month, end_day))
        selection_date = selection_date | (df.index.date() >= start_date & df.index.date() <= end_date)
    mask = (selection_weekday & selection_time & selection_date)
    print(mask)
Darkstarone

2 Answers

Untested, but something along the following lines could work:

import datetime

selection = ((df_timeseries.index.weekday == 0) &
             (df_timeseries.index.time >= datetime.time(0, 0)) &
             (df_timeseries.index.time < datetime.time(8, 0)))  # 08:00 itself excluded
result = df_timeseries.loc[selection, 'usage_price']

Usually, you can combine comparisons with | or & operators (but use parentheses). Since the start and end dates include the full year, I didn't filter on that.
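
To illustrate why the parentheses matter, here is a minimal sketch on a made-up one-week index (`df_demo` and its dates are assumptions for illustration, not the asker's data). In Python, `&` binds more tightly than `==` and `<`, so each comparison needs its own parentheses before the masks are combined:

```python
import pandas as pd

# Hypothetical half-hourly index covering one week (2018-01-01 is a Monday).
idx = pd.date_range("2018-01-01", "2018-01-07 23:30", freq="30min")
df_demo = pd.DataFrame({"usage_price": 1}, index=idx)

# Each comparison yields a boolean array; parenthesise before combining,
# because & has higher precedence than == and <.
mask = (df_demo.index.weekday == 0) & (df_demo.index.hour < 8)

# Use .loc with the boolean mask and the column label.
monday_mornings = df_demo.loc[mask, "usage_price"]
```

Here `monday_mornings` holds only the Monday rows from 00:00 up to and including 07:30.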

If you want to select on the date without specifying a year, you'd run into problems when doing e.g.:

selection = ((df_timeseries.index.day >= 5) &
             (df_timeseries.index.day <= 20) &
             (df_timeseries.index.month >= 2) &
             (df_timeseries.index.month <= 3))

since you would now miss the end of February (days > 20) and the start of March (days < 5).

Using df_timeseries.index.dayofyear instead could work, except during a leap year: you'd miss a day at the end of your date-span.
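
A small check of the leap-year shift, using March 1st as an arbitrary example date: the same calendar date gets a different `dayofyear` depending on whether February had 29 days.

```python
import pandas as pd

# March 1st falls on a different day-of-year in leap and non-leap years.
doy_leap = pd.Timestamp("2016-03-01").dayofyear     # 2016 is a leap year
doy_regular = pd.Timestamp("2017-03-01").dayofyear  # 2017 is not

print(doy_leap, doy_regular)  # 61 60
```

So a fixed `dayofyear` window that is right for a regular year is off by one day for every date after February in a leap year.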

I don't know an easy way to filter on a date-range while ignoring the year. You may have to create a loop through the years of interest, and compare the full year-month-day range for each year, combining each subselection with |. This also serves as another example of chaining more complex selections using | and &:

start = '02-05'
end = '03-02'
subsel = np.zeros(len(df_timeseries), dtype=bool)  # include no dates by default
years = np.arange(2018, 2050)
for year in years:
    startdate = pd.to_datetime(str(year) + '-' + start).date()
    enddate = pd.to_datetime(str(year) + '-' + end).date()
    # Parenthesise each comparison: & binds more tightly than >= and <=
    subsel = subsel | ((df_timeseries.index.date >= startdate) &
                       (df_timeseries.index.date <= enddate))
selection = selection & subsel
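
As a possible alternative to the loop (a sketch of a different technique, not something tested against the asker's data), month and day can be folded into a single integer so that one comparison covers every year. The `month * 100 + day` encoding and the daily index below are assumptions made for illustration:

```python
import pandas as pd

# Hypothetical daily index spanning two years.
idx = pd.date_range("2018-01-01", "2019-12-31", freq="D")

# Encode month and day as one integer, e.g. 5 February -> 205, 2 March -> 302.
month_day = idx.month * 100 + idx.day

# Year-independent window from 5 February to 2 March, both inclusive.
in_window = (month_day >= 205) & (month_day <= 302)
selected = idx[in_window]
```

A 29 February encodes as 229 and so falls inside this window in leap years. A window that wraps around the new year (say December to January) would need the two comparisons combined with | instead of &.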
9769953
  • I've attempted to do this, but I'm not sure the complex chaining is working. I've added my attempt to my question, but I get `selection_date = selection_date | (df.index.date() >= start_date & df.index.date() <= end_date) TypeError: 'numpy.ndarray' object is not callable` – Darkstarone Sep 27 '18 at 12:12
  • @Darkstarone Perhaps reading through https://jakevdp.github.io/PythonDataScienceHandbook/02.06-boolean-arrays-and-masks.html may help, especially the "Boolean operators" section. It is focused on NumPy, but it's nearly the same most of the time with Pandas. – 9769953 Sep 27 '18 at 12:25

Final solution:

def _create_mask_from_tpr(self, df: pd.DataFrame, tpr: Dict[str, Union[str, List[str]]]) -> List[bool]:
    weekday = int(tpr['day']) - 1 # Offset.
    start_day, start_month, start_time = tpr['timerange'][0].split(" ")
    end_day, end_month, end_time = tpr['timerange'][1].split(" ")
    start_year, end_year = df.index.min().year, df.index.max().year
    selection_weekday = (df.index.weekday == weekday)

    start_time = datetime.datetime.strptime(start_time, '%H:%M').time()
    end_time = datetime.datetime.strptime(end_time, '%H:%M').time()
    selection_time = ((df.index.time >= start_time) & (df.index.time <= end_time))

    selection_date = None
    days = df.index.normalize()  # midnight of each row's date, so end_date covers its whole day
    for year in range(start_year, end_year + 1):
        start_date = pd.Timestamp("{}-{}-{}".format(year, start_month, start_day))
        end_date = pd.Timestamp("{}-{}-{}".format(year, end_month, end_day))
        if selection_date is not None:  # truth-testing a NumPy array raises ValueError
            selection_date = selection_date | ((days >= start_date) & (days <= end_date))
        else:
            selection_date = ((days >= start_date) & (days <= end_date))
    return (selection_weekday & selection_time & selection_date)
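
For completeness, a self-contained sketch of how such a mask could be used to do the actual update. `create_mask` is a hypothetical standalone variant of the method above, the new price of 2 is arbitrary, and comparing normalized dates (so that rows on the end date itself are included) is my assumption about the intended behaviour:

```python
import datetime

import numpy as np
import pandas as pd


def create_mask(df, tpr):
    # Standalone variant of the method above; the name is hypothetical.
    weekday = int(tpr["day"]) - 1  # the rule uses 1 = Monday, pandas uses 0
    start_day, start_month, start_time = tpr["timerange"][0].split(" ")
    end_day, end_month, end_time = tpr["timerange"][1].split(" ")
    start_time = datetime.datetime.strptime(start_time, "%H:%M").time()
    end_time = datetime.datetime.strptime(end_time, "%H:%M").time()

    on_weekday = df.index.weekday == weekday
    in_time = (df.index.time >= start_time) & (df.index.time <= end_time)

    days = df.index.normalize()  # drop the time of day before comparing dates
    in_dates = np.zeros(len(df), dtype=bool)  # select nothing by default
    for year in range(df.index.min().year, df.index.max().year + 1):
        start_date = pd.Timestamp(year=year, month=int(start_month), day=int(start_day))
        end_date = pd.Timestamp(year=year, month=int(end_month), day=int(end_day))
        in_dates |= (days >= start_date) & (days <= end_date)
    return on_weekday & in_time & in_dates


# Rebuild a frame shaped like the one in the question.
index = pd.date_range("2017-04-01 00:00", "2018-12-31 23:30", freq="30min")
df_timeseries = pd.DataFrame({"usage_price": 1}, index=index)

tpr = {"day": "1", "timerange": ["01 01 00:00", "31 12 08:00"]}
mask = create_mask(df_timeseries, tpr)

# Assign through .loc so the original frame is modified in place.
df_timeseries.loc[mask, "usage_price"] = 2  # 2 is an arbitrary new price
```

Note that, like the method above, this treats the end time as inclusive, so the 08:00 row on each Monday is updated as well.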
Darkstarone