-1

completely new here, I tried looking up my problem but couldn't find anything quite similar!

I'm trying to set up a dataframe that contains the data for a schedule and its activity types. For example, if it's '1' it's a normal activity, and if it's '2' it's canceled, and compare that dataframe to another one to see if there is a date between the start/end date in the first dataframe, and if so, modify it so that it becomes 3 rows instead of 1, having the first Start/End date row until that said holiday, the holiday date row, and then the Start/End date continuing after the holiday.

I have no problem creating a single data frame, however my problem arises when I want to compare another series/data frame and potentially add rows that could be between said StartDate and EndDate.

Example Schedule dataframe

Start Date End Date Activity Type
2021-01-01 2021-12-31 1

When compared to the other dataframe

Holiday Start Date Holiday End Date
2021-02-14 2021-02-14
2021-07-04 2021-07-05

Ending up like this:

Start Date End Date Activity Type
2021-01-01 2021-02-13 1
2021-02-14 2021-02-14 2
2021-02-15 2021-07-03 1
2021-07-04 2021-07-04 2
2021-07-05 2021-12-31 1

Any help is appreciated!

Thanks, S.

  • What is the meaning of *Activity Type == 2*? Your *Schedule dataframe* does not contain *Activity Type == 2*, so maybe it actuaally means "*Activity Type == 2* during a holiday"? – Valdi_Bo Jul 02 '21 at 13:59
  • Activity type 2 would mean that's it's off for that day, so yes during a holiday it would be type 2. – Samsonov22 Jul 02 '21 at 14:22

1 Answers1

0

To present a more instructive example, I created Schedule as containing multiple rows:

  Start Date   End Date  Activity Type
0 2021-01-01 2021-05-31             10
1 2021-06-01 2021-12-31             20

I created Holidays as:

  Holiday Start Date Holiday End Date
0         2021-02-14       2021-02-14
1         2021-03-10       2021-03-12
2         2021-07-04       2021-07-06

All date columns are of datetime64 type.

A preparatory step is to create an IntervalIndex from Holidays:

ind = pd.IntervalIndex.from_arrays(Holidays['Holiday Start Date'],
    Holidays['Holiday End Date'], closed='both')

To get the result from a single row, create the following function:

def getActivities(row):
    dd = pd.date_range(row['Start Date'], row['End Date'])
    ss = dd.to_series().apply(lambda dat: ind.contains(dat).any())
    s1 = ss[ss != ss.shift()]
    s2 = ss[ss != ss.shift(-1)]
    s1 = s1.astype(int) + row['Activity Type']
    rv = s1.astype(int).reset_index().rename(columns={'index': 'Start Date',
        0: 'Activity Type'})
    rv.insert(1, 'End Date', s2.index)
    return rv

To test this function you can call it on a single row, say, the initial row:

getActivities(Schedule.iloc[0])

To understand fully all details, save a single row of Schedule under a variable:

row = Schedule.iloc[0]

Then execute each instruction from getActivities and see the intermediate results.

And to get the expected result for all rows, you have to concatenate results of application of this function to each row:

pd.concat(Schedule.apply(getActivities, axis=1).values, ignore_index=True)

For my test data, the result is:

  Start Date   End Date  Activity Type
0 2021-01-01 2021-02-13             10
1 2021-02-14 2021-02-14             11
2 2021-02-15 2021-03-09             10
3 2021-03-10 2021-03-12             11
4 2021-03-13 2021-05-31             10
5 2021-06-01 2021-07-03             20
6 2021-07-04 2021-07-06             21
7 2021-07-07 2021-12-31             20

Fist 5 rows are from row 0 of Schedule, with 2 holiday periods. Last 3 rows are from row 1, with 1 holiday period.

Note that Activity Type is either the original value (for "normal" period) or the original value + 1 (for a holiday period), so Schedule should not contain consecutive values as Activity Type.

Valdi_Bo
  • 30,023
  • 4
  • 23
  • 41