
I have a pandas dataframe like this:

       start        end  value  course
0 2022-01-01 2022-01-01     10   first
1 2022-01-02 2022-01-02     20   first
2 2022-01-05 2022-01-05     30  second
3 2022-01-04 2022-01-04     40  second
4 2022-01-08 2022-01-08     21   first
5 2022-01-09 2022-01-09     92   first
6 2022-01-10 2022-01-10     55   first

What's the best way to group it like this:

       start        end  value  course
0 2022-01-01 2022-01-02     10   first
1 2022-01-04 2022-01-05     30  second
2 2022-01-08 2022-01-10     21   first

There might be more rows for a particular course, but the idea is to group first by the course, and second by one continuous date range. Or maybe it's worth trying to split on missing dates? The closest case is this one; however, it didn't help, since I don't have info about the frequency of dates to pass into pd.Grouper(), and I also need to keep the start column.

BigBen

3 Answers


You can create a virtual subgroup:

# Convert to datetime if necessary
# df['start'] = pd.to_datetime(df['start'])
# df['end'] = pd.to_datetime(df['end'])

# A new run starts when the gap between the current start and the
# previous end (within each course) is not exactly one day
is_consecutive = lambda x: x['start'].sub(x['end'].shift()).ne('1D')
df['group'] = (df.sort_values(['start', 'end'])
                 .groupby('course').apply(is_consecutive)
                 .cumsum().droplevel('course'))
print(df)

# Output
       start        end  value  course  group
0 2022-01-01 2022-01-01     10   first      1
1 2022-01-02 2022-01-02     20   first      1
2 2022-01-05 2022-01-05     30  second      3
3 2022-01-04 2022-01-04     40  second      3
4 2022-01-08 2022-01-08     21   first      2
5 2022-01-09 2022-01-09     92   first      2
6 2022-01-10 2022-01-10     55   first      2

Now you can do what you want with these groups.
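For example, to get from these labels to the output shown in the question, you can aggregate over them. A minimal end-to-end sketch on the sample data (which value to keep per run isn't fully specified by the question; taking the first row in the original order happens to match the expected result):

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "start": pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-05",
                             "2022-01-04", "2022-01-08", "2022-01-09",
                             "2022-01-10"]),
    "end": pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-05",
                           "2022-01-04", "2022-01-08", "2022-01-09",
                           "2022-01-10"]),
    "value": [10, 20, 30, 40, 21, 92, 55],
    "course": ["first", "first", "second", "second",
               "first", "first", "first"],
})

# Label each run of consecutive dates per course (same idea as above)
is_consecutive = lambda x: x["start"].sub(x["end"].shift()).ne(pd.Timedelta("1D"))
df["group"] = (df.sort_values(["start", "end"])
                 .groupby("course").apply(is_consecutive)
                 .cumsum().droplevel("course"))

# Collapse each run into a single row
out = (df.groupby("group", as_index=False)
         .agg(start=("start", "min"), end=("end", "max"),
              value=("value", "first"), course=("course", "first"))
         .drop(columns="group")
         .sort_values("start", ignore_index=True))
print(out)
```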

Corralien

A possible solution:

df["start"] = pd.to_datetime(df["start"])
df["end"]   = pd.to_datetime(df["end"])

# "min" on value reproduces the expected output
aggs = {"start": "first", "end": "last", "value": "min"}

out = (
        df
         .sort_values(by=["course", "start"])
         # a run breaks when the current start is not exactly one day
         # after the previous end within the same course (the NaT on
         # each course's first row also counts as a break)
         .assign(prev_ = lambda x: x.groupby("course")["end"].shift(),
                 group = lambda x: (x["start"]
                                    != x["prev_"] + pd.Timedelta(days=1)).cumsum())
         .groupby(["course", "group"], as_index=False).agg(aggs)
         .sort_values(by="start", ignore_index=True).drop(columns="group")
          [["start", "end", "value", "course"]]
      )

Output:

print(out)

       start        end  value  course
0 2022-01-01 2022-01-02     10   first
1 2022-01-04 2022-01-05     30  second
2 2022-01-08 2022-01-10     21   first
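To see how the run labels come out, here is a small self-contained check on the sample data (a sketch; the label breaks whenever a start is not one day after the per-course previous end):

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "start": pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-05",
                             "2022-01-04", "2022-01-08", "2022-01-09",
                             "2022-01-10"]),
    "end": pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-05",
                           "2022-01-04", "2022-01-08", "2022-01-09",
                           "2022-01-10"]),
    "course": ["first", "first", "second", "second",
               "first", "first", "first"],
})

# Attach the run label: a new run whenever the current start is not
# exactly one day after the previous end within the same course
labeled = (df.sort_values(["course", "start"])
             .assign(prev_=lambda x: x.groupby("course")["end"].shift(),
                     group=lambda x: (x["start"]
                                      != x["prev_"] + pd.Timedelta(days=1))
                                      .cumsum()))
print(labeled[["course", "start", "end", "group"]])
```

Rows within one run share a label, so `groupby(["course", "group"])` can then collapse each run to a single row.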
Timeless

A cordial greeting, I hope my answer is of help to you. With the current dataframe columns alone there is no way to group the information as you want; however, it is possible to work around this problem.

What you need is to create an ID that identifies each group; with this ID you can then aggregate the rows of each group.

This idea is valid only under the assumption that the rows of each group are contiguous and that the order of the dates is not important. In your example, the rows at indices 2 and 3 are not ordered by the 'start' column, unlike the rows at indices 0 and 1 or at indices 4, 5 and 6.

The following line of code creates a unique id for each group: each row's course is compared with the row above, returning True when they match and False when they differ. Taking the cumulative sum of the negated boolean series assigns each group a unique id, which auto-increments whenever the course changes.

df['group_id'] = (~df.course.eq(df.course.shift(1))).cumsum() 
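As a quick illustration of the trick on the question's course column:

```python
import pandas as pd

course = pd.Series(["first", "first", "second", "second",
                    "first", "first", "first"])

# True where the course changes from the previous row; cumsum turns
# those change points into an incrementing run id
group_id = (~course.eq(course.shift(1))).cumsum()
print(group_id.tolist())  # [1, 1, 2, 2, 3, 3, 3]
```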

With this solved, you can aggregate the groups normally with the following code.

df.groupby(by=['group_id']).agg(
                             start=('start','min'),
                             end=('end','max'),
                             value=('value','min'),
                             course=('course','first')
                         ).reset_index().drop(['group_id'], axis=1)
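Put together on the sample data, a runnable sketch of this approach (note it relies only on the course runs being contiguous, not on the dates themselves):

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "start": pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-05",
                             "2022-01-04", "2022-01-08", "2022-01-09",
                             "2022-01-10"]),
    "end": pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-05",
                           "2022-01-04", "2022-01-08", "2022-01-09",
                           "2022-01-10"]),
    "value": [10, 20, 30, 40, 21, 92, 55],
    "course": ["first", "first", "second", "second",
               "first", "first", "first"],
})

# Run id increments whenever the course differs from the previous row
df["group_id"] = (~df.course.eq(df.course.shift(1))).cumsum()

# min/max on the dates tolerates unsorted rows within a group
out = (df.groupby(by=["group_id"])
         .agg(start=("start", "min"),
              end=("end", "max"),
              value=("value", "min"),
              course=("course", "first"))
         .reset_index()
         .drop(["group_id"], axis=1))
print(out)
```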

Greetings.