4

I have a data frame df:

df =

index  date        hats
A1     01-01-2020  5
A1     02-01-2020  10
A1     03-01-2020  16
A1     04-01-2020  16
A1     21-01-2020  9
A1     22-01-2020  8
A1     23-01-2020  7
A6     20-03-2020  5
A6     21-03-2020  5
A8     30-07-2020  12

Here, the first four rows are consecutive days. I want to know the start date and end date of all such consecutive days in the data frame. If there is only one day in a series like wise A8 index in the df then the start date and and end will be same. Moreover, I am also interested in knowing the highest value in df['hats'] column in the series of consecutive days and return its date in a seperate column high_hat along with its date high_hat_date. If there are two or more equal high values in a series of consecutive days then write the number of occurrence of high value in a new column num_hat, and write the first occurrence date in high_hat_date.

The example output for the above data frame is as follows:

index   start_date    end_date    high_hat    high_hat_date   num_hat
A1      01-01-2020    04-01-2020  16          03-01-2020      2
A1      21-01-2020    23-01-2020  9           21-01-2020      1
A6      20-03-2020    21-03-2020  5           20-03-2020      2
A8      30-07-2020    30-07-2020  12          30-07-2020      1     

Any help in this regard is highly appreciated.

Shubham Sharma
  • 68,127
  • 6
  • 24
  • 53
Gun
  • 556
  • 6
  • 21

2 Answers2

5

First using pd.to_datetime convert the date column to pandas datetime series:

df['date'] = pd.to_datetime(df['date'], dayfirst=True)

Then use:

g = df.groupby('index')['date'].diff().dt.days.ne(1).cumsum() # STEP A
m = df.groupby(['index', g])['hats'].transform('max').eq(df['hats']) # STEP B

df = df.assign(high_hats=df['hats'].mask(~m), high_date=df['date'].mask(~m)) # STEP C

dct = {'start_date': ('date', 'first'), 'end_date': ('date', 'last'), 'high_hat': ('hats', 'max'),
       'high_hat_date': ('high_date', 'first'), 'num_hats': ('high_hats', 'count')}
df1 = df.groupby(['index', g]).agg(**dct).reset_index().drop('date', 1) # STEP D

Details:

STEP A: Use DataFrame.groupby on index and use groupby.diff on date to calculate the the days elapsed between successive dates then use Series.dt.days + Series.ne along with Series.cumsum to create a grouping series g which will be needed to group the dataframe on consecutive dates.

# print(g)
0    1
1    1
2    1
3    1
4    2
5    2
6    2
7    3
8    3
9    4
Name: date, dtype: int64

STEP B: Use DataFrame.groupby on index and g and use groupby.transform to transform the column hats using max then using Series.eq equate it with hats column to create a boolean mask m.

# print(m)
0    False
1    False
2     True
3     True
4     True
5    False
6    False
7     True
8     True
9     True
Name: hats, dtype: bool

STEP C: Next use DataFrame.assign to assign two new columns high_hats and high_date which will be used in STEP D for calculation of high_hat_date and num_hats.

# print(df)    
  index       date  hats  high_hats  high_date
0    A1 2020-01-01     5        NaN        NaT
1    A1 2020-01-02    10        NaN        NaT
2    A1 2020-01-03    16       16.0 2020-01-03
3    A1 2020-01-04    16       16.0 2020-01-04
4    A1 2020-01-21     9        9.0 2020-01-21
5    A1 2020-01-22     8        NaN        NaT
6    A1 2020-01-23     7        NaN        NaT
7    A6 2020-03-20     5        5.0 2020-03-20
8    A6 2020-03-21     5        5.0 2020-03-21
9    A8 2020-07-30    12       12.0 2020-07-30

STEP D: Using DataFrame.groupby on index and g and aggregate the dataframe using the aggregation dictionary dct which contains all the columns and their corresponding agg functions to be applied.

# print(df1)
  index start_date   end_date  high_hat high_hat_date  num_hats
0    A1 2020-01-01 2020-01-04        16    2020-01-03         2
1    A1 2020-01-21 2020-01-23         9    2020-01-21         1
2    A6 2020-03-20 2020-03-21         5    2020-03-20         2
3    A8 2020-07-30 2020-07-30        12    2020-07-30         1
Shubham Sharma
  • 68,127
  • 6
  • 24
  • 53
  • 2
    @ShubhamSharma great code and guide, particularly going to the lengths to hyperlink to the documentation. This would be a good problem and answer for people to study. I need to study this myself :) – David Erickson Jul 14 '20 at 07:36
  • Thanks @DavidErickson yes its a good conceptual problem. – Shubham Sharma Jul 14 '20 at 07:39
2

This required a lot of data transformation:

  1. use np.where(), .shift and .groupby + .transform to create some intermediary columns to calculate the data ranges of the groups.
  2. Create an intermediary dataframe df2 to calculate some more metrics including high_hat_date and num_hat. These calculations require a focus on the max values (i.e. high hat), so it was easier to create this new dataframe this way.
  3. Merge df2 back into df1, take only the required columns and drop duplicate rows

code:

import pandas as pd, numpy as np
df1=df.copy()
df1['date'] = pd.to_datetime(df1['date'], dayfirst=True)
df1['date_diff'] = df1['date'] - df1.shift()['date']
df1['date_first'] = ''
df1['date_first'] = np.where((df1['date_diff'].isnull()) |
                             ((df1['date_diff'] != '1 days') & (df1.shift()['date_diff'] == '1 days')),
                              'start_date', df1['date_first'])
df1['date_first'] = np.where((df1['date_diff'] == '1 days') & (df1.shift(-1)['date_diff'] != '1 days'),
                             'end_date', df1['date_first'])
df1['date_group'] = df1.groupby(df1['date_first'])['date_first'].transform('cumcount')
df1['date_group2'] = df1.groupby(df1['date_first'])['date_group'].transform('cumsum').replace(0,np.nan).ffill().astype(int)
df1['start_date'] = df1.groupby('date_group2')['date'].transform('min')
df1['end_date'] = df1.groupby('date_group2')['date'].transform('max')
df1['high_hat'] = df1.groupby(df1['date_group2'])['hats'].transform('max')
df2 = df1.loc[df1['high_hat'] == df1['hats']]
df2['high_hat_date'] = df2.groupby('date_group2')['date'].transform('first')
df2['num_hat'] = df2.groupby('date_group2')['hats'].transform('count')
df2 = df2.drop_duplicates(subset='date_group2')
df1 = pd.merge(df1, df2[['date_group2', 'high_hat_date', 'num_hat']], how='outer', on=['date_group2'])
df1 = df1[['index', 'start_date', 'end_date', 'high_hat', 'high_hat_date', 'num_hat']].drop_duplicates()
df1

output:

    index   start_date  end_date    high_hat    high_hat_date   num_hat
0   A1      2020-01-01  2020-01-04  16          2020-01-03      2
4   A1      2020-01-21  2020-01-23  9           2020-01-21      1
7   A6      2020-03-20  2020-03-21  5           2020-03-20      2
9   A8      2020-07-30  2020-07-30  12         2020-07-30       1
David Erickson
  • 16,433
  • 2
  • 19
  • 35