1

There is a data frame which includes a column of record number (ascending order) and a column of weekdays. The plan is to extract the first and last record numbers of each day. For example:

df = pd.DataFrame({'records': [1, 2, 3, 4, 6, 7, 8, 12, 14, 15, 16, 19, 23, 26, 29, 38, 43, 59, 61],
                   'weekday': ['Monday', 'Monday', 'Monday', 'Tuesday', 'Tuesday', 'Wednesday', 'Thursday',
                               'Thursday', 'Thursday', 'Friday', 'Friday', 'Friday', 'Saturday', 'Sunday',
                               'Monday', 'Monday', 'Tuesday', 'Wednesday', 'Wednesday']})
>>> df

    records    weekday
0         1     Monday
1         2     Monday
2         3     Monday
3         4    Tuesday
4         6    Tuesday
5         7  Wednesday
6         8   Thursday
7        12   Thursday
8        14   Thursday
9        15     Friday
10       16     Friday
11       19     Friday
12       23   Saturday
13       26     Sunday
14       29     Monday
15       38     Monday
16       43    Tuesday
17       59  Wednesday
18       61  Wednesday

I am trying to get something like this:

    first  last  records    weekday
0       1     3        1     Monday
1       1     3        2     Monday
2       1     3        3     Monday
3       4     6        4    Tuesday
4       4     6        6    Tuesday
5       7     7        7  Wednesday
6       8    14        8   Thursday
7       8    14       12   Thursday
8       8    14       14   Thursday
9      15    19       15     Friday
10     15    19       16     Friday
11     15    19       19     Friday
12     23    23       23   Saturday
13     26    26       26     Sunday
14     29    38       29     Monday
15     29    38       38     Monday
16     43    43       43    Tuesday
17     59    61       59  Wednesday
18     59    61       61  Wednesday

So where can I start? Is it the right direction to iterate the weekday column from the top to the bottom while monitoring any change?

Alexander
  • 105,104
  • 32
  • 201
  • 196
user2165
  • 1,951
  • 3
  • 20
  • 39

1 Answers1

1

Using the compare-cumsum-groupby pattern:

df['first'] = (df
               .groupby((df.weekday != df.weekday.shift()).cumsum())
               .records
               .transform('first'))

df['last'] = (df
              .groupby((df.weekday != df.weekday.shift()).cumsum())
              .records
              .transform('last'))    
>>> df
    records    weekday  first  last
0         1     Monday      1     3
1         2     Monday      1     3
2         3     Monday      1     3
3         4    Tuesday      4     6
4         6    Tuesday      4     6
5         7  Wednesday      7     7
6         8   Thursday      8    14
7        12   Thursday      8    14
8        14   Thursday      8    14
9        15     Friday     15    19
10       16     Friday     15    19
11       19     Friday     15    19
12       23   Saturday     23    23
13       26     Sunday     26    26
14       29     Monday     29    38
15       38     Monday     29    38
16       43    Tuesday     43    43
17       59  Wednesday     59    61
18       61  Wednesday     59    61

The trick is to get unique indexes for each weekday (not just 1-7, but incrementing by one each time there is a new weekday).

df['week_counter'] = (df.weekday != df.weekday.shift()).cumsum()
>>> df
    records    weekday  week_counter
0         1     Monday             1
1         2     Monday             1
2         3     Monday             1
3         4    Tuesday             2
4         6    Tuesday             2
5         7  Wednesday             3
6         8   Thursday             4
7        12   Thursday             4
8        14   Thursday             4
...
16       43    Tuesday             9
17       59  Wednesday            10
18       61  Wednesday            10

These week_counter values are then used in groupby to create groups of records, and transorm is used (to maintain the same shape as the original dataframe) taking both the first and last records of each group.

Community
  • 1
  • 1
Alexander
  • 105,104
  • 32
  • 201
  • 196
  • Aha, the use of 'shift', together with 'cumsum'. This is BRILLIANT. – user2165 Dec 07 '15 at 01:18
  • Just found it. I originally saw this compare-cumsum-groupby pattern from DSM in this post: http://stackoverflow.com/questions/33130586/python-pandas-creating-a-column-which-keeps-a-running-count-of-consecutive-val/33130915#33130915 – Alexander Dec 07 '15 at 01:21