6

This is an extension to my question.

To make it simpler Lets suppose I have a pandas dataframe as following.

df = pd.DataFrame([[1.1, 1.1, 2.5, 2.6, 2.5, 3.4,2.6,2.6,3.4], list('AAABBBBAB'), [1.1, 1.7, 2.5, 2.6, 3.3, 3.8,4.0,4.2,4.3]]).T
df.columns = ['col1', 'col2','col3']

dataframe :

  col1 col2 col3
0  1.1    A  1.1
1  1.1    A  1.7
2  2.5    A  2.5
3  2.6    B  2.6
4  2.5    B  3.3
5  3.4    B  3.8
6  2.6    B    4
7  2.6    A  4.2
8  3.4    B  4.3

I want to group this based on some conditions. The logic is based on col1 col2 values and the cumulative difference of col3:

  1. Go to col1 and find other occurrences of the same value.
  2. In my case first value of col1 is '1.1' and again their is the same value at row2.
  3. Then check for col2 value, If they are similar, then get the cumulative difference of col 3.
  4. If the cumulative difference is greater than 0.5 then mark this as a new session.
  5. If col1 values are same but col2 values are different then mark them as new session

expected output:

   col1 col2 col3 session
0  1.1    A  1.1  0
1  1.1    A  1.7  1
2  2.5    A  2.5  2
3  2.6    B  2.6  4
4  2.5    B  3.3  3
5  3.4    B  3.8  7
6  2.6    B    4  5
7  2.6    A  4.2  6
8  3.4    B  4.3  7
Community
  • 1
  • 1
Nilani Algiriyage
  • 32,876
  • 32
  • 87
  • 121

1 Answers1

8

As in the excellent answer you linked to ;) first create the session number:

In [11]: g = df.groupby(['col1', 'col2'])

In [12]: df['session_number'] = g['col3'].apply(lambda s: (s - s.shift(1) > 0.5).fillna(0).cumsum(skipna=False))

Then I think you want to set_index of these columns, this may be enough for many usecases (though it might be worth doing a sort):

In [13]: df1 = df.set_index(['col1', 'col2', 'session_number'])

In [14]: df1
Out[14]:
                         col3
col1 col2 session_number
1.1  A    0               1.1
          1               1.7
2.5  A    0               2.5
2.6  B    0               2.6
2.5  B    0               3.3
3.4  B    0               3.8
2.6  B    1                 4
     A    0               4.2
3.4  B    0               4.3

If you really want you can grab out the session number :

In [15]: g1 = df.groupby(['col1', 'col2', 'session_number'])  # I think there is a slightly neater way, but I forget..

In [16]: df1['session'] = g1.apply(lambda x: 1).cumsum()  # could -1 here if it matters

In [17]: df1
Out[17]:
                         col3  session
col1 col2 session_number
1.1  A    0               1.1        1
          1               1.7        2
2.5  A    0               2.5        3
2.6  B    0               2.6        6
2.5  B    0               3.3        4
3.4  B    0               3.8        8
2.6  B    1                 4        7
     A    0               4.2        5
3.4  B    0               4.3        8

If you want this in columns (as in your question) the reset_index and you could delete the session column:

In [18]: df1.reset_index()
Out[18]:
   col1 col2  session_number col3  session
0   1.1    A               0  1.1        1
1   1.1    A               1  1.7        2
2   2.5    A               0  2.5        3
3   2.6    B               0  2.6        6
4   2.5    B               0  3.3        4
5   3.4    B               0  3.8        8
6   2.6    B               1    4        7
7   2.6    A               0  4.2        5
8   3.4    B               0  4.3        8
Andy Hayden
  • 359,921
  • 101
  • 625
  • 535