1

My dataframe looks like this:

Country Code Duration
A        1     0
A        1     1
A        1     2
A        1     3
A        2     0
A        2     1
A        1     0
A        1     1
A        1     2

I need to get max values from a "Duration" column - not just a maximum value, but a list of maximum values for each sequence of numbers in this column. The output might look like this:

Country Code Duration
  A      1     3
  A      2     1
  A      1     2

I could have grouped by "Code", but its values are often repeating, so that's probably not an option. Any help or tips would be much appreciated.

  • The problem statement is a bit unclear and the provided output is confusing. Are you actually showing the desired output or just the form of your output? If it is the former, please explain how that output is obtained. If it is the latter, please provide an example of the actual desired output. – GZ0 Jun 08 '19 at 23:10
  • @GZ0 that's the desired output. Duration column consists of sequences of values from 0 to n, I need to get only these n values. For instance, the first sequence ranged from 0 to 3, so the first row in the output is 3 for Duration; the 2nd sequence ranged from 0 to 1, the 2nd row in the output is 1; the last sequence ranged from 0 to 2, so the last row in the output is 2 for Duration. – Alex Cranston Jun 08 '19 at 23:23

4 Answers4

2

First we create a mask to mark the sequences. Then we groupby to create the wanted output:

m = (~df['Code'].eq(df['Code'].shift())).cumsum()

df.groupby(m).agg({'Country':'first',
                   'Code':'first',
                   'Duration':'max'}).reset_index(drop=True)

  Country  Code  Duration
0       A     1         3
1       A     2         1
2       A     1         2
Erfan
  • 40,971
  • 8
  • 66
  • 78
  • I think `as_index=False` might be used in the `groupby` call to avoid the `reset_index` call at the end. – GZ0 Jun 08 '19 at 23:53
2

Using idxmax after create another group key by diff and cumsum

df.loc[df.groupby([df.Country,df.Code.diff().ne(0).cumsum()]).Duration.idxmax()]
  Country  Code  Duration
3       A     1         3
5       A     2         1
8       A     1         2
BENY
  • 317,841
  • 20
  • 164
  • 234
0

You might wanna check this link , it might be the answer you're looking for : pandas groupby where you get the max of one column and the min of another column . It goes as :

result = df.groupby(['Code', 'Country']).agg({'Duration':'max'})[['Duration']].reset_index()
Ayoub ZAROU
  • 2,387
  • 6
  • 20
0

The problem is slightly unclear. However, assuming that order is important, we can move toward a solution.

import pandas as pd
d = pd.read_csv('data.csv')

s = d.Code
d['series'] = s.ne(s.shift()).cumsum()
print(pd.DataFrame(d.groupby(['Country','Code','series'])['Duration'].max().reset_index()))

Returns:

 Country  Code  series  Duration
0       A     1       1         3
1       A     1       3         2
2       A     2       2         1

You can then drop the series.

Zerodf
  • 2,208
  • 18
  • 26