Filtering DataFrame based on its groups properties

Question

Let's say we have issue tracker logs and we want to find out issues owners (guys who logged the most time to the issue)

User can log time multiple times to the same issue
If 2 users log the same time, the are both owners

So we have some sample data:

df = pd.DataFrame([
        [1, 10, 'John'],
        [1, 20, 'John'],
        [1, 30, 'Tom'],
        [1, 10, 'Bob'],
        [2, 25, 'John'],
        [2, 15, 'Bob']], columns = ['IssueKey','TimeSpent','User'])

As the output we want something like this:

issues_owners = pd.DataFrame([
        [1, 30, 'John'],
        [1, 30, 'Tom'],
        [2, 25, 'John']], columns = ['IssueKey','TimeSpent','User'])

Both John and Tom are owners of issue 1, as they both spent 30 minutes on it.
John actually worked on issue 1 on 2 separate days
John is also the owner of the issue 2
Bob is lazy and doesn't own any issues :)

What I came up with feels quite disgusting (I'm relatively new to Python):

df = df.groupby(['IssueKey', 'User']).sum().reset_index()
maxTimesPerIssue = df.groupby('IssueKey')['TimeSpent'].max().reset_index()
maxTimesPerIssue = dict(zip(maxTimesPerIssue['IssueKey'], maxTimesPerIssue['TimeSpent']))
df['MaxTimePerIssue'] = [maxTimesPerIssue[key] for key in df['IssueKey']]
df = df[df.MaxTimePerIssue == df.TimeSpent]
df = df.drop(columns=['MaxTimePerIssue'])

What I dislike about my Python code:

maxTimesPerIssue appears in the middle of processing the df disrupting the thought process (or pipeline)
The creation of maxTimesPerIssue itself is kind of messy
Adding MaxTimePerIssue the df
It's definitely way less self-explanatory than the C# version, due to using lots of low level stuff like: reset_index(), list(), dict(), list comprehensions, dropping columns

Can anybody help me clean it up?

score 1 · Accepted Answer · answered Apr 30 '18 at 22:29

1

Something along the lines of a groupby will work for your data:

i = df.groupby(['IssueKey', 'User']).TimeSpent.sum()
j = i.groupby(level=0).transform('max')

i[i == j].reset_index()

   IssueKey  User  TimeSpent
0         1  John         30
1         1   Tom         30
2         2  John         25

answered Apr 30 '18 at 22:29

cs95

379,657
97
704
746

If anyone reading this answer has difficulties understanding the meaning of the `level` (like I did), take a look an my answer to the question about levels: https://stackoverflow.com/a/50110162/672018 – Andrzej Gis May 01 '18 at 01:10
@gisek apologies for the lack of clarity, although you could've just asked :-) I presume you already know by now but I'm leveraging the multiindex in the intermediate output and grouping by the 0th level of that index. – cs95 May 01 '18 at 01:48
I got so excited about how succinct your solution was, that I decided to do my own research instead of bothering you :) – Andrzej Gis May 01 '18 at 02:00
@gisek I'm happy to hear that. This site could use more proactive users who aren't afraid to learn instead of being spoon-fed all the time. Applaud your effort, and good luck to you! – cs95 May 01 '18 at 02:03

Filtering DataFrame based on its groups properties

1 Answers1

Linked