1

Let's say we have issue tracker logs and we want to find out issues owners (guys who logged the most time to the issue)

  1. User can log time multiple times to the same issue
  2. If 2 users log the same time, the are both owners

So we have some sample data:

df = pd.DataFrame([
        [1, 10, 'John'],
        [1, 20, 'John'],
        [1, 30, 'Tom'],
        [1, 10, 'Bob'],
        [2, 25, 'John'],
        [2, 15, 'Bob']], columns = ['IssueKey','TimeSpent','User'])

As the output we want something like this:

issues_owners = pd.DataFrame([
        [1, 30, 'John'],
        [1, 30, 'Tom'],
        [2, 25, 'John']], columns = ['IssueKey','TimeSpent','User'])
  1. Both John and Tom are owners of issue 1, as they both spent 30 minutes on it.
  2. John actually worked on issue 1 on 2 separate days
  3. John is also the owner of the issue 2
  4. Bob is lazy and doesn't own any issues :)

What I came up with feels quite disgusting (I'm relatively new to Python):

df = df.groupby(['IssueKey', 'User']).sum().reset_index()
maxTimesPerIssue = df.groupby('IssueKey')['TimeSpent'].max().reset_index()
maxTimesPerIssue = dict(zip(maxTimesPerIssue['IssueKey'], maxTimesPerIssue['TimeSpent']))
df['MaxTimePerIssue'] = [maxTimesPerIssue[key] for key in df['IssueKey']]
df = df[df.MaxTimePerIssue == df.TimeSpent]
df = df.drop(columns=['MaxTimePerIssue'])   

What I dislike about my Python code:

  1. maxTimesPerIssue appears in the middle of processing the df disrupting the thought process (or pipeline)
  2. The creation of maxTimesPerIssue itself is kind of messy
  3. Adding MaxTimePerIssue the df
  4. It's definitely way less self-explanatory than the C# version, due to using lots of low level stuff like: reset_index(), list(), dict(), list comprehensions, dropping columns

Can anybody help me clean it up?

cs95
  • 379,657
  • 97
  • 704
  • 746
Andrzej Gis
  • 13,706
  • 14
  • 86
  • 130

1 Answers1

1

Something along the lines of a groupby will work for your data:

i = df.groupby(['IssueKey', 'User']).TimeSpent.sum()
j = i.groupby(level=0).transform('max')

i[i == j].reset_index()

   IssueKey  User  TimeSpent
0         1  John         30
1         1   Tom         30
2         2  John         25
cs95
  • 379,657
  • 97
  • 704
  • 746
  • If anyone reading this answer has difficulties understanding the meaning of the `level` (like I did), take a look an my answer to the question about levels: https://stackoverflow.com/a/50110162/672018 – Andrzej Gis May 01 '18 at 01:10
  • @gisek apologies for the lack of clarity, although you could've just asked :-) I presume you already know by now but I'm leveraging the multiindex in the intermediate output and grouping by the 0th level of that index. – cs95 May 01 '18 at 01:48
  • I got so excited about how succinct your solution was, that I decided to do my own research instead of bothering you :) – Andrzej Gis May 01 '18 at 02:00
  • @gisek I'm happy to hear that. This site could use more proactive users who aren't afraid to learn instead of being spoon-fed all the time. Applaud your effort, and good luck to you! – cs95 May 01 '18 at 02:03