Python | Pandas DataFrame: Advanced Slicing/GroupBy

Question

I have been struggling with a pandas problem for a while now, and maybe someone can shed some new light on it :)

Consider the following pandas DataFrame, df:

Year Month Task TaskID TaskClass TaskClassID SomeValue
2019 11    A    1      X         10          6.58
2019 11    A    1      Y         20          1.58
2019 11    B    2      X         10          6.58
2019 11    B    2      Y         20          1.58

Objective: group by Task so that each Task gets a unique TaskClass observation (which Task gets which TaskClass is not important for this problem and can be considered random), like this:

Year Month Task TaskID TaskClass TaskClassID SomeValue
2019 11    A    1      X         10          6.58
2019 11    B    2      Y         20          1.58

or, for instance, this:

Year Month Task TaskID TaskClass TaskClassID SomeValue
2019 11    A    1      Y         20          1.58
2019 11    B    2      X         10          6.58

Other constraints: the final problem will have thousands of tasks and, more importantly, can have more TaskClass values per Task, something like this:

Year Month Task TaskID TaskClass TaskClassID SomeValue
2019 11    A    1      X         10          6.58
2019 11    A    1      Y         20          1.58
2019 11    A    1      Z         30          1.00
2019 11    A    1      W         40          0.25
2019 11    B    2      X         10          6.58
2019 11    B    2      Y         20          1.58
2019 11    B    2      Z         30          1.00
2019 11    B    2      W         40          0.25

Thank you all, in advance.

score 0 · Answer 1 · answered Feb 19 '20 at 15:20

Why not use drop_duplicates?

More here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html

Assume a dataframe like so:

import pandas as pd

data = pd.DataFrame({
    'Task Class': ['x', 'x', 'y', 'z', 'y', 'z'],
    'Value': [1, 2, 3, 4, 5, 6],
})

  Task Class  Value
0          x      1
1          x      2
2          y      3
3          z      4
4          y      5
5          z      6

We can do:

data.drop_duplicates(['Task Class'], inplace=True)

And get:

  Task Class  Value
0          x      1
2          y      3
3          z      4
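
Note that on the question's frame, calling drop_duplicates on the TaskClass column alone would keep the X and Y rows that both belong to Task A. If the goal is read as "one row per Task, with no TaskClass reused across the kept rows", a single greedy pass is one possible sketch. The helper name pick_distinct_pairs is hypothetical, the frame is rebuilt from the question's first table, and a unique index is assumed. This is a greedy selection, not a guaranteed maximum matching, and when there are more tasks than classes some tasks will be left without a row.

```python
import pandas as pd

# The question's first example table, rebuilt by hand.
df = pd.DataFrame({
    "Year": [2019] * 4,
    "Month": [11] * 4,
    "Task": ["A", "A", "B", "B"],
    "TaskID": [1, 1, 2, 2],
    "TaskClass": ["X", "Y", "X", "Y"],
    "TaskClassID": [10, 20, 10, 20],
    "SomeValue": [6.58, 1.58, 6.58, 1.58],
})


def pick_distinct_pairs(frame, task_col="Task", class_col="TaskClass"):
    """Hypothetical helper: keep a row only if neither its task nor its
    class has been used by an earlier kept row (greedy, order-dependent).
    Assumes the frame's index is unique."""
    used_tasks, used_classes, keep = set(), set(), []
    for idx, task, cls in zip(frame.index, frame[task_col], frame[class_col]):
        if task not in used_tasks and cls not in used_classes:
            used_tasks.add(task)
            used_classes.add(cls)
            keep.append(idx)
    return frame.loc[keep]


result = pick_distinct_pairs(df)
print(result[["Task", "TaskClass", "SomeValue"]])
```

Because the pass depends on row order, shuffling first with df.sample(frac=1) should give a different, random assignment on each run.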
