0

I want to groupby my id column, and then sort by the year column, and then finally collapse the data in such a way that I have a list of years in the years column and a list of another data point in another column while preserving the order.

For example:

We would start with a data frame like

df = pandas.DataFrame( { 
    "id" : [2957, 1456, 2223, 1456, 1456, 2957, 2223, 2957, 2223] , 
    "year" : [2018, 2017, 2018, 2016, 2018, 2016, 2017, 2017, 2016],
    "rating" : [3, 1, 2, 2, 3, 2, 3, 1, 1] }

Which would appear in a data frame like

    id  year    rating
0   2957    2018    3
1   1456    2015    1
2   2223    2018    2
3   1456    2016    2
4   1456    2017    3
5   2957    2016    2
6   2223    2017    3
7   2957    2017    1
8   2223    2016    1

I would then like to sort the data by year ascending, group the data by id, and then condense the data in such a way that there would be a single row for each unique id where the year and rating columns would be collapsed into lists with their orders preserved.

The end result would look something like

    id          year         rating
0   2957 [2016, 2017, 2018] [2,1,3]
1   1456 [2015, 2016, 2017] [1,2,3]
2   2223 [2016, 2017, 2018] [1,3,2]

I understand how to group by and sort by characters, however, I am having trouble condensing the data into lists.

2 Answers2

0

You can use .groupby() + .agg(list):

out = df.sort_values(by="year").groupby("id").agg(list).reset_index()
print(out)

Prints:

     id                year     rating
0  1456  [2015, 2016, 2017]  [1, 2, 3]
1  2957  [2016, 2017, 2018]  [2, 1, 3]
2  2223  [2016, 2017, 2018]  [1, 3, 2]
Andrej Kesely
  • 168,389
  • 15
  • 48
  • 91
  • You really should be closing these as duplicates. They have been asked many times before... – Nick Aug 28 '23 at 00:09
0

Example

your example has typo

import pandas as pd
df = pd.DataFrame( { 
    "id" : [2957, 1456, 2223, 1456, 1456, 2957, 2223, 2957, 2223] , 
    "year" : [2018, 2017, 2018, 2016, 2018, 2016, 2017, 2017, 2016],
    "rating" : [3, 1, 2, 2, 3, 2, 3, 1, 1]})

Code

use groupby + agg + list

(df.sort_values('year')
   .groupby('id').agg(list)
   .reindex(df['id'].unique())
   .reset_index())

output:

    id      year                rating
0   2957    [2016, 2017, 2018]  [2, 1, 3]
1   1456    [2016, 2017, 2018]  [2, 1, 3]
2   2223    [2016, 2017, 2018]  [1, 3, 2]
Panda Kim
  • 6,246
  • 2
  • 12